better atomics

Started by Robert Haas · about 12 years ago · 125 messages
#1 Robert Haas
robertmhaas@gmail.com

On Wed, Oct 16, 2013 at 12:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Partially that only works because we sprinkle 'volatile's in lots of
places. That can actually hurt performance... it'd usually be something
like:
#define pg_atomic_load(atomic) (*(volatile uint32 *)(atomic))

Even if not needed in some places because a variable is already
volatile, it's very helpful in pointing out what happens and where you
need to be careful.

That's fine with me.

* uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val)
* bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval)
* uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add)
* uint32 pg_atomic_fetch_sub_u32(uint32 *ptr, uint32 add)
* uint32 pg_atomic_fetch_and_u32(uint32 *ptr, uint32 add)
* uint32 pg_atomic_fetch_or_u32(uint32 *ptr, uint32 add)

Do we really need all of those? Compare-and-swap is clearly needed,
and fetch-and-add is also very useful. I'm not sure about the rest.
Not that I necessarily object to having them, but it will be a lot
more work.

Well, _sub can clearly be implemented with _add generically. I find code
using _sub much easier to read than _add(-whatever), but that's maybe
just me.

Yeah, I wouldn't bother with _sub.

But _and, _or are really useful because they can be used to atomically
clear and set flag bits.

Until we have some concrete application that requires this
functionality for adequate performance, I'd be inclined not to bother.
I think we have bigger fish to fry, and (to extend my nautical
metaphor) trying to get this working everywhere might amount to trying
to boil the ocean.

* u64 variants of the above
* bool pg_atomic_test_set(void *ptr)
* void pg_atomic_clear(void *ptr)

What are the intended semantics here?

It's basically TAS() without defining whether the value that marks the
flag as set is a 1 or a 0. Gcc provides this, and I think we should make
our spinlock implementation use it if there is no special-cased
implementation available.
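
For illustration, a rough sketch of what that could look like on top of
gcc's builtins (int-based flag and placeholder names, not the final API):

    static inline bool
    pg_atomic_test_set(volatile int *ptr)
    {
        /* returns the previous contents; 0 means the flag was clear and is now ours */
        return __sync_lock_test_and_set(ptr, 1) == 0;
    }

    static inline void
    pg_atomic_clear(volatile int *ptr)
    {
        /* stores 0 again, releasing the flag */
        __sync_lock_release(ptr);
    }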

I'll reserve judgement until I see the patch.

More general question: how do we move the ball down the field in this
area? I'm willing to do some of the legwork, but I doubt I can do all
of it, and I really think we need to make some progress here soon, as
it seems that you and I and Ants are all running into the same
problems in slightly different ways. What's our strategy for getting
something done here?

That's a pretty good question.

I am willing to write the gcc implementation, plus the generic wrappers
and provide the infrastructure to override it per-platform. I won't be
able to do anything about non-linux, non-gcc platforms in that timeframe
though.

I was thinking of something like:
include/storage/atomic.h
include/storage/atomic-gcc.h
include/storage/atomic-msvc.h
include/storage/atomic-acc-ia64.h
where atomic.h first has a list of #ifdefs including the specialized
files and then lots of #ifndefs providing generic variants if not
already provided by the platform-specific file.
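
As a rough, hypothetical skeleton of that structure (file and macro names
purely illustrative):

    /* atomic.h: pick a platform-specific implementation first */
    #if defined(__GNUC__)
    #include "storage/atomic-gcc.h"
    #elif defined(_MSC_VER)
    #include "storage/atomic-msvc.h"
    #endif

    /* then generic fallbacks, used only if the platform file didn't define them */
    #ifndef pg_atomic_fetch_sub_u32
    #define pg_atomic_fetch_sub_u32(ptr, val) \
        pg_atomic_fetch_add_u32((ptr), -(int32) (val))
    #endif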

Seems like we might want to put some of this in one of the various
directories called "port", or maybe a new subdirectory of one of them.
This basically sounds sane, but I'm unwilling to commit to dropping
support for all obscure platforms we currently support unless there
are about 500 people +1ing the idea, so I think we need to think about
what happens when we don't have platform support for these primitives.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#2 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#1)
Re: better atomics

On 2013-10-16 14:30:34 -0400, Robert Haas wrote:

But _and, _or are really useful because they can be used to atomically
clear and set flag bits.

Until we have some concrete application that requires this
functionality for adequate performance, I'd be inclined not to bother.
I think we have bigger fish to fry, and (to extend my nautical
metaphor) trying to get this working everywhere might amount to trying
to boil the ocean.

Well, I actually have a very, very preliminary patch using them in the
course of removing the spinlocks in PinBuffer() (via LockBufHdr()).

I think it's also going to be easier to add all required atomics at once
than having to go over several platforms in independent feature
patches adding atomics one by one.

I was thinking of something like:
include/storage/atomic.h
include/storage/atomic-gcc.h
include/storage/atomic-msvc.h
include/storage/atomic-acc-ia64.h
where atomic.h first has a list of #ifdefs including the specialized
files and then lots of #ifndefs providing generic variants if not
already provided by the platform-specific file.

Seems like we might want to put some of this in one of the various
directories called "port", or maybe a new subdirectory of one of them.

Hm. It'd have to be src/include/port, right? Not sure if putting some
files there will be cleaner, but I don't mind it either.

This basically sounds sane, but I'm unwilling to commit to dropping
support for all obscure platforms we currently support unless there
are about 500 people +1ing the idea, so I think we need to think about
what happens when we don't have platform support for these primitives.

I think the introduction of the atomics really isn't much dependent on
the removal. The only thing I'd touch around platforms in that patch is
adding a generic fallback pg_atomic_test_and_set() to s_lock.h and
remove the special case usages of __sync_lock_test_and_set() (Arm64,
some 32 bit arms).
It's the patches that would like to rely on atomics that will have the
problem :(

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: better atomics

Andres Freund <andres@2ndquadrant.com> writes:

On 2013-10-16 14:30:34 -0400, Robert Haas wrote:

But _and, _or are really useful because they can be used to atomically
clear and set flag bits.

Until we have some concrete application that requires this
functionality for adequate performance, I'd be inclined not to bother.
I think we have bigger fish to fry, and (to extend my nautical
metaphor) trying to get this working everywhere might amount to trying
to boil the ocean.

Well, I actually have a very, very preliminary patch using them in the
course of removing the spinlocks in PinBuffer() (via LockBufHdr()).

I think we need to be very, very wary of making our set of required
atomics any larger than it absolutely has to be. The more stuff we
require, the closer we're going to be to making PG a program that only
runs well on Intel. I am not comforted by the "gcc will provide good
implementations of the atomics everywhere" argument, because in fact
it won't. See for example the comment associated with our only existing
use of gcc atomics:

* On ARM, we use __sync_lock_test_and_set(int *, int) if available, and if
* not fall back on the SWPB instruction. SWPB does not work on ARMv6 or
* later, so the compiler builtin is preferred if available. Note also that
* the int-width variant of the builtin works on more chips than other widths.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's not a track record that should inspire much faith that complete
sets of atomics will exist elsewhere. What's more, we don't just need
atomics that *work*, we need atomics that are *fast*, else the whole
exercise turns into pessimization not improvement. The more atomic ops
we depend on, the more likely it is that some of them will involve kernel
support on some chips, destroying whatever performance improvement we
hoped to get.

... The only thing I'd touch around platforms in that patch is
adding a generic fallback pg_atomic_test_and_set() to s_lock.h and
remove the special case usages of __sync_lock_test_and_set() (Arm64,
some 32 bit arms).

Um, if we had a "generic" pg_atomic_test_and_set(), s_lock.h would be
about one line long. The above quote seems to me to be exactly the kind
of optimism that is not warranted in this area.

regards, tom lane


#4 Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: better atomics

On 2013-10-16 22:39:07 +0200, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2013-10-16 14:30:34 -0400, Robert Haas wrote:

But _and, _or are really useful because they can be used to atomically
clear and set flag bits.

Until we have some concrete application that requires this
functionality for adequate performance, I'd be inclined not to bother.
I think we have bigger fish to fry, and (to extend my nautical
metaphor) trying to get this working everywhere might amount to trying
to boil the ocean.

Well, I actually have a very, very preliminary patch using them in the
course of removing the spinlocks in PinBuffer() (via LockBufHdr()).

I think we need to be very, very wary of making our set of required
atomics any larger than it absolutely has to be. The more stuff we
require, the closer we're going to be to making PG a program that only
runs well on Intel.

Well, essentially I am advocating that we support three basic operations:
* compare and swap
* atomic add (and thereby sub)
* test and set

The other operations are just porcelain around that.

With compare-and-swap, both of the others can easily be implemented if
necessary.
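
For example, a fetch-and-or can be sketched as a compare-and-swap loop
(gcc builtin used purely for illustration):

    static inline uint32
    pg_atomic_fetch_or_u32(volatile uint32 *ptr, uint32 or_)
    {
        uint32 old;

        do
        {
            old = *ptr;     /* snapshot the current value */
        } while (!__sync_bool_compare_and_swap(ptr, old, old | or_));

        return old;         /* fetch_* variants return the old value */
    }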

Note that e.g. Linux - which runs on all the platforms we're talking
about except VAX - uses atomic operations extensively and
unconditionally. So I think we don't have to be too afraid about
non-Intel performance here.

I am not comforted by the "gcc will provide good
implementations of the atomics everywhere" argument, because in fact
it won't. See for example the comment associated with our only existing
use of gcc atomics:

* On ARM, we use __sync_lock_test_and_set(int *, int) if available, and if
* not fall back on the SWPB instruction. SWPB does not work on ARMv6 or
* later, so the compiler builtin is preferred if available. Note also that
* the int-width variant of the builtin works on more chips than other widths.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's not a track record that should inspire much faith that complete
sets of atomics will exist elsewhere. What's more, we don't just need
atomics that *work*, we need atomics that are *fast*, else the whole
exercise turns into pessimization not improvement. The more atomic ops
we depend on, the more likely it is that some of them will involve kernel
support on some chips, destroying whatever performance improvement we
hoped to get.

Hm. I can't really follow. We *prefer* to use __sync_lock_test_and_set
over our own open-coded implementation, right? And the comment about some
hardware only supporting the int-width variant squares with my wanting to
require only u32 atomics support, right?

I completely agree that we cannot rely on 8byte math or similar (16byte
cmpxchg) to be supported by all platforms. That indeed would require
kernel fallbacks on e.g. ARM.

... The only thing I'd touch around platforms in that patch is
adding a generic fallback pg_atomic_test_and_set() to s_lock.h and
remove the special case usages of __sync_lock_test_and_set() (Arm64,
some 32 bit arms).

Um, if we had a "generic" pg_atomic_test_and_set(), s_lock.h would be
about one line long. The above quote seems to me to be exactly the kind
of optimism that is not warranted in this area.

Well, I am somewhat hesitant to change s_lock for more platforms than
necessary. So I proposed to restructure it in a way that leaves all
existing implementations in place that do not already rely on
__sync_lock_test_and_set().

There's also SPIN_DELAY btw, which I don't see any widely provided
intrinsics for.
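
On x86 that currently boils down to open-coded inline assembly, roughly:

    static inline void
    pg_spin_delay(void)
    {
        /* "rep; nop" encodes the PAUSE instruction, a hint for spin-wait loops */
        __asm__ __volatile__(" rep; nop");
    }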

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#5 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#1)
2 attachment(s)
Re: better atomics - v0.2

Hi,

This is the first real submission of how I think our atomics
architecture should look. It's definitely a good bit away from complete,
so please view it from that POV!
The most important thing I need at this point is feedback on whether the
API in general looks ok and what should be changed, before I spend the
time to implement it nicely everywhere.

The API is loosely modeled on C11's atomics library
(http://en.cppreference.com/w/c/atomic), at least to the degree that the
latter could be used to implement pg's atomics.
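
For comparison, the C11 interface this is modeled on looks roughly like
this (shown only as a reference, since we can't require C11):

    #include <stdatomic.h>

    static atomic_uint counter = ATOMIC_VAR_INIT(0);

    static void
    bump(void)
    {
        unsigned int expected = 0;

        atomic_fetch_add(&counter, 1);                            /* fetch-and-add */
        atomic_compare_exchange_strong(&counter, &expected, 42);  /* CAS */
    }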

Please note:
* Only gcc's implementation is tested. I'd be really surprised if it
compiled on anything else.
* I've currently replaced nearly all of s_lock.h with uses of the
atomics API. We might rather use its contents to implement the atomics
'flag' operations on some of the weirder platforms out there.
* I've integrated barrier.h into the atomics code. It's pretty much a
required part, and compiler/arch specific so that seems to make sense.
* 'autoreconf' needs to be run after applying the patchset. I ran into
some problems with autoconf 2.63 here, and using 2.69 makes the diff too
big.

Questions:
* What do you like/dislike about the API (storage/atomics.h)?
* decide whether it's ok to rely on inline functions or whether we need
to provide out-of-line versions. I think we should just bite the
bullet and require support. Some very old arches might need to live with
warnings. Patch 01 tries to silence them on HP-UX.
* decide whether we want to keep the semaphore fallback for
spinlocks. I strongly vote for removing it. The patchset provides
compiler based default implementations for atomics, so it's unlikely
to be helpful when bringing up a new platform.

TODO:
* actually test on platforms other than new gcc
* We should probably move most of the (documentation) content of
s_lock.h somewhere else.
* Integrate the spinlock-based fallback for platforms that only provide
something like TAS().

Comments?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Try-to-make-HP-UX-s-ac-not-cause-warnings-about-unus.patch (text/x-patch; charset=us-ascii)
From 71e929cb4de63f060464feb9d73cad1161807253 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 15 Nov 2013 16:52:02 +0100
Subject: [PATCH 1/4] Try to make HP-UX's ac++ not cause warnings about unused
 static inlines.

---
 src/template/hpux | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/template/hpux b/src/template/hpux
index ce4d93c..6349ae5 100644
--- a/src/template/hpux
+++ b/src/template/hpux
@@ -3,7 +3,9 @@
 CPPFLAGS="$CPPFLAGS -D_XOPEN_SOURCE_EXTENDED"
 
 if test "$GCC" != yes ; then
-  CC="$CC -Ae"
+  # -Ae enables C89/C99
+  # +W2177 should disable warnings about unused static inlines
+  CC="$CC -Ae +W2177"
   CFLAGS="+O2"
 fi
 
-- 
1.8.5.rc1.dirty

0002-Very-basic-atomic-ops-implementation.patch (text/x-patch; charset=iso-8859-1)
From ea80fc0c988b1d51816b2fd4b9d2bba2e7aead05 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 6 Nov 2013 10:32:53 +0100
Subject: [PATCH 2/4] Very basic atomic ops implementation

Only gcc has been tested, stub implementations for
* msvc
* HUPX acc
* IBM xlc
* Sun/Oracle Sunpro compiler
are included.

All implementations only provide the bare minimum of operations for
now and rely on emulating more complex operations via
compare_and_swap().

Most of s_lock.h is removed and replaced by usage of the atomics ops.
---
 config/c-compiler.m4                             |  77 ++
 configure.in                                     |  19 +-
 src/backend/storage/lmgr/Makefile                |   2 +-
 src/backend/storage/lmgr/atomics.c               |  26 +
 src/backend/storage/lmgr/spin.c                  |   8 -
 src/include/c.h                                  |   7 +
 src/include/storage/atomics-arch-alpha.h         |  25 +
 src/include/storage/atomics-arch-amd64.h         |  56 ++
 src/include/storage/atomics-arch-arm.h           |  29 +
 src/include/storage/atomics-arch-hppa.h          |  17 +
 src/include/storage/atomics-arch-i386.h          |  96 +++
 src/include/storage/atomics-arch-ia64.h          |  27 +
 src/include/storage/atomics-arch-ppc.h           |  25 +
 src/include/storage/atomics-generic-acc.h        |  99 +++
 src/include/storage/atomics-generic-gcc.h        | 174 +++++
 src/include/storage/atomics-generic-msvc.h       |  67 ++
 src/include/storage/atomics-generic-sunpro.h     |  66 ++
 src/include/storage/atomics-generic-xlc.h        |  84 +++
 src/include/storage/atomics-generic.h            | 402 +++++++++++
 src/include/storage/atomics.h                    | 543 ++++++++++++++
 src/include/storage/barrier.h                    | 137 +---
 src/include/storage/s_lock.h                     | 880 +----------------------
 src/test/regress/expected/lock.out               |   8 +
 src/test/regress/input/create_function_1.source  |   5 +
 src/test/regress/output/create_function_1.source |   4 +
 src/test/regress/regress.c                       | 213 ++++++
 src/test/regress/sql/lock.sql                    |   5 +
 27 files changed, 2078 insertions(+), 1023 deletions(-)
 create mode 100644 src/backend/storage/lmgr/atomics.c
 create mode 100644 src/include/storage/atomics-arch-alpha.h
 create mode 100644 src/include/storage/atomics-arch-amd64.h
 create mode 100644 src/include/storage/atomics-arch-arm.h
 create mode 100644 src/include/storage/atomics-arch-hppa.h
 create mode 100644 src/include/storage/atomics-arch-i386.h
 create mode 100644 src/include/storage/atomics-arch-ia64.h
 create mode 100644 src/include/storage/atomics-arch-ppc.h
 create mode 100644 src/include/storage/atomics-generic-acc.h
 create mode 100644 src/include/storage/atomics-generic-gcc.h
 create mode 100644 src/include/storage/atomics-generic-msvc.h
 create mode 100644 src/include/storage/atomics-generic-sunpro.h
 create mode 100644 src/include/storage/atomics-generic-xlc.h
 create mode 100644 src/include/storage/atomics-generic.h
 create mode 100644 src/include/storage/atomics.h

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 4ba3236..c24cb59 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -289,3 +289,80 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_PROG_CC_LDFLAGS_OPT
+
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_ATOMICS],
+[AC_CACHE_CHECK(for __builtin_constant_p, pgac_cv__builtin_constant_p,
+[AC_TRY_COMPILE([static int x; static int y[__builtin_constant_p(x) ? x : 1];],
+[],
+[pgac_cv__builtin_constant_p=yes],
+[pgac_cv__builtin_constant_p=no])])
+if test x"$pgac_cv__builtin_constant_p" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CONSTANT_P, 1,
+          [Define to 1 if your compiler understands __builtin_constant_p.])
+fi])# PGAC_C_BUILTIN_CONSTANT_P
+
+# PGAC_HAVE_GCC__SYNC_INT32_TAS
+# -------------------------
+# Check if the C compiler understands __sync_lock_test_and_set(),
+# and define HAVE_GCC__SYNC_INT32_TAS
+#
+# NB: There are platforms where test_and_set is available but compare_and_swap
+# is not, so test this separately.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_TAS],
+[AC_CACHE_CHECK(for builtin locking functions, pgac_cv_gcc_sync_int32_tas,
+[AC_TRY_LINK([],
+  [int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);],
+  [pgac_cv_gcc_sync_int32_tas="yes"],
+  [pgac_cv_gcc_sync_int32_tas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_TAS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_TAS
+
+# PGAC_HAVE_GCC__SYNC_INT32_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 32bit
+# types, and define HAVE_GCC__SYNC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __sync int32 atomic operations, pgac_cv_gcc_sync_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);],
+  [pgac_cv_gcc_sync_int32_cas="yes"],
+  [pgac_cv_gcc_sync_int32_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int *, int, int).])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_CAS
+
+# PGAC_HAVE_GCC__SYNC_INT64_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 64bit
+# types, and define HAVE_GCC__SYNC_INT64_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT64_CAS],
+[AC_CACHE_CHECK(for builtin __sync int64 atomic operations, pgac_cv_gcc_sync_int64_cas,
+[AC_TRY_LINK([],
+  [PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);],
+  [pgac_cv_gcc_sync_int64_cas="yes"],
+  [pgac_cv_gcc_sync_int64_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT64_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64).])
+fi])# PGAC_HAVE_GCC__SYNC_INT64_CAS
+
+# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+# -------------------------
+
+# Check if the C compiler understands __atomic_compare_exchange_n() for 32bit
+# types, and define HAVE_GCC__ATOMIC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __atomic int32 atomic operations, pgac_cv_gcc_atomic_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);],
+  [pgac_cv_gcc_atomic_int32_cas="yes"],
+  [pgac_cv_gcc_atomic_int32_cas="no"])])
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT32_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
diff --git a/configure.in b/configure.in
index 304399e..1fab189 100644
--- a/configure.in
+++ b/configure.in
@@ -19,7 +19,7 @@ m4_pattern_forbid(^PGAC_)dnl to catch undefined macros
 
 AC_INIT([PostgreSQL], [9.4devel], [pgsql-bugs@postgresql.org])
 
-m4_if(m4_defn([m4_PACKAGE_VERSION]), [2.63], [], [m4_fatal([Autoconf version 2.63 is required.
+m4_if(m4_defn([m4_PACKAGE_VERSION]), [2.69], [], [m4_fatal([Autoconf version 2.63 is required.
 Untested combinations of 'autoconf' and PostgreSQL versions are not
 recommended.  You can remove the check from 'configure.in' but it is then
 your responsibility whether the result works or not.])])
@@ -1453,17 +1453,6 @@ fi
 AC_CHECK_FUNCS([strtoll strtoq], [break])
 AC_CHECK_FUNCS([strtoull strtouq], [break])
 
-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
-  [int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);],
-  [pgac_cv_gcc_int_atomics="yes"],
-  [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -1738,6 +1727,12 @@ AC_CHECK_TYPES([int8, uint8, int64, uint64], [], [],
 # C, but is missing on some old platforms.
 AC_CHECK_TYPES(sig_atomic_t, [], [], [#include <signal.h>])
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+PGAC_HAVE_GCC__SYNC_INT32_TAS
+PGAC_HAVE_GCC__SYNC_INT32_CAS
+PGAC_HAVE_GCC__SYNC_INT64_CAS
+PGAC_HAVE_GCC__ATOMIC_INT32_CAS
 
 if test "$PORTNAME" != "win32"
 then
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e12a854..be95d50 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/storage/lmgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o spin.o s_lock.o predicate.o
+OBJS = atomics.o lmgr.o lock.o proc.o deadlock.o lwlock.o spin.o s_lock.o predicate.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/storage/lmgr/atomics.c b/src/backend/storage/lmgr/atomics.c
new file mode 100644
index 0000000..af95153
--- /dev/null
+++ b/src/backend/storage/lmgr/atomics.c
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.c
+ *	   Non-Inline parts of the atomics implementation
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/lmgr/atomics.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/*
+ * We want the functions below to be inline; but if the compiler doesn't
+ * support that, fall back on providing them as regular functions.	See
+ * STATIC_IF_INLINE in c.h.
+ */
+#define ATOMICS_INCLUDE_DEFINITIONS
+
+#include "storage/atomics.h"
+
+#ifdef PG_USE_ATOMICS_EMULATION
+#endif
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index f054be8..469dd84 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -86,14 +86,6 @@ s_unlock_sema(volatile slock_t *lock)
 	PGSemaphoreUnlock((PGSemaphore) lock);
 }
 
-bool
-s_lock_free_sema(volatile slock_t *lock)
-{
-	/* We don't currently use S_LOCK_FREE anyway */
-	elog(ERROR, "spin.c does not support S_LOCK_FREE()");
-	return false;
-}
-
 int
 tas_sema(volatile slock_t *lock)
 {
diff --git a/src/include/c.h b/src/include/c.h
index 6e19c6d..347f394 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -866,6 +866,13 @@ typedef NameData *Name;
 #define STATIC_IF_INLINE
 #endif   /* PG_USE_INLINE */
 
+#ifdef PG_USE_INLINE
+#define STATIC_IF_INLINE_DECLARE static inline
+#else
+#define STATIC_IF_INLINE_DECLARE extern
+#endif   /* PG_USE_INLINE */
+
+
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
  * ----------------------------------------------------------------
diff --git a/src/include/storage/atomics-arch-alpha.h b/src/include/storage/atomics-arch-alpha.h
new file mode 100644
index 0000000..06b7125
--- /dev/null
+++ b/src/include/storage/atomics-arch-alpha.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-alpha.h
+ *	  Atomic operations considerations specific to Alpha
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-alpha.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(__GNUC__)
+/*
+ * Unlike all other known architectures, Alpha allows dependent reads to be
+ * reordered, but we don't currently find it necessary to provide a conditional
+ * read barrier to cover that case.  We might need to add that later.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
+#endif
diff --git a/src/include/storage/atomics-arch-amd64.h b/src/include/storage/atomics-arch-amd64.h
new file mode 100644
index 0000000..90f4abc
--- /dev/null
+++ b/src/include/storage/atomics-arch-amd64.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-amd64.h
+ *	  Atomic operations considerations specific to intel/amd x86-64
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-amd64.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * x86_64 has similar ordering characteristics to i386.
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered.  In those cases we'd need
+ * the read and write barriers to be lfence and sfence.  But since we don't
+ * do those things, a compiler barrier should be enough.
+ */
+#if defined(__INTEL_COMPILER)
+#	define pg_memory_barrier_impl()		_mm_mfence()
+#elif defined(__GNUC__)
+#	define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
+#endif
+
+#define pg_read_barrier_impl()		pg_compiler_barrier()
+#define pg_write_barrier_impl()		pg_compiler_barrier()
+
+#if defined(__GNUC__)
+
+#ifndef PG_HAS_SPIN_DELAY
+#define PG_HAS_SPIN_DELAY
+static __inline__ void
+pg_spin_delay_impl(void)
+{
+	/*
+	 * Adding a PAUSE in the spin delay loop is demonstrably a no-op on
+	 * Opteron, but it may be of some use on EM64T, so we keep it.
+	 */
+	__asm__ __volatile__(
+		" rep; nop			\n");
+}
+#endif
+
+#elif defined(WIN32_ONLY_COMPILER)
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	_mm_pause();
+}
+#endif
diff --git a/src/include/storage/atomics-arch-arm.h b/src/include/storage/atomics-arch-arm.h
new file mode 100644
index 0000000..a00f8f5
--- /dev/null
+++ b/src/include/storage/atomics-arch-arm.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-arm.h
+ *	  Atomic operations considerations specific to ARM
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-arm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+
+/*
+ * FIXME: Disable 32bit atomics for ARMs before v6 or error out
+ */
+
+/*
+ * 64 bit atomics on arm are implemented using kernel fallbacks and might be
+ * slow, so disable entirely for now.
+ */
+#define PG_DISABLE_64_BIT_ATOMICS
diff --git a/src/include/storage/atomics-arch-hppa.h b/src/include/storage/atomics-arch-hppa.h
new file mode 100644
index 0000000..8913801
--- /dev/null
+++ b/src/include/storage/atomics-arch-hppa.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-alpha.h
+ *	  Atomic operations considerations specific to HPPA
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-hppa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* HPPA doesn't do either read or write reordering */
+#define pg_memory_barrier_impl()		pg_compiler_barrier_impl()
diff --git a/src/include/storage/atomics-arch-i386.h b/src/include/storage/atomics-arch-i386.h
new file mode 100644
index 0000000..151a14f
--- /dev/null
+++ b/src/include/storage/atomics-arch-i386.h
@@ -0,0 +1,96 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-i386.h
+ *	  Atomic operations considerations specific to intel x86
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-i386.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * i386 does not allow loads to be reordered with other loads, or stores to be
+ * reordered with other stores, but a load can be performed before a subsequent
+ * store.
+ *
+ * "lock; addl" has worked for longer than "mfence".
+ */
+
+#if defined(__INTEL_COMPILER)
+#define pg_memory_barrier_impl()		_mm_mfence()
+#elif defined(__GNUC__)
+#define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
+#endif
+
+#define pg_read_barrier_impl()		pg_compiler_barrier()
+#define pg_write_barrier_impl()		pg_compiler_barrier()
+
+#if defined(__GNUC__)
+
+/*
+ * Implement TAS if not available as a __sync builtin on very old gcc
+ * versions.
+ */
+#ifndef HAVE_GCC__SYNC_INT32_TAS
+static __inline__ int
+tas(volatile slock_t *lock)
+	register slock_t _res = 1;
+
+	__asm__ __volatile__(
+		"	lock			\n"
+		"	xchgb	%0,%1	\n"
+:		"+q"(_res), "+m"(*lock)
+:
+:		"memory", "cc");
+	return (int) _res;
+}
+#endif
+
+
+#ifndef PG_HAS_SPIN_DELAY
+#define PG_HAS_SPIN_DELAY
+static __inline__ void
+pg_spin_delay_impl(void)
+{
+	/*
+	 * This sequence is equivalent to the PAUSE instruction ("rep" is
+	 * ignored by old IA32 processors if the following instruction is
+	 * not a string operation); the IA-32 Architecture Software
+	 * Developer's Manual, Vol. 3, Section 7.7.2 describes why using
+	 * PAUSE in the inner loop of a spin lock is necessary for good
+	 * performance:
+	 *
+	 *     The PAUSE instruction improves the performance of IA-32
+	 *     processors supporting Hyper-Threading Technology when
+	 *     executing spin-wait loops and other routines where one
+	 *     thread is accessing a shared lock or semaphore in a tight
+	 *     polling loop. When executing a spin-wait loop, the
+	 *     processor can suffer a severe performance penalty when
+	 *     exiting the loop because it detects a possible memory order
+	 *     violation and flushes the core processor's pipeline. The
+	 *     PAUSE instruction provides a hint to the processor that the
+	 *     code sequence is a spin-wait loop. The processor uses this
+	 *     hint to avoid the memory order violation and prevent the
+	 *     pipeline flush. In addition, the PAUSE instruction
+	 *     de-pipelines the spin-wait loop to prevent it from
+	 *     consuming execution resources excessively.
+	 */
+	__asm__ __volatile__(
+		" rep; nop			\n");
+}
+#endif
+
+#elif defined(WIN32_ONLY_COMPILER)
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	/* See comment for gcc code. Same code, MASM syntax */
+	__asm rep nop;
+}
+#endif
diff --git a/src/include/storage/atomics-arch-ia64.h b/src/include/storage/atomics-arch-ia64.h
new file mode 100644
index 0000000..2863431
--- /dev/null
+++ b/src/include/storage/atomics-arch-ia64.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-ia64.h
+ *	  Atomic operations considerations specific to intel itanium
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-ia64.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#if defined(__INTEL_COMPILER)
+#	define pg_memory_barrier_impl()		__mf()
+#elif defined(__GNUC__)
+#	define pg_memory_barrier_impl()		__asm__ __volatile__ ("mf" : : : "memory")
+#elif defined(__hpux)
+#	define pg_compiler_barrier_impl()	_Asm_sched_fence()
+#	define pg_memory_barrier_impl()		_Asm_mf()
+#endif
diff --git a/src/include/storage/atomics-arch-ppc.h b/src/include/storage/atomics-arch-ppc.h
new file mode 100644
index 0000000..f785c99
--- /dev/null
+++ b/src/include/storage/atomics-arch-ppc.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-ppc.h
+ *	  Atomic operations considerations specific to PowerPC
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-ppc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(__GNUC__)
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
+#define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#endif
diff --git a/src/include/storage/atomics-generic-acc.h b/src/include/storage/atomics-generic-acc.h
new file mode 100644
index 0000000..b1c9200
--- /dev/null
+++ b/src/include/storage/atomics-generic-acc.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-acc.h
+ *	  Atomic operations support when using HPs acc on HPUX
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * inline assembly for Itanium-based HP-UX:
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf
+ * * Implementing Spinlocks on the Intel ® Itanium ® Architecture and PA-RISC
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf
+ *
+ * Itanium only supports a small set of numbers (-16, -8, -4, -1, 1, 4, 8, 16)
+ * for atomic add/sub, so we just implement everything but compare_exchange via
+ * the compare_exchange fallbacks in atomics-generic.h.
+ *
+ * src/include/storage/atomics-generic-acc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <machine/sys/inline.h>
+
+/* IA64 always has 32/64 bit atomics */
+
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+typedef pg_atomic_uint32 pg_atomic_flag;
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define MINOR_FENCE (_Asm_fence) (_UP_CALL_FENCE | _UP_SYS_FENCE | \
+								 _DOWN_CALL_FENCE | _DOWN_SYS_FENCE )
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	/*
+	 * We want a barrier, not just release/acquire semantics.
+	 */
+	_Asm_mf();
+	/*
+	 * Notes:
+	 * DOWN_MEM_FENCE | _UP_MEM_FENCE prevents reordering by the compiler
+	 */
+	current =  _Asm_cmpxchg(_SZ_W, /* word */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	_Asm_mf();
+	current =  _Asm_cmpxchg(_SZ_D, /* doubleword */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#undef MINOR_FENCE
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-gcc.h b/src/include/storage/atomics-generic-gcc.h
new file mode 100644
index 0000000..0cc0b16
--- /dev/null
+++ b/src/include/storage/atomics-generic-gcc.h
@@ -0,0 +1,174 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-gcc.h
+ *	  Atomic operations, implemented using gcc (or compatible) intrinsics.
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Legacy __sync Built-in Functions for Atomic Memory Access
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fsync-Builtins.html
+ * * Built-in functions for memory model aware atomic operations
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fatomic-Builtins.html
+ *
+ * src/include/storage/atomics-generic-gcc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+#define pg_compiler_barrier_impl()	__asm__ __volatile__("" ::: "memory");
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory
+ * barrier out of this compiler built-in.  But we prefer to rely on our
+ * own definitions where possible, and use this only as a fallback.
+ */
+#if !defined(pg_memory_barrier_impl)
+#	if defined(HAVE_GCC__PG_ATOMIC_INT64_CAS)
+#		define pg_memory_barrier_impl()		__atomic_thread_fence(__ATOMIC_SEQ_CST)
+#	elif (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
+#		define pg_memory_barrier_impl()		__sync_synchronize()
+#	endif
+#endif /* !defined(pg_memory_barrier_impl) */
+
+#if !defined(pg_read_barrier_impl) && defined(HAVE_GCC__PG_ATOMIC_INT64_CAS)
+#		define pg_read_barrier_impl()		__atomic_thread_fence(__ATOMIC_ACQUIRE)
+#endif
+
+#if !defined(pg_write_barrier_impl) && defined(HAVE_GCC__PG_ATOMIC_INT64_CAS)
+#		define pg_write_barrier_impl()		__atomic_thread_fence(__ATOMIC_RELEASE)
+#endif
+
+#ifdef HAVE_GCC__SYNC_INT32_TAS
+typedef struct pg_atomic_flag
+{
+	volatile uint32 value;
+} pg_atomic_flag;
+
+#define PG_ATOMIC_INIT_FLAG(flag) {0}
+#endif
+
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#ifdef HAVE_GCC__PG_ATOMIC_INT64_CAS
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#	define PG_HAVE_PG_ATOMIC_COMPARE_EXCHANGE_U64
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+#endif /* HAVE_GCC__PG_ATOMIC_INT64_CAS */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(HAVE_GCC__SYNC_INT32_TAS)
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	uint32 ret;
+	/* XXX: only a acquire barrier, make it a full one for now */
+	pg_memory_barrier_impl();
+	/* some platform only support a 1 here */
+	ret = __sync_lock_test_and_set(&ptr->value, 1);
+	return ret == 0;
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return ptr->value;
+}
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+STATIC_IF_INLINE void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: only a release barrier, make it a full one for now */
+	pg_memory_barrier_impl();
+	__sync_lock_release(&ptr->value);
+}
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+STATIC_IF_INLINE void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_clear_flag_impl(ptr);
+}
+
+#endif /* !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(HAVE_GCC__SYNC_INT32_TAS) */
+
+#if defined(HAVE_GCC__ATOMIC_INT32_CAS)
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+
+#elif defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#else
+#	error "strange configuration"
+#endif
+
+#ifdef HAVE_GCC__ATOMIC_INT64_CAS
+
+/* if __atomic is supported for 32 it's also supported for 64bit */
+#if defined(HAVE_GCC__ATOMIC_INT32_CAS)
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+
+#elif defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#else
+#	error "strange configuration"
+#endif
+
+#endif /* HAVE_GCC__ATOMIC_INT64_CAS */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-msvc.h b/src/include/storage/atomics-generic-msvc.h
new file mode 100644
index 0000000..724d9ee
--- /dev/null
+++ b/src/include/storage/atomics-generic-msvc.h
@@ -0,0 +1,67 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-msvc.h
+ *	  Atomic operations support when using MSVC
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Interlocked Variable Access
+ *   http://msdn.microsoft.com/en-us/library/ms684122%28VS.85%29.aspx
+ *
+ * src/include/storage/atomics-generic-msvc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <intrin.h>
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/* Should work on both MSVC and Borland. */
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier_impl()	_ReadWriteBarrier()
+
+#ifndef pg_memory_barrier_impl
+#define pg_memory_barrier_impl()		MemoryBarrier()
+#endif
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = InterlockedCompareExchange(&ptr->value, newval, expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+/*
+ * FIXME: Only available on Vista/2003 onwards, add test for that.
+ *
+ * Only supported on >486, but we require XP as a minimum baseline, which
+ * doesn't support the 486, so we don't need to add checks for that case.
+ */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = InterlockedCompareExchange64(&ptr->value, newval, expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-sunpro.h b/src/include/storage/atomics-generic-sunpro.h
new file mode 100644
index 0000000..d512409
--- /dev/null
+++ b/src/include/storage/atomics-generic-sunpro.h
@@ -0,0 +1,66 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-sunpro.h
+ *	  Atomic operations for solaris' CC
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * manpage for atomic_cas(3C)
+ *   http://www.unix.com/man-page/opensolaris/3c/atomic_cas/
+ *   http://docs.oracle.com/cd/E23824_01/html/821-1465/atomic-cas-3c.html
+ *
+ * src/include/storage/atomics-generic-sunpro.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+typedef pg_atomic_uint32 pg_atomic_flag;
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	current = atomic_cas_32(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	current = atomic_cas_64(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#endif
diff --git a/src/include/storage/atomics-generic-xlc.h b/src/include/storage/atomics-generic-xlc.h
new file mode 100644
index 0000000..6a47fe2
--- /dev/null
+++ b/src/include/storage/atomics-generic-xlc.h
@@ -0,0 +1,84 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-xlc.h
+ *	  Atomic operations for IBM's CC
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Synchronization and atomic built-in functions
+ *   http://publib.boulder.ibm.com/infocenter/lnxpcomp/v8v101/topic/com.ibm.xlcpp8l.doc/compiler/ref/bif_sync.htm
+ *
+ * src/include/storage/atomics-generic-xlc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+typedef pg_atomic_uint32 pg_atomic_flag;
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+
+/* FIXME: check for 64bit mode, only supported there */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	/*
+	 * xlc's documentation tells us:
+	 * "If __compare_and_swap is used as a locking primitive, insert a call to
+	 * the __isync built-in function at the start of any critical sections."
+	 */
+	__isync();
+
+	/*
+	 * XXX: __compare_and_swap is defined to take signed parameters, but that
+	 * shouldn't matter since we don't perform any arithmetic operations.
+	 */
+	current = (uint32)__compare_and_swap((volatile int*)ptr->value,
+										 (int)*expected, (int)newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	__isync();
+
+	current = (uint64)__compare_and_swaplp((volatile long*)ptr->value,
+										   (long)*expected, (long)newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+#endif /* defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64) */
+
+#endif
diff --git a/src/include/storage/atomics-generic.h b/src/include/storage/atomics-generic.h
new file mode 100644
index 0000000..eff8495
--- /dev/null
+++ b/src/include/storage/atomics-generic.h
@@ -0,0 +1,402 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic.h
+ *	  Atomic operations.
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * src/include/storage/atomics-generic.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#	error "should be included via atomics.h"
+#endif
+
+/* Need a way to implement spinlocks */
+#if !defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) && \
+	!defined(PG_HAVE_ATOMIC_EXCHANGE_U32) && \
+	!defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+#	error "postgres requires atomic ops for locking to be provided"
+#endif
+
+#ifndef pg_memory_barrier_impl
+#	error "postgres requires a memory barrier implementation"
+#endif
+
+#ifndef pg_compiler_barrier_impl
+#	error "postgres requires a compiler barrier implementation"
+#endif
+
+/*
+ * If read or write barriers are undefined, we upgrade them to full memory
+ * barriers.
+ */
+#if !defined(pg_read_barrier_impl)
+#	define pg_read_barrier_impl pg_memory_barrier_impl
+#endif
+#if !defined(pg_read_barrier_impl)
+#	define pg_write_barrier_impl pg_memory_barrier_impl
+#endif
+
+#ifndef PG_HAS_SPIN_DELAY
+#define PG_HAS_SPIN_DELAY
+#define pg_spin_delay_impl()	((void)0)
+#endif
+
+
+/* provide fallback */
+#ifndef PG_HAS_ATOMIC_TEST_SET_FLAG
+typedef pg_atomic_uint32 pg_atomic_flag;
+#define PG_ATOMIC_INIT_FLAG(flag) {0}
+#endif
+
+#if !defined(PG_DISABLE_64_BIT_ATOMICS) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#	define PG_HAVE_64_BIT_ATOMICS
+#endif
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#ifndef PG_HAS_ATOMIC_READ_U32
+#define PG_HAS_ATOMIC_READ_U32
+STATIC_IF_INLINE uint32
+pg_atomic_read_u32_impl(volatile pg_atomic_uint32 *ptr)
+{
+	return ptr->value;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_WRITE_U32
+#define PG_HAS_ATOMIC_WRITE_U32
+STATIC_IF_INLINE void
+pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	ptr->value = val;
+}
+#endif
+
+/*
+ * provide fallback for test_and_set using atomic_exchange if available
+ */
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+STATIC_IF_INLINE void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_exchange_u32_impl(ptr, &value, 1) == 0;
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return (bool) pg_atomic_read_u32_impl(ptr, &value);
+}
+
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+STATIC_IF_INLINE void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier();
+	pg_atomic_write_u32_impl(ptr, &value, 0);
+}
+
+/*
+ * provide fallback for test_and_set using atomic_compare_exchange if
+ * available.
+ */
+#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+STATIC_IF_INLINE void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	uint32 value = 0;
+	return pg_atomic_compare_exchange_u32_impl(ptr, &value, 1);
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return (bool) pg_atomic_read_u32_impl(ptr, &value);
+}
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+STATIC_IF_INLINE void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier();
+	pg_atomic_write_u32_impl(ptr, &value, 0);
+}
+
+#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG)
+#	error "No pg_atomic_test_and_set provided"
+#endif /* !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) */
+
+
+#ifndef PG_HAS_ATOMIC_INIT_U32
+#define PG_HAS_ATOMIC_INIT_U32
+STATIC_IF_INLINE void
+pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
+{
+	pg_atomic_write_u32_impl(ptr, val_);
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_EXCHANGE_U32
+#define PG_HAS_ATOMIC_EXCHANGE_U32
+STATIC_IF_INLINE uint32
+pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_ADD_U32
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 add_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_SUB_U32
+#define PG_HAS_ATOMIC_FETCH_SUB_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_sub_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 sub_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old - sub_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_AND_U32
+#define PG_HAS_ATOMIC_FETCH_AND_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_OR_U32
+#define PG_HAS_ATOMIC_FETCH_OR_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_ADD_FETCH_U32
+#define PG_HAS_ATOMIC_ADD_FETCH_U32
+STATIC_IF_INLINE uint32
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 add_)
+{
+	return pg_atomic_fetch_add_u32_impl(ptr, add_) + add_;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_SUB_FETCH_U32
+#define PG_HAS_ATOMIC_SUB_FETCH_U32
+STATIC_IF_INLINE uint32
+pg_atomic_sub_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 sub_)
+{
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_ADD_UNTIL_U32
+#define PG_HAS_ATOMIC_FETCH_ADD_UNTIL_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 add_, uint32 until)
+{
+	uint32 old;
+	while (true)
+	{
+		uint32 new_val;
+		old = pg_atomic_read_u32_impl(ptr);
+		if (old >= until)
+			break;
+
+		new_val = old + add_;
+		if (new_val > until)
+			new_val = until;
+
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, new_val))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifdef PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+
+#ifndef PG_HAS_ATOMIC_INIT_U64
+#define PG_HAS_ATOMIC_INIT_U64
+STATIC_IF_INLINE void
+pg_atomic_init_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val_)
+{
+	pg_atomic_write_u64_impl(ptr, val_);
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_READ_U64
+#define PG_HAS_ATOMIC_READ_U64
+STATIC_IF_INLINE uint64
+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
+	return ptr->value;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_WRITE_U64
+#define PG_HAS_ATOMIC_WRITE_U64
+STATIC_IF_INLINE void
+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value = val;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_ADD_U64
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 add_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_SUB_U64
+#define PG_HAS_ATOMIC_FETCH_SUB_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_sub_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 sub_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old - sub_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_AND_U64
+#define PG_HAS_ATOMIC_FETCH_AND_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_OR_U64
+#define PG_HAS_ATOMIC_FETCH_OR_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_ADD_FETCH_U64
+#define PG_HAS_ATOMIC_ADD_FETCH_U64
+STATIC_IF_INLINE uint64
+pg_atomic_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 add_)
+{
+	return pg_atomic_fetch_add_u64_impl(ptr, add_) + add_;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_SUB_FETCH_U64
+#define PG_HAS_ATOMIC_SUB_FETCH_U64
+STATIC_IF_INLINE uint64
+pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 sub_)
+{
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#endif /* PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64 */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics.h b/src/include/storage/atomics.h
new file mode 100644
index 0000000..39ed336
--- /dev/null
+++ b/src/include/storage/atomics.h
@@ -0,0 +1,543 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.h
+ *	  Atomic operations.
+ *
+ * Hardware and compiler dependent functions for manipulating memory
+ * atomically and dealing with cache coherency. Used to implement
+ * locking facilities and replacements.
+ *
+ * To bring up postgres on a platform/compiler at the very least
+ * either one of
+ * * pg_atomic_test_set_flag(), pg_atomic_init_flag(), pg_atomic_clear_flag()
+ * * pg_atomic_compare_exchange_u32()
+ * * pg_atomic_exchange_u32()
+ * and
+ * * pg_compiler_barrier(), pg_write_barrier(), pg_read_barrier()
+ * need to be implemented. There exist generic, hardware independent,
+ * implementations for several compilers which might be sufficient,
+ * although possibly not optimal, for a new platform.
+ *
+ * Use higher level functionality (lwlocks, spinlocks, heavyweight
+ * locks) whenever possible. Writing correct code using these
+ * facilities is hard.
+ *
+ * For an introduction to using memory barriers within the PostgreSQL backend,
+ * see src/backend/storage/lmgr/README.barrier
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/atomics.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ATOMICS_H
+#define ATOMICS_H
+
+#define INSIDE_ATOMICS_H
+
+/*
+ * Architecture specific implementations/considerations.
+ */
+#if defined(__arm__) || defined(__arm) || \
+	defined(__aarch64__) || defined(__aarch64)
+#	include "storage/atomics-arch-arm.h"
+#endif
+
+#if defined(__i386__) || defined(__i386)
+#	include "storage/atomics-arch-i386.h"
+#elif defined(__x86_64__)
+#	include "storage/atomics-arch-amd64.h"
+#elif defined(__ia64__) || defined(__ia64)
+#	include "storage/atomics-arch-ia64.h"
+#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+#	include "storage/atomics-arch-ppc.h"
+#elif defined(__alpha) || defined(__alpha__)
+#	include "storage/atomics-arch-alpha.h"
+#elif defined(__hppa) || defined(__hppa__)
+#	include "storage/atomics-arch-hppa.h"
+#endif
+
+/*
+ * Compiler specific, but architecture independent implementations
+ */
+
+/* gcc or compatible, including icc */
+#if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
+#	include "storage/atomics-generic-gcc.h"
+#elif defined(WIN32_ONLY_COMPILER)
+#	include "storage/atomics-generic-msvc.h"
+#elif defined(__hpux) && defined(__ia64) && !defined(__GNUC__)
+#	include "storage/atomics-generic-acc.h"
+#elif defined(__SUNPRO_C) && defined(__ia64) && !defined(__GNUC__)
+#	include "storage/atomics-generic-sunpro.h"
+#elif (defined(__IBMC__) || defined(__IBMCPP__)) && !defined(__GNUC__)
+#	include "storage/atomics-generic-xlc.h"
+#else
+#	error "unsupported compiler"
+#endif
+
+/* if we have spinlocks, but not atomic ops, emulate them */
+#if defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) && \
+	!defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
+#	define PG_USE_ATOMICS_EMULATION
+#	include "storage/atomics-generic-spinlock.h"
+#endif
+
+
+/*
+ * Provide fallback implementations for operations that aren't provided if
+ * possible.
+ */
+#include "storage/atomics-generic.h"
+
+/*
+ * pg_compiler_barrier - prevent the compiler from moving code
+ *
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ */
+#define pg_compiler_barrier()	pg_compiler_barrier_impl()
+
+/*
+ * pg_memory_barrier - prevent the CPU from reordering memory access
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ */
+#define pg_memory_barrier()	pg_memory_barrier_impl()
+
+/*
+ * pg_(read|write)_barrier - prevent the CPU from reordering memory access
+ *
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier.	Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores.  Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier.  In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ */
+#define pg_read_barrier()	pg_read_barrier_impl()
+#define pg_write_barrier()	pg_write_barrier_impl()
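+
+/*
+ * Illustrative sketch (not part of this patch's API; 'shared' is a made-up
+ * struct): a writer can publish data without a lock by separating the data
+ * store and the flag store with a write barrier; the reader pairs that with
+ * a read barrier between checking the flag and reading the data:
+ *
+ *		shared->data = value;
+ *		pg_write_barrier();
+ *		shared->ready = true;
+ *
+ *		if (shared->ready)
+ *		{
+ *			pg_read_barrier();
+ *			value = shared->data;
+ *		}
+ *
+ * See src/backend/storage/lmgr/README.barrier for the full rules.
+ */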
+
+/*
+ * pg_spin_delay - delay operation for use inside spin-wait loops
+ */
+#define pg_spin_delay()	pg_spin_delay_impl()
+
+
+/*
+ * pg_atomic_init_flag - initialize lock
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_test_set_flag - TAS()
+ *
+ * Full barrier semantics. (XXX? Only acquire?)
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_unlocked_test_flag - test whether the flag is set, without setting it
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_clear_flag - release lock set by TAS()
+ *
+ * Full barrier semantics. (XXX? Only release?)
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
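+
+/*
+ * Illustrative sketch (not part of this patch): taken together, the flag
+ * operations can serve as a minimal test-and-set lock, with a hypothetical
+ * flag 'lck' living in shared memory:
+ *
+ *		pg_atomic_init_flag(&lck);			-- once, before first use
+ *
+ *		while (!pg_atomic_test_set_flag(&lck))
+ *			pg_spin_delay();				-- somebody else holds it
+ *		... critical section ...
+ *		pg_atomic_clear_flag(&lck);			-- release
+ */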
+
+/*
+ * pg_atomic_init_u32 - initialize atomic variable
+ *
+ * Has to be done before any usage, except when the atomic variable is
+ * declared statically, in which case it is initialized to 0.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+
+/*
+ * pg_atomic_read_u32 - read from atomic variable
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr);
+
+/*
+ * pg_atomic_write_u32 - write to atomic variable
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+
+/*
+ * pg_atomic_exchange_u32 - exchange newval with current value
+ *
+ * Returns the old value of 'ptr' before the swap.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval);
+
+/*
+ * pg_atomic_compare_exchange_u32 - CAS operation
+ *
+ * Atomically compare the current value of *ptr with *expected and store
+ * newval iff they are equal. The current value of *ptr is always written
+ * back into *expected.
+ *
+ * Return whether the values have been exchanged.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+							   uint32 *expected, uint32 newval);
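+
+/*
+ * Illustrative sketch (not part of this patch): because *expected is
+ * overwritten with the current value when the CAS fails, callers typically
+ * loop, recomputing the desired value from the updated 'old' each round,
+ * e.g. to atomically double a hypothetical counter 'var':
+ *
+ *		uint32 old = pg_atomic_read_u32(&var);
+ *		while (!pg_atomic_compare_exchange_u32(&var, &old, old * 2))
+ *			;	-- 'old' now holds the value we raced against; retry
+ *
+ * The generic fallbacks in atomics-generic.h are built exactly this way.
+ */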
+
+/*
+ * pg_atomic_fetch_add_u32 - atomically add to variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, uint32 add_);
+
+/*
+ * pg_atomic_fetch_sub_u32 - atomically subtract from variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, uint32 sub_);
+
+/*
+ * pg_atomic_fetch_and_u32 - atomically bit-and and_ with variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_);
+
+/*
+ * pg_atomic_fetch_or_u32 - atomically bit-or or_ with variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_);
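+
+/*
+ * Illustrative sketch (not part of this patch; FLAG_BIT is a made-up
+ * constant): fetch_or/fetch_and can atomically set and clear flag bits,
+ * and the returned old value shows whether the bit was already set:
+ *
+ *		was_set = (pg_atomic_fetch_or_u32(&var, FLAG_BIT) & FLAG_BIT) != 0;
+ *		pg_atomic_fetch_and_u32(&var, ~FLAG_BIT);	-- clear it again
+ */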
+
+/*
+ * pg_atomic_add_fetch_u32 - atomically add to variable
+ *
+ * Returns the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 add_);
+
+/*
+ * pg_atomic_sub_fetch_u32 - atomically subtract from variable
+ *
+ * Returns the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 sub_);
+
+/*
+ * pg_atomic_fetch_add_until_u32 - saturated addition to variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, uint32 add_,
+							  uint32 until);
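+
+/*
+ * Illustrative sketch (not part of this patch; MAX_COUNT is a made-up
+ * constant): a counter that must never exceed MAX_COUNT can be bumped with
+ *
+ *		pg_atomic_fetch_add_until_u32(&counter, 1, MAX_COUNT);
+ *
+ * which adds 1 but never pushes the stored value past MAX_COUNT.
+ */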
+
+/* ----
+ * The 64 bit operations have the same semantics as their 32bit counterparts if
+ * they are available.
+ * ---
+ */
+#ifdef PG_HAVE_64_BIT_ATOMICS
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr);
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval);
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+							   uint64 *expected, uint64 newval);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, uint64 add_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, uint64 sub_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_add_fetch_u64(volatile pg_atomic_uint64 *ptr, uint64 add_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, uint64 sub_);
+
+#endif /* PG_HAVE_64_BIT_ATOMICS */
+
+/*
+ * The following functions are wrapper functions around the platform specific
+ * implementation of the atomic operations performing common checks.
+ */
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \
+	Assert(TYPEALIGN(bndr, (uintptr_t)(ptr)) == (uintptr_t)(ptr))
+
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	pg_atomic_clear_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	return pg_atomic_test_set_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	return pg_atomic_unlocked_test_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_clear_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	pg_atomic_clear_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	pg_atomic_init_u32_impl(ptr, val);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	pg_atomic_write_u32_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_read_u32_impl(ptr);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	return pg_atomic_exchange_u32_impl(ptr, newval);
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+							   uint32 *expected, uint32 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	CHECK_POINTER_ALIGNMENT(expected, 4);
+
+	return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, uint32 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_u32_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, uint32 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_and_u32_impl(ptr, and_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_or_u32_impl(ptr, or_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_add_fetch_u32_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_sub_fetch_u32_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, uint32 add_,
+							  uint32 until)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
+}
+
+#ifdef PG_HAVE_64_BIT_ATOMICS
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+
+	pg_atomic_init_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_read_u64_impl(ptr);
+}
+
+STATIC_IF_INLINE void
+pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	pg_atomic_write_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+
+	return pg_atomic_exchange_u64_impl(ptr, newval);
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+							   uint64 *expected, uint64 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	CHECK_POINTER_ALIGNMENT(expected, 8);
+	return pg_atomic_compare_exchange_u64_impl(ptr, expected, newval);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, uint64 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_add_u64_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, uint64 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_and_u64_impl(ptr, and_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_or_u64_impl(ptr, or_);
+}
+#endif /* PG_HAVE_64_BIT_ATOMICS */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
+
+#undef INSIDE_ATOMICS_H
+
+#endif /* ATOMICS_H */
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index 16a9a0c..607a3c9 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -13,137 +13,12 @@
 #ifndef BARRIER_H
 #define BARRIER_H
 
-#include "storage/s_lock.h"
-
-extern slock_t dummy_spinlock;
-
-/*
- * A compiler barrier need not (and preferably should not) emit any actual
- * machine code, but must act as an optimization fence: the compiler must not
- * reorder loads or stores to main memory around the barrier.  However, the
- * CPU may still reorder loads or stores at runtime, if the architecture's
- * memory model permits this.
- *
- * A memory barrier must act as a compiler barrier, and in addition must
- * guarantee that all loads and stores issued prior to the barrier are
- * completed before any loads or stores issued after the barrier.  Unless
- * loads and stores are totally ordered (which is not the case on most
- * architectures) this requires issuing some sort of memory fencing
- * instruction.
- *
- * A read barrier must act as a compiler barrier, and in addition must
- * guarantee that any loads issued prior to the barrier are completed before
- * any loads issued after the barrier.	Similarly, a write barrier acts
- * as a compiler barrier, and also orders stores.  Read and write barriers
- * are thus weaker than a full memory barrier, but stronger than a compiler
- * barrier.  In practice, on machines with strong memory ordering, read and
- * write barriers may require nothing more than a compiler barrier.
- *
- * For an introduction to using memory barriers within the PostgreSQL backend,
- * see src/backend/storage/lmgr/README.barrier
- */
-
-#if defined(DISABLE_BARRIERS)
-
-/*
- * Fall through to the spinlock-based implementation.
- */
-#elif defined(__INTEL_COMPILER)
-
-/*
- * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
- */
-#if defined(__ia64__) || defined(__ia64)
-#define pg_memory_barrier()		__mf()
-#elif defined(__i386__) || defined(__x86_64__)
-#define pg_memory_barrier()		_mm_mfence()
-#endif
-
-#define pg_compiler_barrier()	__memory_barrier()
-#elif defined(__GNUC__)
-
-/* This works on any architecture, since it's only talking to GCC itself. */
-#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
-
-#if defined(__i386__)
-
 /*
- * i386 does not allow loads to be reordered with other loads, or stores to be
- * reordered with other stores, but a load can be performed before a subsequent
- * store.
- *
- * "lock; addl" has worked for longer than "mfence".
+ * This used to be a separate file, full of compiler/architecture
+ * dependent defines, but those are now subsumed by the atomics.h
+ * infrastructure; this header is only kept for backward compatibility.
  */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__x86_64__)		/* 64 bit x86 */
-
-/*
- * x86_64 has similar ordering characteristics to i386.
- *
- * Technically, some x86-ish chips support uncached memory access and/or
- * special instructions that are weakly ordered.  In those cases we'd need
- * the read and write barriers to be lfence and sfence.  But since we don't
- * do those things, a compiler barrier should be enough.
- */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__ia64__) || defined(__ia64)
-
-/*
- * Itanium is weakly ordered, so read and write barriers require a full
- * fence.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")
-#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
-
-/*
- * lwsync orders loads with respect to each other, and similarly with stores.
- * But a load can be performed before a subsequent store, so sync must be used
- * for a full memory barrier.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
-#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-#elif defined(__alpha) || defined(__alpha__)	/* Alpha */
-
-/*
- * Unlike all other known architectures, Alpha allows dependent reads to be
- * reordered, but we don't currently find it necessary to provide a conditional
- * read barrier to cover that case.  We might need to add that later.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
-#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
-#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
-#elif defined(__hppa) || defined(__hppa__)		/* HP PA-RISC */
-
-/* HPPA doesn't do either read or write reordering */
-#define pg_memory_barrier()		pg_compiler_barrier()
-#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
-
-/*
- * If we're on GCC 4.1.0 or higher, we should be able to get a memory
- * barrier out of this compiler built-in.  But we prefer to rely on our
- * own definitions where possible, and use this only as a fallback.
- */
-#define pg_memory_barrier()		__sync_synchronize()
-#endif
-#elif defined(__ia64__) || defined(__ia64)
-
-#define pg_compiler_barrier()	_Asm_sched_fence()
-#define pg_memory_barrier()		_Asm_mf()
-#elif defined(WIN32_ONLY_COMPILER)
-
-/* Should work on both MSVC and Borland. */
-#include <intrin.h>
-#pragma intrinsic(_ReadWriteBarrier)
-#define pg_compiler_barrier()	_ReadWriteBarrier()
-#define pg_memory_barrier()		MemoryBarrier()
-#endif
+#include "storage/atomics.h"
 
 /*
  * If we have no memory barrier implementation for this architecture, we
@@ -157,6 +32,10 @@ extern slock_t dummy_spinlock;
  * fence.  But all of our actual implementations seem OK in this regard.
  */
 #if !defined(pg_memory_barrier)
+#include "storage/s_lock.h"
+
+extern slock_t dummy_spinlock;
+
 #define pg_memory_barrier() \
 	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
 #endif
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 7dcd5d9..73aebfa 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -21,10 +21,6 @@
  *	void S_UNLOCK(slock_t *lock)
  *		Unlock a previously acquired lock.
  *
- *	bool S_LOCK_FREE(slock_t *lock)
- *		Tests if the lock is free. Returns TRUE if free, FALSE if locked.
- *		This does *not* change the state of the lock.
- *
  *	void SPIN_DELAY(void)
  *		Delay operation to occur inside spinlock wait loop.
  *
@@ -94,879 +90,17 @@
 #ifndef S_LOCK_H
 #define S_LOCK_H
 
+#include "storage/atomics.h"
 #include "storage/pg_sema.h"
 
-#ifdef HAVE_SPINLOCKS	/* skip spinlocks if requested */
-
-
-#if defined(__GNUC__) || defined(__INTEL_COMPILER)
-/*************************************************************************
- * All the gcc inlines
- * Gcc consistently defines the CPU as __cpu__.
- * Other compilers use __cpu or __cpu__ so we test for both in those cases.
- */
-
-/*----------
- * Standard gcc asm format (assuming "volatile slock_t *lock"):
-
-	__asm__ __volatile__(
-		"	instruction	\n"
-		"	instruction	\n"
-		"	instruction	\n"
-:		"=r"(_res), "+m"(*lock)		// return register, in/out lock value
-:		"r"(lock)					// lock pointer, in input register
-:		"memory", "cc");			// show clobbered registers here
-
- * The output-operands list (after first colon) should always include
- * "+m"(*lock), whether or not the asm code actually refers to this
- * operand directly.  This ensures that gcc believes the value in the
- * lock variable is used and set by the asm code.  Also, the clobbers
- * list (after third colon) should always include "memory"; this prevents
- * gcc from thinking it can cache the values of shared-memory fields
- * across the asm code.  Add "cc" if your asm code changes the condition
- * code register, and also list any temp registers the code uses.
- *----------
- */
-
-
-#ifdef __i386__		/* 32-bit i386 */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res = 1;
-
-	/*
-	 * Use a non-locking test before asserting the bus lock.  Note that the
-	 * extra test appears to be a small loss on some x86 platforms and a small
-	 * win on others; it's by no means clear that we should keep it.
-	 *
-	 * When this was last tested, we didn't have separate TAS() and TAS_SPIN()
-	 * macros.  Nowadays it probably would be better to do a non-locking test
-	 * in TAS_SPIN() but not in TAS(), like on x86_64, but no-one's done the
-	 * testing to verify that.  Without some empirical evidence, better to
-	 * leave it alone.
-	 */
-	__asm__ __volatile__(
-		"	cmpb	$0,%1	\n"
-		"	jne		1f		\n"
-		"	lock			\n"
-		"	xchgb	%0,%1	\n"
-		"1: \n"
-:		"+q"(_res), "+m"(*lock)
-:
-:		"memory", "cc");
-	return (int) _res;
-}
-
-#define SPIN_DELAY() spin_delay()
-
-static __inline__ void
-spin_delay(void)
-{
-	/*
-	 * This sequence is equivalent to the PAUSE instruction ("rep" is
-	 * ignored by old IA32 processors if the following instruction is
-	 * not a string operation); the IA-32 Architecture Software
-	 * Developer's Manual, Vol. 3, Section 7.7.2 describes why using
-	 * PAUSE in the inner loop of a spin lock is necessary for good
-	 * performance:
-	 *
-	 *     The PAUSE instruction improves the performance of IA-32
-	 *     processors supporting Hyper-Threading Technology when
-	 *     executing spin-wait loops and other routines where one
-	 *     thread is accessing a shared lock or semaphore in a tight
-	 *     polling loop. When executing a spin-wait loop, the
-	 *     processor can suffer a severe performance penalty when
-	 *     exiting the loop because it detects a possible memory order
-	 *     violation and flushes the core processor's pipeline. The
-	 *     PAUSE instruction provides a hint to the processor that the
-	 *     code sequence is a spin-wait loop. The processor uses this
-	 *     hint to avoid the memory order violation and prevent the
-	 *     pipeline flush. In addition, the PAUSE instruction
-	 *     de-pipelines the spin-wait loop to prevent it from
-	 *     consuming execution resources excessively.
-	 */
-	__asm__ __volatile__(
-		" rep; nop			\n");
-}
-
-#endif	 /* __i386__ */
-
-
-#ifdef __x86_64__		/* AMD Opteron, Intel EM64T */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-/*
- * On Intel EM64T, it's a win to use a non-locking test before the xchg proper,
- * but only when spinning.
- *
- * See also Implementing Scalable Atomic Locks for Multi-Core Intel(tm) EM64T
- * and IA32, by Michael Chynoweth and Mary R. Lee. As of this writing, it is
- * available at:
- * http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures
- */
-#define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res = 1;
-
-	__asm__ __volatile__(
-		"	lock			\n"
-		"	xchgb	%0,%1	\n"
-:		"+q"(_res), "+m"(*lock)
-:
-:		"memory", "cc");
-	return (int) _res;
-}
-
-#define SPIN_DELAY() spin_delay()
-
-static __inline__ void
-spin_delay(void)
-{
-	/*
-	 * Adding a PAUSE in the spin delay loop is demonstrably a no-op on
-	 * Opteron, but it may be of some use on EM64T, so we keep it.
-	 */
-	__asm__ __volatile__(
-		" rep; nop			\n");
-}
-
-#endif	 /* __x86_64__ */
-
-
-#if defined(__ia64__) || defined(__ia64)
-/*
- * Intel Itanium, gcc or Intel's compiler.
- *
- * Itanium has weak memory ordering, but we rely on the compiler to enforce
- * strict ordering of accesses to volatile data.  In particular, while the
- * xchg instruction implicitly acts as a memory barrier with 'acquire'
- * semantics, we do not have an explicit memory fence instruction in the
- * S_UNLOCK macro.  We use a regular assignment to clear the spinlock, and
- * trust that the compiler marks the generated store instruction with the
- * ".rel" opcode.
- *
- * Testing shows that assumption to hold on gcc, although I could not find
- * any explicit statement on that in the gcc manual.  In Intel's compiler,
- * the -m[no-]serialize-volatile option controls that, and testing shows that
- * it is enabled by default.
- */
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#define TAS(lock) tas(lock)
-
-/* On IA64, it's a win to use a non-locking test before the xchg proper */
-#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
-
-#ifndef __INTEL_COMPILER
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	long int	ret;
-
-	__asm__ __volatile__(
-		"	xchg4 	%0=%1,%2	\n"
-:		"=r"(ret), "+m"(*lock)
-:		"r"(1)
-:		"memory");
-	return (int) ret;
-}
-
-#else /* __INTEL_COMPILER */
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	int		ret;
-
-	ret = _InterlockedExchange(lock,1);	/* this is a xchg asm macro */
-
-	return ret;
-}
-
-#endif /* __INTEL_COMPILER */
-#endif	 /* __ia64__ || __ia64 */
-
-
-/*
- * On ARM, we use __sync_lock_test_and_set(int *, int) if available, and if
- * not fall back on the SWPB instruction.  SWPB does not work on ARMv6 or
- * later, so the compiler builtin is preferred if available.  Note also that
- * the int-width variant of the builtin works on more chips than other widths.
- */
-#if defined(__arm__) || defined(__arm)
-#define HAS_TEST_AND_SET
-
-#define TAS(lock) tas(lock)
-
-#ifdef HAVE_GCC_INT_ATOMICS
-
-typedef int slock_t;
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	return __sync_lock_test_and_set(lock, 1);
-}
-
-#define S_UNLOCK(lock) __sync_lock_release(lock)
-
-#else /* !HAVE_GCC_INT_ATOMICS */
-
-typedef unsigned char slock_t;
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res = 1;
-
-	__asm__ __volatile__(
-		"	swpb 	%0, %0, [%2]	\n"
-:		"+r"(_res), "+m"(*lock)
-:		"r"(lock)
-:		"memory");
-	return (int) _res;
-}
-
-#endif	 /* HAVE_GCC_INT_ATOMICS */
-#endif	 /* __arm__ */
-
-
-/*
- * On ARM64, we use __sync_lock_test_and_set(int *, int) if available.
- */
-#if defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC_INT_ATOMICS
-#define HAS_TEST_AND_SET
-
-#define TAS(lock) tas(lock)
-
-typedef int slock_t;
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	return __sync_lock_test_and_set(lock, 1);
-}
-
-#define S_UNLOCK(lock) __sync_lock_release(lock)
-
-#endif	 /* HAVE_GCC_INT_ATOMICS */
-#endif	 /* __aarch64__ */
-
-
-/* S/390 and S/390x Linux (32- and 64-bit zSeries) */
-#if defined(__s390__) || defined(__s390x__)
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#define TAS(lock)	   tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	int			_res = 0;
-
-	__asm__	__volatile__(
-		"	cs 	%0,%3,0(%2)		\n"
-:		"+d"(_res), "+m"(*lock)
-:		"a"(lock), "d"(1)
-:		"memory", "cc");
-	return _res;
-}
-
-#endif	 /* __s390__ || __s390x__ */
-
-
-#if defined(__sparc__)		/* Sparc */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res;
-
-	/*
-	 *	See comment in /pg/backend/port/tas/solaris_sparc.s for why this
-	 *	uses "ldstub", and that file uses "cas".  gcc currently generates
-	 *	sparcv7-targeted binaries, so "cas" use isn't possible.
-	 */
-	__asm__ __volatile__(
-		"	ldstub	[%2], %0	\n"
-:		"=r"(_res), "+m"(*lock)
-:		"r"(lock)
-:		"memory");
-	return (int) _res;
-}
-
-#endif	 /* __sparc__ */
-
-
-/* PowerPC */
-#if defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#define TAS(lock) tas(lock)
-
-/* On PPC, it's a win to use a non-locking test before the lwarx */
-#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
-
-/*
- * NOTE: per the Enhanced PowerPC Architecture manual, v1.0 dated 7-May-2002,
- * an isync is a sufficient synchronization barrier after a lwarx/stwcx loop.
- * On newer machines, we can use lwsync instead for better performance.
- */
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	slock_t _t;
-	int _res;
-
-	__asm__ __volatile__(
-#ifdef USE_PPC_LWARX_MUTEX_HINT
-"	lwarx   %0,0,%3,1	\n"
-#else
-"	lwarx   %0,0,%3		\n"
-#endif
-"	cmpwi   %0,0		\n"
-"	bne     1f			\n"
-"	addi    %0,%0,1		\n"
-"	stwcx.  %0,0,%3		\n"
-"	beq     2f         	\n"
-"1:	li      %1,1		\n"
-"	b		3f			\n"
-"2:						\n"
-#ifdef USE_PPC_LWSYNC
-"	lwsync				\n"
-#else
-"	isync				\n"
-#endif
-"	li      %1,0		\n"
-"3:						\n"
-
-:	"=&r"(_t), "=r"(_res), "+m"(*lock)
-:	"r"(lock)
-:	"memory", "cc");
-	return _res;
-}
-
-/*
- * PowerPC S_UNLOCK is almost standard but requires a "sync" instruction.
- * On newer machines, we can use lwsync instead for better performance.
- */
-#ifdef USE_PPC_LWSYNC
-#define S_UNLOCK(lock)	\
-do \
-{ \
-	__asm__ __volatile__ ("	lwsync \n"); \
-	*((volatile slock_t *) (lock)) = 0; \
-} while (0)
-#else
-#define S_UNLOCK(lock)	\
-do \
-{ \
-	__asm__ __volatile__ ("	sync \n"); \
-	*((volatile slock_t *) (lock)) = 0; \
-} while (0)
-#endif /* USE_PPC_LWSYNC */
-
-#endif /* powerpc */
-
-
-/* Linux Motorola 68k */
-#if (defined(__mc68000__) || defined(__m68k__)) && defined(__linux__)
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register int rv;
-
-	__asm__	__volatile__(
-		"	clrl	%0		\n"
-		"	tas		%1		\n"
-		"	sne		%0		\n"
-:		"=d"(rv), "+m"(*lock)
-:
-:		"memory", "cc");
-	return rv;
-}
-
-#endif	 /* (__mc68000__ || __m68k__) && __linux__ */
-
-
-/*
- * VAXen -- even multiprocessor ones
- * (thanks to Tom Ivar Helbekkmo)
- */
-#if defined(__vax__)
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register int	_res;
-
-	__asm__ __volatile__(
-		"	movl 	$1, %0			\n"
-		"	bbssi	$0, (%2), 1f	\n"
-		"	clrl	%0				\n"
-		"1: \n"
-:		"=&r"(_res), "+m"(*lock)
-:		"r"(lock)
-:		"memory");
-	return _res;
-}
-
-#endif	 /* __vax__ */
-
-#if defined(__alpha) || defined(__alpha__)	/* Alpha */
-/*
- * Correct multi-processor locking methods are explained in section 5.5.3
- * of the Alpha AXP Architecture Handbook, which at this writing can be
- * found at ftp://ftp.netbsd.org/pub/NetBSD/misc/dec-docs/index.html.
- * For gcc we implement the handbook's code directly with inline assembler.
- */
-#define HAS_TEST_AND_SET
-
-typedef unsigned long slock_t;
-
-#define TAS(lock)  tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res;
-
-	__asm__	__volatile__(
-		"	ldq		$0, %1	\n"
-		"	bne		$0, 2f	\n"
-		"	ldq_l	%0, %1	\n"
-		"	bne		%0, 2f	\n"
-		"	mov		1,  $0	\n"
-		"	stq_c	$0, %1	\n"
-		"	beq		$0, 2f	\n"
-		"	mb				\n"
-		"	br		3f		\n"
-		"2:	mov		1, %0	\n"
-		"3:					\n"
-:		"=&r"(_res), "+m"(*lock)
-:
-:		"memory", "0");
-	return (int) _res;
-}
-
-#define S_UNLOCK(lock)	\
-do \
-{\
-	__asm__ __volatile__ ("	mb \n"); \
-	*((volatile slock_t *) (lock)) = 0; \
-} while (0)
-
-#endif /* __alpha || __alpha__ */
-
-
-#if defined(__mips__) && !defined(__sgi)	/* non-SGI MIPS */
-/* Note: on SGI we use the OS' mutex ABI, see below */
-/* Note: R10000 processors require a separate SYNC */
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register volatile slock_t *_l = lock;
-	register int _res;
-	register int _tmp;
-
-	__asm__ __volatile__(
-		"       .set push           \n"
-		"       .set mips2          \n"
-		"       .set noreorder      \n"
-		"       .set nomacro        \n"
-		"       ll      %0, %2      \n"
-		"       or      %1, %0, 1   \n"
-		"       sc      %1, %2      \n"
-		"       xori    %1, 1       \n"
-		"       or      %0, %0, %1  \n"
-		"       sync                \n"
-		"       .set pop              "
-:		"=&r" (_res), "=&r" (_tmp), "+R" (*_l)
-:
-:		"memory");
-	return _res;
-}
-
-/* MIPS S_UNLOCK is almost standard but requires a "sync" instruction */
-#define S_UNLOCK(lock)	\
-do \
-{ \
-	__asm__ __volatile__( \
-		"       .set push           \n" \
-		"       .set mips2          \n" \
-		"       .set noreorder      \n" \
-		"       .set nomacro        \n" \
-		"       sync                \n" \
-		"       .set pop              "); \
-	*((volatile slock_t *) (lock)) = 0; \
-} while (0)
-
-#endif /* __mips__ && !__sgi */
-
-
-#if defined(__m32r__) && defined(HAVE_SYS_TAS_H)	/* Renesas' M32R */
-#define HAS_TEST_AND_SET
-
-#include <sys/tas.h>
-
-typedef int slock_t;
-
-#define TAS(lock) tas(lock)
-
-#endif /* __m32r__ */
-
-
-#if defined(__sh__)				/* Renesas' SuperH */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register int _res;
-
-	/*
-	 * This asm is coded as if %0 could be any register, but actually SuperH
-	 * restricts the target of xor-immediate to be R0.  That's handled by
-	 * the "z" constraint on _res.
-	 */
-	__asm__ __volatile__(
-		"	tas.b @%2    \n"
-		"	movt  %0     \n"
-		"	xor   #1,%0  \n"
-:		"=z"(_res), "+m"(*lock)
-:		"r"(lock)
-:		"memory", "t");
-	return _res;
-}
-
-#endif	 /* __sh__ */
-
-
-/* These live in s_lock.c, but only for gcc */
-
-
-#if defined(__m68k__) && !defined(__linux__)	/* non-Linux Motorola 68k */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-#endif
-
-
-#endif	/* defined(__GNUC__) || defined(__INTEL_COMPILER) */
-
-
-
-/*
- * ---------------------------------------------------------------------
- * Platforms that use non-gcc inline assembly:
- * ---------------------------------------------------------------------
- */
-
-#if !defined(HAS_TEST_AND_SET)	/* We didn't trigger above, let's try here */
-
-
-#if defined(USE_UNIVEL_CC)		/* Unixware compiler */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock)	tas(lock)
-
-asm int
-tas(volatile slock_t *s_lock)
-{
-/* UNIVEL wants %mem in column 1, so we don't pg_indent this file */
-%mem s_lock
-	pushl %ebx
-	movl s_lock, %ebx
-	movl $255, %eax
-	lock
-	xchgb %al, (%ebx)
-	popl %ebx
-}
-
-#endif	 /* defined(USE_UNIVEL_CC) */
-
-
-#if defined(__alpha) || defined(__alpha__)	/* Tru64 Unix Alpha compiler */
-/*
- * The Tru64 compiler doesn't support gcc-style inline asm, but it does
- * have some builtin functions that accomplish much the same results.
- * For simplicity, slock_t is defined as long (ie, quadword) on Alpha
- * regardless of the compiler in use.  LOCK_LONG and UNLOCK_LONG only
- * operate on an int (ie, longword), but that's OK as long as we define
- * S_INIT_LOCK to zero out the whole quadword.
- */
-#define HAS_TEST_AND_SET
-
-typedef unsigned long slock_t;
-
-#include <alpha/builtins.h>
-#define S_INIT_LOCK(lock)  (*(lock) = 0)
-#define TAS(lock)		   (__LOCK_LONG_RETRY((lock), 1) == 0)
-#define S_UNLOCK(lock)	   __UNLOCK_LONG(lock)
-
-#endif	 /* __alpha || __alpha__ */
-
-
-#if defined(__hppa) || defined(__hppa__)	/* HP PA-RISC, GCC and HP compilers */
-/*
- * HP's PA-RISC
- *
- * See src/backend/port/hpux/tas.c.template for details about LDCWX.  Because
- * LDCWX requires a 16-byte-aligned address, we declare slock_t as a 16-byte
- * struct.  The active word in the struct is whichever has the aligned address;
- * the other three words just sit at -1.
- *
- * When using gcc, we can inline the required assembly code.
- */
-#define HAS_TEST_AND_SET
-
-typedef struct
-{
-	int			sema[4];
-} slock_t;
-
-#define TAS_ACTIVE_WORD(lock)	((volatile int *) (((uintptr_t) (lock) + 15) & ~15))
-
-#if defined(__GNUC__)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	volatile int *lockword = TAS_ACTIVE_WORD(lock);
-	register int lockval;
-
-	__asm__ __volatile__(
-		"	ldcwx	0(0,%2),%0	\n"
-:		"=r"(lockval), "+m"(*lockword)
-:		"r"(lockword)
-:		"memory");
-	return (lockval == 0);
-}
-
-#endif /* __GNUC__ */
-
-#define S_UNLOCK(lock)	(*TAS_ACTIVE_WORD(lock) = -1)
-
-#define S_INIT_LOCK(lock) \
-	do { \
-		volatile slock_t *lock_ = (lock); \
-		lock_->sema[0] = -1; \
-		lock_->sema[1] = -1; \
-		lock_->sema[2] = -1; \
-		lock_->sema[3] = -1; \
-	} while (0)
-
-#define S_LOCK_FREE(lock)	(*TAS_ACTIVE_WORD(lock) != 0)
-
-#endif	 /* __hppa || __hppa__ */
-
-
-#if defined(__hpux) && defined(__ia64) && !defined(__GNUC__)
-/*
- * HP-UX on Itanium, non-gcc compiler
- *
- * We assume that the compiler enforces strict ordering of loads/stores on
- * volatile data (see comments on the gcc-version earlier in this file).
- * Note that this assumption does *not* hold if you use the
- * +Ovolatile=__unordered option on the HP-UX compiler, so don't do that.
- *
- * See also Implementing Spinlocks on the Intel Itanium Architecture and
- * PA-RISC, by Tor Ekqvist and David Graves, for more information.  As of
- * this writing, version 1.0 of the manual is available at:
- * http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf
- */
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#include <ia64/sys/inline.h>
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
-/* On IA64, it's a win to use a non-locking test before the xchg proper */
-#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
-
-#endif	/* HPUX on IA64, non gcc */
-
-#if defined(_AIX)	/* AIX */
-/*
- * AIX (POWER)
- */
-#define HAS_TEST_AND_SET
-
-#include <sys/atomic_op.h>
-
-typedef int slock_t;
-
-#define TAS(lock)			_check_lock((slock_t *) (lock), 0, 1)
-#define S_UNLOCK(lock)		_clear_lock((slock_t *) (lock), 0)
-#endif	 /* _AIX */
-
-
-/* These are in s_lock.c */
-
-#if defined(__SUNPRO_C) && (defined(__i386) || defined(__x86_64__) || defined(__sparc__) || defined(__sparc))
-#define HAS_TEST_AND_SET
-
-#if defined(__i386) || defined(__x86_64__) || defined(__sparcv9) || defined(__sparcv8plus)
-typedef unsigned int slock_t;
-#else
-typedef unsigned char slock_t;
-#endif
-
-extern slock_t pg_atomic_cas(volatile slock_t *lock, slock_t with,
-									  slock_t cmp);
-
-#define TAS(a) (pg_atomic_cas((a), 1, 0) != 0)
-#endif
-
-
-#ifdef WIN32_ONLY_COMPILER
-typedef LONG slock_t;
-
-#define HAS_TEST_AND_SET
-#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
-
-#define SPIN_DELAY() spin_delay()
-
-/* If using Visual C++ on Win64, inline assembly is unavailable.
- * Use a _mm_pause instrinsic instead of rep nop.
- */
-#if defined(_WIN64)
-static __forceinline void
-spin_delay(void)
-{
-	_mm_pause();
-}
-#else
-static __forceinline void
-spin_delay(void)
-{
-	/* See comment for gcc code. Same code, MASM syntax */
-	__asm rep nop;
-}
-#endif
-
-#endif
-
-
-#endif	/* !defined(HAS_TEST_AND_SET) */
-
-
-/* Blow up if we didn't have any way to do spinlocks */
-#ifndef HAS_TEST_AND_SET
-#error PostgreSQL does not have native spinlock support on this platform.  To continue the compilation, rerun configure using --disable-spinlocks.  However, performance will be poor.  Please report this to pgsql-bugs@postgresql.org.
-#endif
-
-
-#else	/* !HAVE_SPINLOCKS */
-
-
-/*
- * Fake spinlock implementation using semaphores --- slow and prone
- * to fall foul of kernel limits on number of semaphores, so don't use this
- * unless you must!  The subroutines appear in spin.c.
- */
-typedef PGSemaphoreData slock_t;
-
-extern bool s_lock_free_sema(volatile slock_t *lock);
-extern void s_unlock_sema(volatile slock_t *lock);
-extern void s_init_lock_sema(volatile slock_t *lock);
-extern int	tas_sema(volatile slock_t *lock);
-
-#define S_LOCK_FREE(lock)	s_lock_free_sema(lock)
-#define S_UNLOCK(lock)	 s_unlock_sema(lock)
-#define S_INIT_LOCK(lock)	s_init_lock_sema(lock)
-#define TAS(lock)	tas_sema(lock)
-
-
-#endif	/* HAVE_SPINLOCKS */
-
-
-/*
- * Default Definitions - override these above as needed.
- */
-
-#if !defined(S_LOCK)
+typedef pg_atomic_flag slock_t;
+#define TAS(lock)			(!pg_atomic_test_set_flag(lock))
+#define TAS_SPIN(lock)		(pg_atomic_unlocked_test_flag(lock) ? 1 : TAS(lock))
+#define S_UNLOCK(lock)		pg_atomic_clear_flag(lock)
+#define S_INIT_LOCK(lock)	pg_atomic_init_flag(lock)
+#define SPIN_DELAY()		pg_spin_delay()
 #define S_LOCK(lock) \
 	(TAS(lock) ? s_lock((lock), __FILE__, __LINE__) : 0)
-#endif	 /* S_LOCK */
-
-#if !defined(S_LOCK_FREE)
-#define S_LOCK_FREE(lock)	(*(lock) == 0)
-#endif	 /* S_LOCK_FREE */
-
-#if !defined(S_UNLOCK)
-#define S_UNLOCK(lock)		(*((volatile slock_t *) (lock)) = 0)
-#endif	 /* S_UNLOCK */
-
-#if !defined(S_INIT_LOCK)
-#define S_INIT_LOCK(lock)	S_UNLOCK(lock)
-#endif	 /* S_INIT_LOCK */
-
-#if !defined(SPIN_DELAY)
-#define SPIN_DELAY()	((void) 0)
-#endif	 /* SPIN_DELAY */
-
-#if !defined(TAS)
-extern int	tas(volatile slock_t *lock);		/* in port/.../tas.s, or
-												 * s_lock.c */
-
-#define TAS(lock)		tas(lock)
-#endif	 /* TAS */
-
-#if !defined(TAS_SPIN)
-#define TAS_SPIN(lock)	TAS(lock)
-#endif	 /* TAS_SPIN */
-
 
 /*
  * Platform-independent out-of-line support routines
diff --git a/src/test/regress/expected/lock.out b/src/test/regress/expected/lock.out
index 0d7c3ba..fd27344 100644
--- a/src/test/regress/expected/lock.out
+++ b/src/test/regress/expected/lock.out
@@ -60,3 +60,11 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
+ test_atomic_ops 
+-----------------
+ t
+(1 row)
+
diff --git a/src/test/regress/input/create_function_1.source b/src/test/regress/input/create_function_1.source
index aef1518..1fded84 100644
--- a/src/test/regress/input/create_function_1.source
+++ b/src/test/regress/input/create_function_1.source
@@ -57,6 +57,11 @@ CREATE FUNCTION make_tuple_indirect (record)
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
 
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
+
 -- Things that shouldn't work:
 
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
diff --git a/src/test/regress/output/create_function_1.source b/src/test/regress/output/create_function_1.source
index 9761d12..9926c90 100644
--- a/src/test/regress/output/create_function_1.source
+++ b/src/test/regress/output/create_function_1.source
@@ -51,6 +51,10 @@ CREATE FUNCTION make_tuple_indirect (record)
         RETURNS record
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
 -- Things that shouldn't work:
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
     AS 'SELECT ''not an integer'';';
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 3bd8a15..1db996a 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -16,6 +16,7 @@
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/spi.h"
+#include "storage/atomics.h"
 #include "utils/builtins.h"
 #include "utils/geo_decls.h"
 #include "utils/rel.h"
@@ -40,6 +41,7 @@ extern int	oldstyle_length(int n, text *t);
 extern Datum int44in(PG_FUNCTION_ARGS);
 extern Datum int44out(PG_FUNCTION_ARGS);
 extern Datum make_tuple_indirect(PG_FUNCTION_ARGS);
+extern Datum test_atomic_ops(PG_FUNCTION_ARGS);
 
 #ifdef PG_MODULE_MAGIC
 PG_MODULE_MAGIC;
@@ -829,3 +831,214 @@ make_tuple_indirect(PG_FUNCTION_ARGS)
 
 	PG_RETURN_HEAPTUPLEHEADER(newtup->t_data);
 }
+
+
+static pg_atomic_flag global_flag = PG_ATOMIC_INIT_FLAG();
+
+static void
+test_atomic_flag(void)
+{
+	pg_atomic_flag flag;
+
+	/* check the global flag */
+	if (pg_atomic_unlocked_test_flag(&global_flag))
+		elog(ERROR, "global_flag: unduly set");
+
+	if (!pg_atomic_test_set_flag(&global_flag))
+		elog(ERROR, "global_flag: couldn't set");
+
+	if (pg_atomic_test_set_flag(&global_flag))
+		elog(ERROR, "global_flag: set spuriously #1");
+
+	pg_atomic_clear_flag(&global_flag);
+
+	if (!pg_atomic_test_set_flag(&global_flag))
+		elog(ERROR, "global_flag: couldn't set");
+
+	/* check the local flag */
+	pg_atomic_init_flag(&flag);
+
+	if (pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unduly set");
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+
+	if (pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: set spuriously #2");
+
+	pg_atomic_clear_flag(&flag);
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+}
+
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+static void
+test_atomic_uint32(void)
+{
+	pg_atomic_uint32 var;
+	uint32 expected;
+	int i;
+
+	pg_atomic_init_u32(&var, 0);
+
+	if (pg_atomic_read_u32(&var) != 0)
+		elog(ERROR, "atomic_read_u32() #1 wrong");
+
+	pg_atomic_write_u32(&var, 3);
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "atomic_read_u32() #2 wrong");
+
+	if (pg_atomic_fetch_add_u32(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u32() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u32(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u32() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u32(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u32() #1 wrong");
+
+	if (pg_atomic_add_fetch_u32(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u32() #0 wrong");
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u32(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u32() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 1000; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u32(&var, &expected, 1))
+			break;
+	}
+	if (i == 1000)
+		elog(ERROR, "atomic_compare_exchange_u32() never succeeded");
+	if (pg_atomic_read_u32(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u32() didn't set value properly");
+
+	pg_atomic_write_u32(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u32(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u32() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u32(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u32() #2 wrong");
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u32()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u32(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #1 wrong");
+
+	if (pg_atomic_fetch_and_u32(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #2 wrong: is %u",
+			 pg_atomic_read_u32(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u32(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #3 wrong");
+}
+#endif /* PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32 */
+
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+static void
+test_atomic_uint64(void)
+{
+	pg_atomic_uint64 var;
+	uint64 expected;
+	int i;
+
+	pg_atomic_init_u64(&var, 0);
+
+	if (pg_atomic_read_u64(&var) != 0)
+		elog(ERROR, "atomic_read_u64() #1 wrong");
+
+	pg_atomic_write_u64(&var, 3);
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "atomic_read_u64() #2 wrong");
+
+	if (pg_atomic_fetch_add_u64(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u64() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u64(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u64() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u64(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u64() #1 wrong");
+
+	if (pg_atomic_add_fetch_u64(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u64() #0 wrong");
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u64(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u64() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 100; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u64(&var, &expected, 1))
+			break;
+	}
+	if (i == 100)
+		elog(ERROR, "atomic_compare_exchange_u64() never succeeded");
+	if (pg_atomic_read_u64(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u64() didn't set value properly");
+
+	pg_atomic_write_u64(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u64(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u64() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u64(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u64() #2 wrong");
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u64()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u64(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #1 wrong");
+
+	if (pg_atomic_fetch_and_u64(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #2 wrong: is " UINT64_FORMAT,
+			 pg_atomic_read_u64(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u64(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #3 wrong");
+}
+#endif /* PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64 */
+
+
+extern Datum test_atomic_ops(PG_FUNCTION_ARGS)
+{
+	test_atomic_flag();
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+	test_atomic_uint32();
+#endif
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+	test_atomic_uint64();
+#endif
+
+	PG_RETURN_BOOL(true);
+}
diff --git a/src/test/regress/sql/lock.sql b/src/test/regress/sql/lock.sql
index dda212f..567e8bc 100644
--- a/src/test/regress/sql/lock.sql
+++ b/src/test/regress/sql/lock.sql
@@ -64,3 +64,8 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+
+
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
-- 
1.8.5.rc1.dirty

#6Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: better atomics - v0.2

On Fri, Nov 15, 2013 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Questions:
* What do you like/dislike about the API (storage/atomics.h)
* decide whether it's ok to rely on inline functions or whether we need
to provide out-of-line versions. I think we should just bite the
bullet and require support. Some very old arches might need to live with
warnings. Patch 01 tries to silence them on HP-UX.
* decide whether we want to keep the semaphore fallback for
spinlocks. I strongly vote for removing it. The patchset provides
compiler based default implementations for atomics, so it's unlikely
to be helpful when bringing up a new platform.

On point #2, I don't personally know of any systems that I care about
where inlining isn't supported. However, we've gone to quite a bit of
trouble relatively recently to keep things working for platforms where
that is the case, so I feel the need for an obligatory disclaimer: I
will not commit any patch that moves the wood in that area unless
there is massive consensus that this is an acceptable way forward.
Similarly for point #3: Heikki not long ago went to the trouble of
unbreaking --disable-spinlocks, which suggests that there is at least
some interest in continuing to be able to run in that mode. I'm
perfectly OK with the decision that we don't care about that any more,
but I will not be the one to make that decision, and I think it
requires a greater-than-normal level of consensus.

On point #1, I dunno. It looks like a lot of rearrangement to me, and
I'm not really sure what the final form of it is intended to be. I
think it desperately needs a README explaining what the point of all
this is and how to add support for a new platform or compiler if yours
doesn't work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: better atomics - v0.2

Robert Haas <robertmhaas@gmail.com> writes:

On point #2, I don't personally know of any systems that I care about
where inlining isn't supported. However, we've gone to quite a bit of
trouble relatively recently to keep things working for platforms where
that is the case, so I feel the need for an obligatory disclaimer: I
will not commit any patch that moves the wood in that area unless
there is massive consensus that this is an acceptable way forward.
Similarly for point #3: Heikki not long ago went to the trouble of
unbreaking --disable-spinlocks, which suggests that there is at least
some interest in continuing to be able to run in that mode. I'm
perfectly OK with the decision that we don't care about that any more,
but I will not be the one to make that decision, and I think it
requires a greater-than-normal level of consensus.

Hm. Now that I think about it, isn't Peter proposing to break systems
without working "inline" over here?
/messages/by-id/1384257026.8059.5.camel@vanquo.pezone.net

Maybe that patch needs a bit more discussion. I tend to agree that
significantly moving our portability goalposts shouldn't be undertaken
lightly. On the other hand, it is 2013, and so it seems not unreasonable
to reconsider choices last made more than a dozen years ago.

Here's an idea that might serve as a useful touchstone for making choices:
compiler inadequacies are easier to fix than hardware/OS inadequacies.
Even a dozen years ago, people could get gcc running on any platform of
interest, and I seriously doubt that's less true today than back then.
So if anyone complains "my compiler doesn't support 'const'", we could
reasonably reply "so install gcc". On the other hand, if the complaint is
"my hardware doesn't support 64-bit CAS", it's not reasonable to tell them
to buy a new server.

regards, tom lane


#8Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#7)
Re: better atomics - v0.2

On 11/19/13, 9:57 AM, Tom Lane wrote:

Hm. Now that I think about it, isn't Peter proposing to break systems
without working "inline" over here?
/messages/by-id/1384257026.8059.5.camel@vanquo.pezone.net

No, that's about const, volatile, #, and memcmp.

I don't have an informed opinion about requiring inline support
(although it would surely be nice).


#9Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#7)
Re: better atomics - v0.2

On 2013-11-19 09:57:20 -0500, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On point #2, I don't personally know of any systems that I care about
where inlining isn't supported. However, we've gone to quite a bit of
trouble relatively recently to keep things working for platforms where
that is the case, so I feel the need for an obligatory disclaimer: I
will not commit any patch that moves the wood in that area unless
there is massive consensus that this is an acceptable way forward.
Similarly for point #3: Heikki not long ago went to the trouble of
unbreaking --disable-spinlocks, which suggests that there is at least
some interest in continuing to be able to run in that mode. I'm
perfectly OK with the decision that we don't care about that any more,
but I will not be the one to make that decision, and I think it
requires a greater-than-normal level of consensus.

Hm. Now that I think about it, isn't Peter proposing to break systems
without working "inline" over here?
/messages/by-id/1384257026.8059.5.camel@vanquo.pezone.net

Hm. Those don't directly seem to affect our inline tests. Our test is
PGAC_C_INLINE which itself uses AC_C_INLINE and then tests that
unreferenced inlines don't generate warnings.
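(Roughly, that probe boils down to compiling and linking something like the
following with warnings promoted to errors, and seeing whether the
unreferenced inline draws a complaint - a paraphrase of the configure test,
not its literal text:)

static inline int
fun(void)
{
	return 0;			/* deliberately never referenced */
}

int
main(void)
{
	return 0;
}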

Maybe that patch needs a bit more discussion. I tend to agree that
significantly moving our portability goalposts shouldn't be undertaken
lightly. On the other hand, it is 2013, and so it seems not unreasonable
to reconsider choices last made more than a dozen years ago.

I don't think there are even platforms out there that don't support const
inlines, just some that unnecessarily generate warnings. But since
builds on those platforms are already littered with warnings, that
doesn't seem like something we need to be majorly concerned with.

The only animal we have that doesn't support quiet inlines today is
HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
to simply suppress the warning there.

Here's an idea that might serve as a useful touchstone for making choices:
compiler inadequacies are easier to fix than hardware/OS inadequacies.
Even a dozen years ago, people could get gcc running on any platform of
interest, and I seriously doubt that's less true today than back then.
So if anyone complains "my compiler doesn't support 'const'", we could
reasonably reply "so install gcc".

Agreed.

On the other hand, if the complaint is
"my hardware doesn't support 64-bit CAS", it's not reasonable to tell them
to buy a new server.

Agreed. I am even wondering about 32bit CAS, since it's not actually
that hard to emulate using spinlocks. Certainly less work than
arguing about removing stone-age platforms ;)

There's no fundamental issue with continuing to support semaphore-based
spinlocks; I am just wondering about the point of supporting that
configuration, since it will always yield horrible performance.

The only fundamental thing that I don't immediately see how we can
support is the spinlock based memory barrier since that introduces a
circularity (atomics need barrier, barrier needs spinlocks, spinlock
needs atomics).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#10Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On point #2, I don't personally know of any systems that I care about
where inlining isn't supported. However, we've gone to quite a bit of
trouble relatively recently to keep things working for platforms where
that is the case, so I feel the need for an obligatory disclaimer: I
will not commit any patch that moves the wood in that area unless
there is massive consensus that this is an acceptable way forward.
Similarly for point #3: Heikki not long ago went to the trouble of
unbreaking --disable-spinlocks, which suggests that there is at least
some interest in continuing to be able to run in that mode. I'm
perfectly OK with the decision that we don't care about that any more,
but I will not be the one to make that decision, and I think it
requires a greater-than-normal level of consensus.

Hm. Now that I think about it, isn't Peter proposing to break systems
without working "inline" over here?
/messages/by-id/1384257026.8059.5.camel@vanquo.pezone.net

Maybe that patch needs a bit more discussion. I tend to agree that
significantly moving our portability goalposts shouldn't be undertaken
lightly. On the other hand, it is 2013, and so it seems not unreasonable
to reconsider choices last made more than a dozen years ago.

Sure. inline isn't C89, and to date, we've been quite rigorous about
enforcing C89-compliance, so I'm going to try not to be the guy that
casually breaks that. On the other hand, I haven't personally seen
any systems that don't have inline in a very long time. The first
systems I did development on were not even C89; they were K&R, and you
had to use old-style function declarations. It's probably safe to say
that they didn't support inline, either, though I don't recall trying
it. But the last time I used a system like that was probably in the
early nineties, and it's been a long time since then. However, I
upgrade my hardware about every 2 years, so I'm not the best guy to
say what the oldest stuff people still care about is.

Here's an idea that might serve as a useful touchstone for making choices:
compiler inadequacies are easier to fix than hardware/OS inadequacies.
Even a dozen years ago, people could get gcc running on any platform of
interest, and I seriously doubt that's less true today than back then.
So if anyone complains "my compiler doesn't support 'const'", we could
reasonably reply "so install gcc". On the other hand, if the complaint is
"my hardware doesn't support 64-bit CAS", it's not reasonable to tell them
to buy a new server.

I generally agree with that principle, but I'd offer the caveat that
some people do still have valid reasons for wanting to use non-GCC
compilers. I have been told that HP's aCC produces better results on
Itanium than gcc, for example. Still, I think we could probably make
a list of no more than half-a-dozen compilers that we really care
about - the popular open source ones (is there anything we need to
care about other than gcc and clang?) and the ones supported by large
vendors (IBM's ICC and HP's aCC being the ones that I know about) and
adopt features that are supported by everything on that list.

Another point is that, some releases ago, we began requiring working
64-bit arithmetic as per commit
d15cb38dec01fcc8d882706a3fa493517597c76b. We might want to go through
our list of nominally-supported platforms and see whether any of them
would fail to compile for that reason. Based on the commit message
for the aforementioned commit, such platforms haven't been viable for
PostgreSQL since 8.3, so we're not losing anything by canning them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#9)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 10:16 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On the other hand, if the complaint is
"my hardware doesn't support 64-bit CAS", it's not reasonable to tell them
to buy a new server.

Agreed. I am even wondering about 32bit CAS, since it's not actually
that hard to emulate using spinlocks. Certainly less work than
arguing about removing stone-age platforms ;)

There's no fundamental issue with continuing to support semaphore-based
spinlocks; I am just wondering about the point of supporting that
configuration, since it will always yield horrible performance.

The only fundamental thing that I don't immediately see how we can
support is the spinlock based memory barrier since that introduces a
circularity (atomics need barrier, barrier needs spinlocks, spinlock
needs atomics).

We've been pretty much assuming for a long time that calling a
function in another translation unit acts as a compiler barrier.
There's a lot of code that isn't actually safe against global
optimization; we assume, for example, that memory accesses can't move
over an LWLockAcquire(), but that's just using spinlocks internally,
and those aren't guaranteed to be compiler barriers, per previous
discussion. So one idea for a compiler barrier is just to define a
function call pg_compiler_barrier() in a file by itself, and make that
the fallback implementation. That will of course fail if someone uses
a globally optimizing compiler, but I think it'd be OK to say that if
you want to do that, you'd better have a real barrier implementation.
Right now, it's probably unsafe regardless.
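
A minimal sketch of that fallback might look like this (the file and function
names here are made up, not taken from any patch; the point is only that the
call crosses a translation-unit boundary):

/* pg_compiler_barrier_fallback.c -- deliberately compiled as its own file */
void
pg_extern_compiler_barrier(void)
{
	/* empty; the out-of-line call itself is what stops reordering */
}

/* and, where no compiler intrinsic is available: */
extern void pg_extern_compiler_barrier(void);
#define pg_compiler_barrier()	pg_extern_compiler_barrier()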

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#6)
Re: better atomics - v0.2

On 2013-11-19 09:12:42 -0500, Robert Haas wrote:

On point #1, I dunno. It looks like a lot of rearrangement to me, and
I'm not really sure what the final form of it is intended to be.

Understandable. I am not that sure what parts we want to rearrange
either. We very well could leave barrier.h and s_lock.h untouched and
just maintain the atomics stuff in parallel and switch over at some
later point.
I've mainly switched over s_lock.h to a) get some good coverage of the
atomics code b) make sure the api works fine for it. We could add the
atomics.h based implementation as a fallback for the cases in which no
native implementation is available.

I think it desperately needs a README explaining what the point of all
this is and how to add support for a new platform or compiler if yours
doesn't work.

Ok, I'll work on that. It would be useful to get feedback on the
"frontend" API in atomics.h though. If people are unhappy with the API
it might very well change how we can implement the API on different
architectures.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 10:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-11-19 09:12:42 -0500, Robert Haas wrote:

On point #1, I dunno. It looks like a lot of rearrangement to me, and
I'm not really sure what the final form of it is intended to be.

Understandable. I am not that sure what parts we want to rearrange
either. We very well could leave barrier.h and s_lock.h untouched and
just maintain the atomics stuff in parallel and switch over at some
later point.
I've mainly switched over s_lock.h to a) get some good coverage of the
atomics code b) make sure the api works fine for it. We could add the
atomics.h based implementation as a fallback for the cases in which no
native implementation is available.

I think it desperately needs a README explaining what the point of all
this is and how to add support for a new platform or compiler if yours
doesn't work.

Ok, I'll work on that. It would be useful to get feedback on the
"frontend" API in atomics.h though. If people are unhappy with the API
it might very well change how we can implement the API on different
architectures.

Sure, but if people can't understand the API, then it's hard to decide
whether to be unhappy with it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#8)
Re: better atomics - v0.2

Peter Eisentraut <peter_e@gmx.net> writes:

On 11/19/13, 9:57 AM, Tom Lane wrote:

Hm. Now that I think about it, isn't Peter proposing to break systems
without working "inline" over here?
/messages/by-id/1384257026.8059.5.camel@vanquo.pezone.net

No, that's about const, volatile, #, and memcmp.

Ah, sorry, not enough caffeine absorbed yet. Still, we should stop
to think about whether this represents an undesirable move of the
portability goalposts. The first three of these are certainly
compiler issues, and I personally don't have a problem with blowing
off compilers that still haven't managed to implement all of C89 :-(.
I'm not clear on which systems had the memcmp issue --- do we have
the full story on that?

I don't have an informed opinion about requiring inline support
(although it would surely be nice).

inline is C99, and we've generally resisted requiring C99 features.
Maybe it's time to move that goalpost, and maybe not.

regards, tom lane


#15Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#11)
Re: better atomics - v0.2

On 2013-11-19 10:23:57 -0500, Robert Haas wrote:

The only fundamental thing that I don't immediately see how we can
support is the spinlock based memory barrier since that introduces a
circularity (atomics need barrier, barrier needs spinlocks, spinlock
needs atomics).

We've been pretty much assuming for a long time that calling a
function in another translation unit acts as a compiler barrier.
There's a lot of code that isn't actually safe against global
optimization; we assume, for example, that memory accesses can't move
over an LWLockAcquire(), but that's just using spinlocks internally,
and those aren't guaranteed to be compiler barriers, per previous
discussion. So one idea for a compiler barrier is just to define a
function call pg_compiler_barrier() in a file by itself, and make that
the fallback implementation. That will of course fail if someone uses
a globally optimizing compiler, but I think it'd be OK to say that if
you want to do that, you'd better have a real barrier implementation.

That works for compiler, but not for memory barriers :/

Right now, it's probably unsafe regardless.

Yes, I have pretty little trust in the current state, both from the
infrastructure perspective (spinlocks, barriers) and from individual
pieces of code. To a good part we're probably primarily protected by
x86's black magic and the fact that everyone with sufficient concurrency
to see problems uses x86.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#16Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#15)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-11-19 10:23:57 -0500, Robert Haas wrote:

The only fundamental thing that I don't immediately see how we can
support is the spinlock based memory barrier since that introduces a
circularity (atomics need barrier, barrier needs spinlocks, spinlock
needs atomics).

We've been pretty much assuming for a long time that calling a
function in another translation unit acts as a compiler barrier.
There's a lot of code that isn't actually safe against global
optimization; we assume, for example, that memory accesses can't move
over an LWLockAcquire(), but that's just using spinlocks internally,
and those aren't guaranteed to be compiler barriers, per previous
discussion. So one idea for a compiler barrier is just to define a
function call pg_compiler_barrier() in a file by itself, and make that
the fallback implementation. That will of course fail if someone uses
a globally optimizing compiler, but I think it'd be OK to say that if
you want to do that, you'd better have a real barrier implementation.

That works for compiler, but not for memory barriers :/

True, but we already assume that a spinlock is a memory barrier minus
a compiler barrier. So if you have a working compiler barrier, you
ought to be able to fix spinlocks to be memory barriers. And then, if
you need a memory barrier for some other purpose, you can always fall
back to acquiring and releasing a spinlock.

Maybe that's too contorted.
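
Sketched out, assuming S_LOCK()/S_UNLOCK() from s_lock.h and a dedicated dummy
spinlock (illustrative only, not a tested implementation):

extern slock_t dummy_spinlock;

/* fallback: a full memory barrier built from an acquire/release pair */
#define pg_memory_barrier() \
	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)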

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#14)
Re: better atomics - v0.2

On 2013-11-19 10:30:24 -0500, Tom Lane wrote:

I don't have an informed opinion about requiring inline support
(although it would surely be nice).

inline is C99, and we've generally resisted requiring C99 features.
Maybe it's time to move that goalpost, and maybe not.

But it's a part of C99 that was very widely implemented before, so even
if we don't want to rely on C99 in its entirety, relying on inline
support is realistic.

I think, independent of atomics, the readability & maintainability win
from relying on inline functions instead of long macros (potentially with
multiple-evaluation hazards) or contortions like ILIST_INCLUDE_DEFINITIONS is
significant.
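
For illustration, the classic multiple-evaluation hazard (a generic example,
not code from the tree):

/* the argument expressions may be evaluated more than once */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* an inline function evaluates each argument exactly once */
static inline int
max_inline(int a, int b)
{
	return (a > b) ? a : b;
}

/* MAX_MACRO(i++, j) can increment i twice; max_inline(i++, j) cannot */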

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#9)
Re: better atomics - v0.2

Andres Freund <andres@2ndquadrant.com> writes:

The only animal we have that doesn't support quiet inlines today is
HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
to simply suppress the warning there.

Or just not worry about it, if it's only a warning? Or does the warning
mean code bloat (lots of useless copies of the inline function)?

regards, tom lane


#19Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: better atomics - v0.2

On 2013-11-19 10:37:35 -0500, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

The only animal we have that doesn't support quiet inlines today is
HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
to simply suppress the warning there.

Or just not worry about it, if it's only a warning? Or does the warning
mean code bloat (lots of useless copies of the inline function)?

I honestly have no idea whether it causes code bloat - I'd be surprised
if it did since it detects that they are unused, but I cannot rule it
out entirely.
The suggested patch - untested since I have no access to HP-UX - just
adds +W2177 to the compiler's command line in template/hpux, which
supposedly suppresses that warning.

I think removing the quiet inline test is a good idea, but that doesn't
preclude fixing the warnings at the same time.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#20Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#13)
4 attachment(s)
Re: better atomics - v0.2

Hi,

On 2013-11-19 10:27:57 -0500, Robert Haas wrote:

Ok, I'll work on that. It would be useful to get feedback on the
"frontend" API in atomics.h though. If people are unhappy with the API
it might very well change how we can implement the API on different
architectures.

Sure, but if people can't understand the API, then it's hard to decide
whether to be unhappy with it.

Fair point. While there's already an explanatory blurb on top of atomics.h
and the semantics of the functions should all be documented, it doesn't
explain the API at a high level. I've expanded it a bit.

The attached patches compile and pass make check successfully on Linux x86 and
amd64, and with MSVC on x86 and amd64. This time an updated configure is
included. The majority of operations still rely on CAS emulation.

I've copied the introduction here for easier reading:

/*-------------------------------------------------------------------------
*
* atomics.h
* Generic atomic operations support.
*
* Hardware and compiler dependent functions for manipulating memory
* atomically and dealing with cache coherency. Used to implement locking
* facilities and other concurrency safe constructs.
*
* There's three data types which can be manipulated atomically:
*
* * pg_atomic_flag - supports atomic test/clear operations, useful for
* implementing spinlocks and similar constructs.
*
* * pg_atomic_uint32 - unsigned 32bit integer that can be correctly manipulated
* by several processes at the same time.
*
* * pg_atomic_uint64 - optional 64bit variant of pg_atomic_uint32. Support
* can be tested by checking for PG_HAVE_64_BIT_ATOMICS.
*
* The values stored in these types cannot be accessed except through the API
* provided in this file, i.e. it is not possible to write statements like `var +=
* 10`. Instead, for this example, one would have to use
* `pg_atomic_fetch_add_u32(&var, 10)`.
*
* This restriction has several reasons: Primarily it prevents non-atomic
* math from being performed on these values, which wouldn't be concurrency
* safe. Secondly it allows these types to be implemented on top of facilities -
* like C11's atomics - that don't provide direct access. Lastly it serves as
* a useful hint as to which code should be looked at more carefully because of
* concurrency concerns.
*
* To be useful they usually will need to be placed in memory shared between
* processes or threads, most frequently by embedding them in structs. Be
* careful to align atomic variables to their own size!
*
* Before variables of one of these types can be used they need to be
* initialized using the type's initialization function (pg_atomic_init_flag,
* pg_atomic_init_u32 and pg_atomic_init_u64) or in case of global
* pg_atomic_flag variables using the PG_ATOMIC_INIT_FLAG macro. These
* initializations have to be performed before any (possibly concurrent)
* access is possible, otherwise the results are undefined.
*
* For several mathematical operations two variants exist: One that returns
* the old value before the operation was performed, and one that returns
* the new value. *_fetch_<op> variants will return the old value,
* *_<op>_fetch the new one.
*
*
* To bring up postgres on a platform/compiler at the very least
* either one of
* * pg_atomic_test_set_flag(), pg_atomic_init_flag(), pg_atomic_clear_flag()
* * pg_atomic_compare_exchange_u32()
* * pg_atomic_exchange_u32()
* and all of
* * pg_compiler_barrier(), pg_memory_barrier()
* need to be implemented. There exist generic, hardware independent,
* implementations for several compilers which might be sufficient, although
* possibly not optimal, for a new platform.
*
* To add support for a new architecture, review whether the compilers used on
* that platform already provide adequate support in the files included in the
* "Compiler specific" section below. If not, either add an include for a new
* file adding generic support for your compiler there and/or add/update an
* architecture specific include in the "Architecture specific" section below.
*
* Operations that aren't implemented by either architecture or compiler
* specific code are implemented on top of compare_exchange() loops or
* spinlocks.
*
* Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
* whenever possible. Writing correct code using these facilities is hard.
*
* For an introduction to using memory barriers within the PostgreSQL backend,
* see src/backend/storage/lmgr/README.barrier
*
* Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
*
* src/include/storage/atomics.h
*
*-------------------------------------------------------------------------
*/
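
To make the documented usage pattern concrete, here is a minimal sketch; the
struct and helper names below are invented for illustration, only the
pg_atomic_* calls come from the proposed API:

#include "storage/atomics.h"

typedef struct SharedCounter
{
	pg_atomic_uint32 nrequests;	/* aligned to its own size inside the struct */
} SharedCounter;

static void
counter_init(SharedCounter *c)
{
	/* must run before any (possibly concurrent) access to the variable */
	pg_atomic_init_u32(&c->nrequests, 0);
}

static uint32
counter_bump(SharedCounter *c)
{
	/*
	 * "c->nrequests += 1" is intentionally impossible.  The *_fetch_add
	 * variant returns the value before the addition; *_add_fetch would
	 * return the value afterwards.
	 */
	return pg_atomic_fetch_add_u32(&c->nrequests, 1);
}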

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0003-Convert-the-PGPROC-lwWaitLink-list-into-a-dlist-inst.patch (text/x-patch; charset=us-ascii)
>From e52c608e377fbbb80ffe90d3225142d7b981e740 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 18 Nov 2013 14:45:11 +0100
Subject: [PATCH 3/4] Convert the PGPROC->lwWaitLink list into a dlist instead
 of open coding it.

Besides being shorter and much easier to read it:
* fixes a bug in xlog.c:WakeupWaiters() accidentally waking up
  exclusive lockers
* changes the logic in LWLockRelease() to release all shared lockers
  when waking up any.
* adds a pg_write_barrier() memory barrier in the wakeup paths between
  dequeuing and unsetting ->lwWaiting. On a weakly ordered machine the
  current coding could lead to problems.
---
 src/backend/access/transam/twophase.c |   1 -
 src/backend/access/transam/xlog.c     | 127 +++++++++++++---------------------
 src/backend/storage/lmgr/lwlock.c     | 106 +++++++++++-----------------
 src/backend/storage/lmgr/proc.c       |   2 -
 src/include/storage/proc.h            |   3 +-
 5 files changed, 90 insertions(+), 149 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e975f8d..7aeb6e8 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -326,7 +326,6 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	proc->roleId = owner;
 	proc->lwWaiting = false;
 	proc->lwWaitMode = 0;
-	proc->lwWaitLink = NULL;
 	proc->waitLock = NULL;
 	proc->waitProcLock = NULL;
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a95149b..ea2fb93 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -393,8 +393,7 @@ typedef struct
 
 	bool		releaseOK;		/* T if ok to release waiters */
 	char		exclusive;		/* # of exclusive holders (0 or 1) */
-	PGPROC	   *head;			/* head of list of waiting PGPROCs */
-	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
+	dlist_head	waiters;		/* list of waiting PGPROCs */
 	/* tail is undefined when head is NULL */
 } XLogInsertSlot;
 
@@ -1626,12 +1625,7 @@ WALInsertSlotAcquireOne(int slotno)
 		 */
 		proc->lwWaiting = true;
 		proc->lwWaitMode = LW_EXCLUSIVE;
-		proc->lwWaitLink = NULL;
-		if (slot->head == NULL)
-			slot->head = proc;
-		else
-			slot->tail->lwWaitLink = proc;
-		slot->tail = proc;
+		dlist_push_tail((dlist_head *) &slot->waiters, &proc->lwWaitLink);
 
 		/* Can release the mutex now */
 		SpinLockRelease(&slot->mutex);
@@ -1746,13 +1740,8 @@ WaitOnSlot(volatile XLogInsertSlot *slot, XLogRecPtr waitptr)
 		 */
 		proc->lwWaiting = true;
 		proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
-		proc->lwWaitLink = NULL;
-
 		/* waiters are added to the front of the queue */
-		proc->lwWaitLink = slot->head;
-		if (slot->head == NULL)
-			slot->tail = proc;
-		slot->head = proc;
+		dlist_push_head((dlist_head *) &slot->waiters, &proc->lwWaitLink);
 
 		/* Can release the mutex now */
 		SpinLockRelease(&slot->mutex);
@@ -1804,9 +1793,8 @@ static void
 WakeupWaiters(XLogRecPtr EndPos)
 {
 	volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
-	PGPROC	   *head;
-	PGPROC	   *proc;
-	PGPROC	   *next;
+	dlist_head	wakeup;
+	dlist_mutable_iter iter;
 
 	/*
 	 * If we have already reported progress up to the same point, do nothing.
@@ -1818,6 +1806,8 @@ WakeupWaiters(XLogRecPtr EndPos)
 	/* xlogInsertingAt should not go backwards */
 	Assert(slot->xlogInsertingAt < EndPos);
 
+	dlist_init(&wakeup);
+
 	/* Acquire mutex.  Time spent holding mutex should be short! */
 	SpinLockAcquire(&slot->mutex);
 
@@ -1827,25 +1817,19 @@ WakeupWaiters(XLogRecPtr EndPos)
 	slot->xlogInsertingAt = EndPos;
 
 	/*
-	 * See if there are any waiters that need to be woken up.
+	 * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
+	 * up.
 	 */
-	head = slot->head;
-
-	if (head != NULL)
+	dlist_foreach_modify(iter, (dlist_head *) &slot->waiters)
 	{
-		proc = head;
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
 
 		/* LW_WAIT_UNTIL_FREE waiters are always in the front of the queue */
-		next = proc->lwWaitLink;
-		while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
-		{
-			proc = next;
-			next = next->lwWaitLink;
-		}
+		if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
+			break;
 
-		/* proc is now the last PGPROC to be released */
-		slot->head = next;
-		proc->lwWaitLink = NULL;
+		dlist_delete(&waiter->lwWaitLink);
+		dlist_push_tail(&wakeup, &waiter->lwWaitLink);
 	}
 
 	/* We are done updating shared state of the lock itself. */
@@ -1854,13 +1838,13 @@ WakeupWaiters(XLogRecPtr EndPos)
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
-	while (head != NULL)
+	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
 	{
-		proc = head;
-		head = proc->lwWaitLink;
-		proc->lwWaitLink = NULL;
-		proc->lwWaiting = false;
-		PGSemaphoreUnlock(&proc->sem);
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
+		dlist_delete(&waiter->lwWaitLink);
+		pg_write_barrier();
+		waiter->lwWaiting = false;
+		PGSemaphoreUnlock(&waiter->sem);
 	}
 }
 
@@ -1886,8 +1870,11 @@ static void
 WALInsertSlotReleaseOne(int slotno)
 {
 	volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[slotno].slot;
-	PGPROC	   *head;
-	PGPROC	   *proc;
+	dlist_head	wakeup;
+	dlist_mutable_iter iter;
+	bool		releaseOK = true;
+
+	dlist_init(&wakeup);
 
 	/* Acquire mutex.  Time spent holding mutex should be short! */
 	SpinLockAcquire(&slot->mutex);
@@ -1904,43 +1891,28 @@ WALInsertSlotReleaseOne(int slotno)
 	/*
 	 * See if I need to awaken any waiters..
 	 */
-	head = slot->head;
-	if (head != NULL)
+	dlist_foreach_modify(iter, (dlist_head *) &slot->waiters)
 	{
-		if (slot->releaseOK)
-		{
-			/*
-			 * Remove the to-be-awakened PGPROCs from the queue.
-			 */
-			bool		releaseOK = true;
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
 
-			proc = head;
+		dlist_delete(&waiter->lwWaitLink);
+		dlist_push_tail(&wakeup, &waiter->lwWaitLink);
 
-			/*
-			 * First wake up any backends that want to be woken up without
-			 * acquiring the lock. These are always in the front of the queue.
-			 */
-			while (proc->lwWaitMode == LW_WAIT_UNTIL_FREE && proc->lwWaitLink)
-				proc = proc->lwWaitLink;
-
-			/*
-			 * Awaken the first exclusive-waiter, if any.
-			 */
-			if (proc->lwWaitLink)
-			{
-				Assert(proc->lwWaitLink->lwWaitMode == LW_EXCLUSIVE);
-				proc = proc->lwWaitLink;
-				releaseOK = false;
-			}
-			/* proc is now the last PGPROC to be released */
-			slot->head = proc->lwWaitLink;
-			proc->lwWaitLink = NULL;
+		/*
+		 * First wake up any backends that want to be woken up without
+		 * acquiring the lock. These are always in the front of the
+		 * queue. Then wakeup the first exclusive-waiter, if any.
+		 */
 
-			slot->releaseOK = releaseOK;
+		/* LW_WAIT_UNTIL_FREE waiters are always in the front of the queue */
+		if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
+		{
+			Assert(waiter->lwWaitMode == LW_EXCLUSIVE);
+			releaseOK = false;
+			break;
 		}
-		else
-			head = NULL;
 	}
+	slot->releaseOK = releaseOK;
 
 	/* We are done updating shared state of the slot itself. */
 	SpinLockRelease(&slot->mutex);
@@ -1948,13 +1920,13 @@ WALInsertSlotReleaseOne(int slotno)
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
-	while (head != NULL)
+	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
 	{
-		proc = head;
-		head = proc->lwWaitLink;
-		proc->lwWaitLink = NULL;
-		proc->lwWaiting = false;
-		PGSemaphoreUnlock(&proc->sem);
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
+		dlist_delete(&waiter->lwWaitLink);
+		pg_write_barrier();
+		waiter->lwWaiting = false;
+		PGSemaphoreUnlock(&waiter->sem);
 	}
 
 	/*
@@ -5104,8 +5076,7 @@ XLOGShmemInit(void)
 
 		slot->releaseOK = true;
 		slot->exclusive = 0;
-		slot->head = NULL;
-		slot->tail = NULL;
+		dlist_init(&slot->waiters);
 	}
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4f88d3f..732a5d2 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -25,6 +25,7 @@
 #include "access/multixact.h"
 #include "access/subtrans.h"
 #include "commands/async.h"
+#include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "storage/ipc.h"
@@ -43,8 +44,7 @@ typedef struct LWLock
 	bool		releaseOK;		/* T if ok to release waiters */
 	char		exclusive;		/* # of exclusive holders (0 or 1) */
 	int			shared;			/* # of shared holders (0..MaxBackends) */
-	PGPROC	   *head;			/* head of list of waiting PGPROCs */
-	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
+	dlist_head	waiters;
 	/* tail is undefined when head is NULL */
 } LWLock;
 
@@ -105,9 +105,9 @@ inline static void
 PRINT_LWDEBUG(const char *where, LWLockId lockid, const volatile LWLock *lock)
 {
 	if (Trace_lwlocks)
-		elog(LOG, "%s(%d): excl %d shared %d head %p rOK %d",
+		elog(LOG, "%s(%d): excl %d shared %d rOK %d",
 			 where, (int) lockid,
-			 (int) lock->exclusive, lock->shared, lock->head,
+			 (int) lock->exclusive, lock->shared,
 			 (int) lock->releaseOK);
 }
 
@@ -287,8 +287,7 @@ CreateLWLocks(void)
 		lock->lock.releaseOK = true;
 		lock->lock.exclusive = 0;
 		lock->lock.shared = 0;
-		lock->lock.head = NULL;
-		lock->lock.tail = NULL;
+		dlist_init(&lock->lock.waiters);
 	}
 
 	/*
@@ -444,12 +443,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 
 		proc->lwWaiting = true;
 		proc->lwWaitMode = mode;
-		proc->lwWaitLink = NULL;
-		if (lock->head == NULL)
-			lock->head = proc;
-		else
-			lock->tail->lwWaitLink = proc;
-		lock->tail = proc;
+		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
 
 		/* Can release the mutex now */
 		SpinLockRelease(&lock->mutex);
@@ -657,12 +651,7 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
 
 		proc->lwWaiting = true;
 		proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
-		proc->lwWaitLink = NULL;
-		if (lock->head == NULL)
-			lock->head = proc;
-		else
-			lock->tail->lwWaitLink = proc;
-		lock->tail = proc;
+		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
 
 		/* Can release the mutex now */
 		SpinLockRelease(&lock->mutex);
@@ -728,10 +717,12 @@ void
 LWLockRelease(LWLockId lockid)
 {
 	volatile LWLock *lock = &(LWLockArray[lockid].lock);
-	PGPROC	   *head;
-	PGPROC	   *proc;
+	dlist_head	wakeup;
+	dlist_mutable_iter iter;
 	int			i;
 
+	dlist_init(&wakeup);
+
 	PRINT_LWDEBUG("LWLockRelease", lockid, lock);
 
 	/*
@@ -767,58 +758,39 @@ LWLockRelease(LWLockId lockid)
 	 * if someone has already awakened waiters that haven't yet acquired the
 	 * lock.
 	 */
-	head = lock->head;
-	if (head != NULL)
+	if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
 	{
-		if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
+		/*
+		 * Remove the to-be-awakened PGPROCs from the queue.
+		 */
+		bool		releaseOK = true;
+		bool		wokeup_somebody = false;
+
+		dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
 		{
-			/*
-			 * Remove the to-be-awakened PGPROCs from the queue.
-			 */
-			bool		releaseOK = true;
+			PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
 
-			proc = head;
+			if (wokeup_somebody && waiter->lwWaitMode == LW_EXCLUSIVE)
+				continue;
 
-			/*
-			 * First wake up any backends that want to be woken up without
-			 * acquiring the lock.
-			 */
-			while (proc->lwWaitMode == LW_WAIT_UNTIL_FREE && proc->lwWaitLink)
-				proc = proc->lwWaitLink;
+			dlist_delete(&waiter->lwWaitLink);
+			dlist_push_tail(&wakeup, &waiter->lwWaitLink);
 
 			/*
-			 * If the front waiter wants exclusive lock, awaken him only.
-			 * Otherwise awaken as many waiters as want shared access.
+			 * Prevent additional wakeups until retryer gets to
+			 * run. Backends that are just waiting for the lock to become
+			 * free don't retry automatically.
 			 */
-			if (proc->lwWaitMode != LW_EXCLUSIVE)
+			if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
 			{
-				while (proc->lwWaitLink != NULL &&
-					   proc->lwWaitLink->lwWaitMode != LW_EXCLUSIVE)
-				{
-					if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
-						releaseOK = false;
-					proc = proc->lwWaitLink;
-				}
-			}
-			/* proc is now the last PGPROC to be released */
-			lock->head = proc->lwWaitLink;
-			proc->lwWaitLink = NULL;
-
-			/*
-			 * Prevent additional wakeups until retryer gets to run. Backends
-			 * that are just waiting for the lock to become free don't retry
-			 * automatically.
-			 */
-			if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
 				releaseOK = false;
+				wokeup_somebody = true;
+			}
 
-			lock->releaseOK = releaseOK;
-		}
-		else
-		{
-			/* lock is still held, can't awaken anything */
-			head = NULL;
+			if(waiter->lwWaitMode == LW_EXCLUSIVE)
+				break;
 		}
+		lock->releaseOK = releaseOK;
 	}
 
 	/* We are done updating shared state of the lock itself. */
@@ -829,14 +801,14 @@ LWLockRelease(LWLockId lockid)
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
-	while (head != NULL)
+	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
 	{
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
 		LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
-		proc = head;
-		head = proc->lwWaitLink;
-		proc->lwWaitLink = NULL;
-		proc->lwWaiting = false;
-		PGSemaphoreUnlock(&proc->sem);
+		dlist_delete(&waiter->lwWaitLink);
+		pg_write_barrier();
+		waiter->lwWaiting = false;
+		PGSemaphoreUnlock(&waiter->sem);
 	}
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 222251d..75bddcd 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -370,7 +370,6 @@ InitProcess(void)
 		MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
-	MyProc->lwWaitLink = NULL;
 	MyProc->waitLock = NULL;
 	MyProc->waitProcLock = NULL;
 #ifdef USE_ASSERT_CHECKING
@@ -533,7 +532,6 @@ InitAuxiliaryProcess(void)
 	MyPgXact->vacuumFlags = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
-	MyProc->lwWaitLink = NULL;
 	MyProc->waitLock = NULL;
 	MyProc->waitProcLock = NULL;
 #ifdef USE_ASSERT_CHECKING
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3b04d3c..73cea1c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -15,6 +15,7 @@
 #define _PROC_H_
 
 #include "access/xlogdefs.h"
+#include "lib/ilist.h"
 #include "storage/latch.h"
 #include "storage/lock.h"
 #include "storage/pg_sema.h"
@@ -101,7 +102,7 @@ struct PGPROC
 	/* Info about LWLock the process is currently waiting for, if any. */
 	bool		lwWaiting;		/* true if waiting for an LW lock */
 	uint8		lwWaitMode;		/* lwlock mode being waited for */
-	struct PGPROC *lwWaitLink;	/* next waiter for same LW lock */
+	dlist_node	lwWaitLink;		/* next waiter for same LW lock */
 
 	/* Info about lock the process is currently waiting for, if any. */
 	/* waitLock and waitProcLock are NULL if not currently waiting. */
-- 
1.8.3.251.g1462b67

0004-Wait-free-LW_SHARED-lwlock-acquiration.patch (text/x-patch; charset=us-ascii)
>From 47c024c1f7c2cba5a2941a72715be05b8b8c02d6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 18 Nov 2013 14:45:11 +0100
Subject: [PATCH 4/4] Wait free LW_SHARED lwlock acquiration

---
 src/backend/storage/lmgr/lwlock.c | 754 ++++++++++++++++++++++++++------------
 1 file changed, 523 insertions(+), 231 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 732a5d2..fe8858f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -17,6 +17,82 @@
  * IDENTIFICATION
  *	  src/backend/storage/lmgr/lwlock.c
  *
+ * NOTES:
+ *
+ * This used to be a pretty straightforward reader-writer lock
+ * implementation, in which the internal state was protected by a
+ * spinlock. Unfortunately the overhead of taking the spinlock proved to be
+ * too high for workloads/locks that were locked in shared mode very
+ * frequently.
+ * Thus a new implementation was devised that provides wait-free shared lock
+ * acquisition for locks that aren't exclusively locked.
+ *
+ * The basic idea is to have a single atomic variable 'lockcount' instead of
+ * the formerly separate shared and exclusive counters and to use an atomic
+ * increment to acquire the lock. That's fairly easy to do for rw-spinlocks,
+ * but a lot harder for something like LWLocks that want to wait in the OS.
+ *
+ * For exclusive lock acquisition we use an atomic compare-and-exchange on the
+ * lockcount variable swapping in EXCLUSIVE_LOCK/1<<31/0x80000000 if and only
+ * if the current value of lockcount is 0. If the swap was not successful, we
+ * have to wait.
+ *
+ * For shared lock acquisition we use an atomic add (lock xadd) to the
+ * lockcount variable to add 1. If the value is bigger than EXCLUSIVE_LOCK we
+ * know that somebody actually has an exclusive lock, and we back out by
+ * atomically decrementing by 1 again. If so, we have to wait for the exclusive
+ * locker to release the lock.
+ *
+ * To release the lock we use an atomic decrement. If the
+ * new value is zero (we get that atomically), we know we have to release
+ * waiters.
+ *
+ * The attentive reader might have noticed that naively doing the
+ * above has two glaring race conditions:
+ *
+ * 1) too-quick-for-queueing: We try to lock using the atomic operations and
+ * notice that we have to wait. Unfortunately until we have finished queuing,
+ * the former locker very well might have already finished it's work. That's
+ * problematic because we're now stuck waiting inside the OS.
+ *
+ * 2) spurious failed locks: Due to the logic of backing out of shared
+ * locks after we unconditionally added a 1 to lockcount, we might have
+ * prevented another exclusive locker from getting the lock:
+ *   1) Session A: LWLockAcquire(LW_EXCLUSIVE) - success
+ *   2) Session B: LWLockAcquire(LW_SHARED) - lockcount += 1
+ *   3) Session B: LWLockAcquire(LW_SHARED) - oops, bigger than EXCLUSIVE_LOCK
+ *   4) Session A: LWLockRelease()
+ *   5) Session C: LWLockAcquire(LW_EXCLUSIVE) - check if lockcount = 0, no. wait.
+ *   6) Session B: LWLockAcquire(LW_SHARED) - lockcount -= 1
+ *   7) Session B: LWLockAcquire(LW_SHARED) - wait
+ *
+ * So we'd now have both B) and C) waiting on a lock that nobody is holding
+ * anymore. Not good.
+ *
+ * To mitigate those races we use a two phased attempt at locking:
+ *   Phase 1: Try to do it atomically; if we succeed, nice
+ *   Phase 2: Add ourselves to the wait queue of the lock
+ *   Phase 3: Try to grab the lock again, if we succeed, remove ourselves from
+ *            the queue
+ *   Phase 4: Sleep till wakeup, goto Phase 1
+ *
+ * This protects us against both problems from above:
+ * 1) Nobody can release too quickly before we're queued, since after Phase 2 we're
+ *    already queued.
+ * 2) If somebody spuriously got blocked from acquiring the lock, they will
+ *    get queued in Phase 2 and we can wake them up if necessary or they will
+ *    have gotten the lock in Phase 2.
+ *
+ * The above algorithm only works for LWLockAcquire, not directly for
+ * LWLockConditionalAcquire, where we don't want to wait. In that case we just
+ * need to retry acquiring the lock until we're sure we didn't disturb anybody
+ * in doing so.
+ *
+ * TODO:
+ * - decide if we need a spinlock fallback
+ * - expand documentation
+ * - make LWLOCK_STATS do something sensible again
+ * - make LOCK_DEBUG output nicer
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -28,6 +104,7 @@
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "storage/atomics.h"
 #include "storage/ipc.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -37,15 +114,20 @@
 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
 
+#define EXCLUSIVE_LOCK ((uint32) 1 << 31)
+/* must be greater than MAX_BACKENDS */
+#define SHARED_LOCK_MASK (~EXCLUSIVE_LOCK)
 
 typedef struct LWLock
 {
-	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
-	bool		releaseOK;		/* T if ok to release waiters */
-	char		exclusive;		/* # of exclusive holders (0 or 1) */
-	int			shared;			/* # of shared holders (0..MaxBackends) */
-	dlist_head	waiters;
-	/* tail is undefined when head is NULL */
+	slock_t		mutex;		/* Protects LWLock and queue of PGPROCs */
+	bool		releaseOK;	/* T if ok to release waiters */
+	dlist_head	waiters;	/* list of waiters */
+	pg_atomic_uint32 lockcount;	/* state of exlusive/nonexclusive lockers */
+	pg_atomic_uint32 nwaiters; /* number of waiters */
+#ifdef LWLOCK_DEBUG
+	PGPROC	   *owner;
+#endif
 } LWLock;
 
 /*
@@ -60,7 +142,7 @@ typedef struct LWLock
  * LWLock is between 16 and 32 bytes on all known platforms, so these two
  * cases are sufficient.
  */
-#define LWLOCK_PADDED_SIZE	(sizeof(LWLock) <= 16 ? 16 : 32)
+#define LWLOCK_PADDED_SIZE	64
 
 typedef union LWLockPadded
 {
@@ -75,7 +157,6 @@ typedef union LWLockPadded
  */
 NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
 
-
 /*
  * We use this structure to keep track of locked LWLocks for release
  * during error recovery.  The maximum size could be determined at runtime
@@ -84,8 +165,14 @@ NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
  */
 #define MAX_SIMUL_LWLOCKS	100
 
+typedef struct LWLockHandle
+{
+	LWLockId lock;
+	LWLockMode mode;
+} LWLockHandle;
+
 static int	num_held_lwlocks = 0;
-static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
+static LWLockHandle held_lwlocks[MAX_SIMUL_LWLOCKS];
 
 static int	lock_addin_request = 0;
 static bool lock_addin_request_allowed = true;
@@ -102,24 +189,29 @@ static int *spin_delay_counts;
 bool		Trace_lwlocks = false;
 
 inline static void
-PRINT_LWDEBUG(const char *where, LWLockId lockid, const volatile LWLock *lock)
+PRINT_LWDEBUG(const char *where, LWLockId lockid, volatile LWLock *lock, LWLockMode mode)
 {
 	if (Trace_lwlocks)
-		elog(LOG, "%s(%d): excl %d shared %d rOK %d",
-			 where, (int) lockid,
-			 (int) lock->exclusive, lock->shared,
+	{
+		uint32 lockcount = pg_atomic_read_u32(&lock->lockcount);
+		elog(LOG, "%s(%d)%u: excl %u shared %u waiters %u rOK %d",
+			 where, (int) lockid, mode,
+			 (lockcount & EXCLUSIVE_LOCK) >> 31,
+			 (lockcount & SHARED_LOCK_MASK),
+			 pg_atomic_read_u32(&lock->nwaiters),
 			 (int) lock->releaseOK);
+	}
 }
 
 inline static void
-LOG_LWDEBUG(const char *where, LWLockId lockid, const char *msg)
+LOG_LWDEBUG(const char *where, LWLockId lockid, LWLockMode mode, const char *msg)
 {
 	if (Trace_lwlocks)
-		elog(LOG, "%s(%d): %s", where, (int) lockid, msg);
+		elog(LOG, "%s(%d)%u: %s", where, (int) lockid, mode, msg);
 }
 #else							/* not LOCK_DEBUG */
-#define PRINT_LWDEBUG(a,b,c)
-#define LOG_LWDEBUG(a,b,c)
+#define PRINT_LWDEBUG(a,b,c,d) ((void)0)
+#define LOG_LWDEBUG(a,b,c,d) ((void)0)
 #endif   /* LOCK_DEBUG */
 
 #ifdef LWLOCK_STATS
@@ -285,8 +377,8 @@ CreateLWLocks(void)
 	{
 		SpinLockInit(&lock->lock.mutex);
 		lock->lock.releaseOK = true;
-		lock->lock.exclusive = 0;
-		lock->lock.shared = 0;
+		pg_atomic_init_u32(&lock->lock.lockcount, 0);
+		pg_atomic_init_u32(&lock->lock.nwaiters, 0);
 		dlist_init(&lock->lock.waiters);
 	}
 
@@ -328,6 +420,231 @@ LWLockAssign(void)
 	return result;
 }
 
+/*
+ * Internal function handling the atomic manipulation of lock->lockcount.
+ *
+ * 'double_check' = true means that we try to check more carefully
+ * against causing somebody else to spuriously believe the lock is
+ * already taken, although we're just about to back out of it.
+ */
+static inline bool
+LWLockAttemptLock(volatile LWLock* lock, LWLockMode mode, bool double_check, bool *potentially_spurious)
+{
+	bool mustwait;
+	uint32 oldstate;
+
+	Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+
+	*potentially_spurious = false;
+
+	if (mode == LW_EXCLUSIVE)
+	{
+		uint32 expected = 0;
+		pg_read_barrier();
+
+		/* check without CAS first; it's way cheaper, frequently locked otherwise */
+		if (pg_atomic_read_u32(&lock->lockcount) != 0)
+			mustwait = true;
+		else if (!pg_atomic_compare_exchange_u32(&lock->lockcount,
+												 &expected, EXCLUSIVE_LOCK))
+		{
+			/*
+			 * ok, no can do. Between the pg_atomic_read() above and the
+			 * CAS somebody else acquired the lock.
+			 */
+			mustwait = true;
+		}
+		else
+		{
+			/* yipeyyahee */
+			mustwait = false;
+#ifdef LWLOCK_DEBUG
+			lock->owner = MyProc;
+#endif
+		}
+	}
+	else
+	{
+		/*
+		 * If requested by caller, do an unlocked check first.  This is useful
+		 * if potentially spurious results have a noticeable cost.
+		 */
+		if (double_check)
+		{
+			pg_read_barrier();
+			if (pg_atomic_read_u32(&lock->lockcount) >= EXCLUSIVE_LOCK)
+			{
+				mustwait = true;
+				goto out;
+			}
+		}
+
+		/*
+		 * Acquire the share lock unconditionally using an atomic addition. We
+		 * might have to back out again if it turns out somebody else has an
+		 * exclusive lock.
+		 */
+		oldstate = pg_atomic_fetch_add_u32(&lock->lockcount, 1);
+
+		if (oldstate >= EXCLUSIVE_LOCK)
+		{
+			/*
+			 * Ok, somebody else holds the lock exclusively. We need to back
+			 * away from the shared lock, since we don't actually hold it right
+			 * now.  Since there's a window between lockcount += 1 and lockcount
+			 * -= 1, the previous exclusive locker could have released and
+			 * another exclusive locker could have seen our +1. We need to
+			 * signal that to the upper layers so they can deal with the race
+			 * condition.
+			 */
+
+			/*
+			 * FIXME: check return value if (double_check), it's not
+			 * spurious if still exclusively locked.
+			 */
+			pg_atomic_fetch_sub_u32(&lock->lockcount, 1);
+
+
+			mustwait = true;
+			*potentially_spurious = true;
+		}
+		else
+		{
+			/* yipeyyahee */
+			mustwait = false;
+		}
+	}
+
+out:
+	return mustwait;
+}
+
+/*
+ * Wake up all the lockers that currently have a chance to run.
+ */
+static void
+LWLockWakeup(volatile LWLock *lock, LWLockId lockid, LWLockMode mode)
+{
+	bool		releaseOK;
+	bool		wokeup_somebody = false;
+	dlist_head	wakeup;
+	dlist_mutable_iter iter;
+
+	dlist_init(&wakeup);
+
+	/* remove the to-be-awakened PGPROCs from the queue */
+	releaseOK = true;
+
+	/* Acquire mutex.  Time spent holding mutex should be short! */
+	SpinLockAcquire(&lock->mutex);
+
+	dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
+	{
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
+
+		if (wokeup_somebody && waiter->lwWaitMode == LW_EXCLUSIVE)
+			continue;
+
+		dlist_delete(&waiter->lwWaitLink);
+		dlist_push_tail(&wakeup, &waiter->lwWaitLink);
+
+		if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
+		{
+			/*
+			 * Prevent additional wakeups until retryer gets to run. Backends
+			 * that are just waiting for the lock to become free don't retry
+			 * automatically.
+			 */
+			releaseOK = false;
+			/*
+			 * Don't wake up (further) exclusive lockers.
+			 */
+			wokeup_somebody = true;
+		}
+
+		/*
+		 * Once we've woken up an exclusive locker, there's no point in waking
+		 * up anybody else.
+		 */
+		if(waiter->lwWaitMode == LW_EXCLUSIVE)
+			break;
+	}
+	lock->releaseOK = releaseOK;
+
+	/* We are done updating shared state of the lock queue. */
+	SpinLockRelease(&lock->mutex);
+
+	/*
+	 * Awaken any waiters I removed from the queue.
+	 */
+	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
+	{
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
+		LOG_LWDEBUG("LWLockRelease", lockid, mode, "release waiter");
+		dlist_delete(&waiter->lwWaitLink);
+		pg_write_barrier();
+		waiter->lwWaiting = false;
+		PGSemaphoreUnlock(&waiter->sem);
+	}
+}
+
+/*
+ * Add ourselves to the end of the queue. Mode can be LW_WAIT_UNTIL_FREE here!
+ */
+static inline void
+LWLockQueueSelf(volatile LWLock *lock, LWLockMode mode)
+{
+	/*
+	 * If we don't have a PGPROC structure, there's no way to wait. This
+	 * should never occur, since MyProc should only be null during shared
+	 * memory initialization.
+	 */
+	if (MyProc == NULL)
+		elog(PANIC, "cannot wait without a PGPROC structure");
+
+	pg_atomic_fetch_add_u32(&lock->nwaiters, 1);
+
+	SpinLockAcquire(&lock->mutex);
+	MyProc->lwWaiting = true;
+	MyProc->lwWaitMode = mode;
+	dlist_push_tail((dlist_head *) &lock->waiters, &MyProc->lwWaitLink);
+
+	/* Can release the mutex now */
+	SpinLockRelease(&lock->mutex);
+}
+
+/*
+ * Remove ourselves from the waitlist.  This is used if we queued ourselves
+ * because we thought we needed to sleep but, after further checking, we
+ * discover that we don't actually need to do so. Somebody else might have
+ * already woken us up; in that case return false.
+ */
+static inline bool
+LWLockDequeueSelf(volatile LWLock *lock)
+{
+	bool	found = false;
+	dlist_mutable_iter iter;
+
+	SpinLockAcquire(&lock->mutex);
+
+	/* need to iterate, somebody else could have unqueued us */
+	dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
+	{
+		PGPROC *proc = dlist_container(PGPROC, lwWaitLink, iter.cur);
+		if (proc == MyProc)
+		{
+			found = true;
+			dlist_delete(&proc->lwWaitLink);
+			break;
+		}
+	}
+
+	SpinLockRelease(&lock->mutex);
+
+	if (found)
+		pg_atomic_fetch_sub_u32(&lock->nwaiters, 1);
+	return found;
+}
 
 /*
  * LWLockAcquire - acquire a lightweight lock in the specified mode
@@ -341,10 +658,13 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 {
 	volatile LWLock *lock = &(LWLockArray[lockid].lock);
 	PGPROC	   *proc = MyProc;
-	bool		retry = false;
 	int			extraWaits = 0;
+	bool		potentially_spurious;
+	uint32		iterations = 0;
 
-	PRINT_LWDEBUG("LWLockAcquire", lockid, lock);
+	AssertArg(mode == LW_SHARED || mode == LW_EXCLUSIVE);
+
+	PRINT_LWDEBUG("LWLockAcquire", lockid, lock, mode);
 
 #ifdef LWLOCK_STATS
 	/* Set up local count state first time through in a given process */
@@ -395,58 +715,71 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	{
 		bool		mustwait;
 
-		/* Acquire mutex.  Time spent holding mutex should be short! */
 #ifdef LWLOCK_STATS
+		/* Acquire mutex.  Time spent holding mutex should be short! */
+		/* FIXME this stuff is completely useless now.  Should consider a
+		 * different way to do accounting -- perhaps at LWLockAttemptLock? */
 		spin_delay_counts[lockid] += SpinLockAcquire(&lock->mutex);
-#else
-		SpinLockAcquire(&lock->mutex);
 #endif
 
-		/* If retrying, allow LWLockRelease to release waiters again */
-		if (retry)
-			lock->releaseOK = true;
+		/*
+		 * Try to grab the lock the first time; we're not in the waitqueue yet.
+		 */
+		mustwait = LWLockAttemptLock(lock, mode, false, &potentially_spurious);
+
+		if (!mustwait)
+			break;				/* got the lock */
+
+		/*
+		 * Ok, at this point we couldn't grab the lock on the first try. We
+		 * cannot simply queue ourselves to the end of the list and wait to be
+		 * woken up, because by now the lock could long since have been
+		 * released. Instead add ourselves to the queue and try to grab the
+		 * lock again. If we succeed we need to revert the queueing and be
+		 * happy; otherwise we recheck the lock. If we still couldn't grab it,
+		 * we know that the other locker will see our queue entry when
+		 * releasing, since it existed before we checked for the lock.
+		 */
+
+		/* add to the queue */
+		LWLockQueueSelf(lock, mode);
 
-		/* If I can get the lock, do so quickly. */
-		if (mode == LW_EXCLUSIVE)
+		/* we're now guaranteed to be woken up if necessary */
+		mustwait = LWLockAttemptLock(lock, mode, false, &potentially_spurious);
+
+		/* ok, grabbed the lock the second time round, need to undo queueing */
+		if (!mustwait)
 		{
-			if (lock->exclusive == 0 && lock->shared == 0)
+			if (!LWLockDequeueSelf(lock))
 			{
-				lock->exclusive++;
-				mustwait = false;
+				/*
+				 * Somebody else dequeued us and has or will wake us up. Wait
+				 * for the correct wakeup, otherwise our ->lwWaiting would get
+				 * reset at some inconvenient point later, and releaseOK
+				 * wouldn't be managed correctly.
+				 */
+				for (;;)
+				{
+					PGSemaphoreLock(&proc->sem, false);
+					if (!proc->lwWaiting)
+						break;
+					extraWaits++;
+				}
+				lock->releaseOK = true;
 			}
-			else
-				mustwait = true;
+			PRINT_LWDEBUG("LWLockAcquire undo queue", lockid, lock, mode);
+			break;
 		}
 		else
 		{
-			if (lock->exclusive == 0)
-			{
-				lock->shared++;
-				mustwait = false;
-			}
-			else
-				mustwait = true;
+			PRINT_LWDEBUG("LWLockAcquire waiting 4", lockid, lock, mode);
 		}
 
-		if (!mustwait)
-			break;				/* got the lock */
-
 		/*
-		 * Add myself to wait queue.
-		 *
-		 * If we don't have a PGPROC structure, there's no way to wait. This
-		 * should never occur, since MyProc should only be null during shared
-		 * memory initialization.
+		 * NB: There's no need to deal with spurious lock attempts
+		 * here. Anyone we prevented from acquiring the lock will
+		 * enqueue themselves using the same protocol we used here.
 		 */
-		if (proc == NULL)
-			elog(PANIC, "cannot wait without a PGPROC structure");
-
-		proc->lwWaiting = true;
-		proc->lwWaitMode = mode;
-		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
-
-		/* Can release the mutex now */
-		SpinLockRelease(&lock->mutex);
 
 		/*
 		 * Wait until awakened.
@@ -460,7 +793,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 		 * so that the lock manager or signal manager will see the received
 		 * signal when it next waits.
 		 */
-		LOG_LWDEBUG("LWLockAcquire", lockid, "waiting");
+		LOG_LWDEBUG("LWLockAcquire", lockid, mode, "waiting");
 
 #ifdef LWLOCK_STATS
 		block_counts[lockid]++;
@@ -476,22 +809,22 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 				break;
 			extraWaits++;
 		}
+		lock->releaseOK = true;
 
 		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
 
-		LOG_LWDEBUG("LWLockAcquire", lockid, "awakened");
+		LOG_LWDEBUG("LWLockAcquire", lockid, mode, "awakened");
 
 		/* Now loop back and try to acquire lock again. */
-		retry = true;
+		pg_atomic_fetch_sub_u32(&lock->nwaiters, 1);
+		iterations++;
 	}
 
-	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
-
 	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(lockid, mode);
 
 	/* Add lock to list of locks held by this backend */
-	held_lwlocks[num_held_lwlocks++] = lockid;
+	held_lwlocks[num_held_lwlocks].lock = lockid;
+	held_lwlocks[num_held_lwlocks++].mode = mode;
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -512,8 +845,11 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
 {
 	volatile LWLock *lock = &(LWLockArray[lockid].lock);
 	bool		mustwait;
+	bool		potentially_spurious;
+
+	AssertArg(mode == LW_SHARED || mode == LW_EXCLUSIVE);
 
-	PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock);
+	PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock, mode);
 
 	/* Ensure we will have room to remember the lock */
 	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
@@ -526,48 +862,42 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
 	 */
 	HOLD_INTERRUPTS();
 
-	/* Acquire mutex.  Time spent holding mutex should be short! */
-	SpinLockAcquire(&lock->mutex);
-
-	/* If I can get the lock, do so quickly. */
-	if (mode == LW_EXCLUSIVE)
-	{
-		if (lock->exclusive == 0 && lock->shared == 0)
-		{
-			lock->exclusive++;
-			mustwait = false;
-		}
-		else
-			mustwait = true;
-	}
-	else
-	{
-		if (lock->exclusive == 0)
-		{
-			lock->shared++;
-			mustwait = false;
-		}
-		else
-			mustwait = true;
-	}
-
-	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
+retry:
+	/*
+	 * passing 'true' to check more carefully to avoid potential
+	 * spurious acquisitions
+	 */
+	mustwait = LWLockAttemptLock(lock, mode, true, &potentially_spurious);
 
 	if (mustwait)
 	{
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
-		LOG_LWDEBUG("LWLockConditionalAcquire", lockid, "failed");
+		LOG_LWDEBUG("LWLockConditionalAcquire", lockid, mode, "failed");
 		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(lockid, mode);
+
+		/*
+		 * We ran into an exclusive lock and might have blocked another
+		 * exclusive locker from taking a shot because it took us a moment to
+		 * back off. Retry until we are either sure we didn't block anybody
+		 * (because somebody else certainly has the lock) or until we get it.
+		 *
+		 * We cannot rely on the two-step lock-acquisition protocol as in
+		 * LWLockAcquire because we're not using it here.
+		 */
+		if (potentially_spurious)
+		{
+			SPIN_DELAY();
+			goto retry;
+		}
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = lockid;
+		held_lwlocks[num_held_lwlocks].lock = lockid;
+		held_lwlocks[num_held_lwlocks++].mode = mode;
 		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(lockid, mode);
 	}
-
 	return !mustwait;
 }
 
@@ -592,8 +922,11 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
 	PGPROC	   *proc = MyProc;
 	bool		mustwait;
 	int			extraWaits = 0;
+	bool		potentially_spurious_first;
+	bool		potentially_spurious_second;
 
-	PRINT_LWDEBUG("LWLockAcquireOrWait", lockid, lock);
+	Assert(mode == LW_SHARED || mode == LW_EXCLUSIVE);
+	PRINT_LWDEBUG("LWLockAcquireOrWait", lockid, lock, mode);
 
 #ifdef LWLOCK_STATS
 	/* Set up local count state first time through in a given process */
@@ -612,79 +945,64 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
 	 */
 	HOLD_INTERRUPTS();
 
-	/* Acquire mutex.  Time spent holding mutex should be short! */
-	SpinLockAcquire(&lock->mutex);
-
-	/* If I can get the lock, do so quickly. */
-	if (mode == LW_EXCLUSIVE)
-	{
-		if (lock->exclusive == 0 && lock->shared == 0)
-		{
-			lock->exclusive++;
-			mustwait = false;
-		}
-		else
-			mustwait = true;
-	}
-	else
-	{
-		if (lock->exclusive == 0)
-		{
-			lock->shared++;
-			mustwait = false;
-		}
-		else
-			mustwait = true;
-	}
+	/*
+	 * NB: We're using nearly the same twice-in-a-row lock acquisition
+	 * protocol as LWLockAcquire(). Check its comments for details.
+	 */
+	mustwait = LWLockAttemptLock(lock, mode, false, &potentially_spurious_first);
 
 	if (mustwait)
 	{
-		/*
-		 * Add myself to wait queue.
-		 *
-		 * If we don't have a PGPROC structure, there's no way to wait.  This
-		 * should never occur, since MyProc should only be null during shared
-		 * memory initialization.
-		 */
-		if (proc == NULL)
-			elog(PANIC, "cannot wait without a PGPROC structure");
+		LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE);
 
-		proc->lwWaiting = true;
-		proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
-		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
+		mustwait = LWLockAttemptLock(lock, mode, false, &potentially_spurious_second);
 
-		/* Can release the mutex now */
-		SpinLockRelease(&lock->mutex);
-
-		/*
-		 * Wait until awakened.  Like in LWLockAcquire, be prepared for bogus
-		 * wakups, because we share the semaphore with ProcWaitForSignal.
-		 */
-		LOG_LWDEBUG("LWLockAcquireOrWait", lockid, "waiting");
+		if (mustwait)
+		{
+			/*
+			 * Wait until awakened.  Like in LWLockAcquire, be prepared for bogus
+			 * wakups, because we share the semaphore with ProcWaitForSignal.
+			 * wakeups, because we share the semaphore with ProcWaitForSignal.
+			LOG_LWDEBUG("LWLockAcquireOrWait", lockid, mode, "waiting");
 
 #ifdef LWLOCK_STATS
-		block_counts[lockid]++;
+			block_counts[lockid]++;
 #endif
+			TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(lockid, mode);
+			for (;;)
+			{
+				/* "false" means cannot accept cancel/die interrupt here. */
+				PGSemaphoreLock(&proc->sem, false);
+				if (!proc->lwWaiting)
+					break;
+				extraWaits++;
+			}
+			pg_atomic_fetch_sub_u32(&lock->nwaiters, 1);
 
-		for (;;)
-		{
-			/* "false" means cannot accept cancel/die interrupt here. */
-			PGSemaphoreLock(&proc->sem, false);
-			if (!proc->lwWaiting)
-				break;
-			extraWaits++;
-		}
+			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(lockid, mode);
+			LOG_LWDEBUG("LWLockAcquireOrWait", lockid, mode, "awakened");
+		}
+		else
+		{
+			/* got lock in the second attempt, undo queueing */
+			if (!LWLockDequeueSelf(lock))
+			{
+				for (;;)
+				{
+					PGSemaphoreLock(&proc->sem, false);
+					if (!proc->lwWaiting)
+						break;
+					extraWaits++;
+				}
+			}
 
-		LOG_LWDEBUG("LWLockAcquireOrWait", lockid, "awakened");
-	}
-	else
-	{
-		/* We are done updating shared state of the lock itself. */
-		SpinLockRelease(&lock->mutex);
+			/* FIXME: don't need that anymore? */
+#if 0
+			LWLockWakeup(lock, lockid, mode);
+#endif
+		}
 	}
 
 	/*
@@ -697,13 +1015,15 @@ LWLockAcquireOrWait(LWLockId lockid, LWLockMode mode)
 	{
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
-		LOG_LWDEBUG("LWLockAcquireOrWait", lockid, "failed");
+		LOG_LWDEBUG("LWLockAcquireOrWait", lockid, mode, "failed");
 		TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE_FAIL(lockid, mode);
 	}
 	else
 	{
+		LOG_LWDEBUG("LWLockAcquireOrWait", lockid, mode, "succeeded");
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = lockid;
+		held_lwlocks[num_held_lwlocks].lock = lockid;
+		held_lwlocks[num_held_lwlocks++].mode = mode;
 		TRACE_POSTGRESQL_LWLOCK_WAIT_UNTIL_FREE(lockid, mode);
 	}
 
@@ -717,13 +1037,11 @@ void
 LWLockRelease(LWLockId lockid)
 {
 	volatile LWLock *lock = &(LWLockArray[lockid].lock);
-	dlist_head	wakeup;
-	dlist_mutable_iter iter;
 	int			i;
-
-	dlist_init(&wakeup);
-
-	PRINT_LWDEBUG("LWLockRelease", lockid, lock);
+	LWLockMode	mode;
+	uint32		lockcount;
+	bool		check_waiters;
+	bool		have_waiters = false;
 
 	/*
 	 * Remove lock from list of locks held.  Usually, but not always, it will
@@ -731,8 +1049,11 @@ LWLockRelease(LWLockId lockid)
 	 */
 	for (i = num_held_lwlocks; --i >= 0;)
 	{
-		if (lockid == held_lwlocks[i])
+		if (lockid == held_lwlocks[i].lock)
+		{
+			mode = held_lwlocks[i].mode;
 			break;
+		}
 	}
 	if (i < 0)
 		elog(ERROR, "lock %d is not held", (int) lockid);
@@ -740,78 +1061,50 @@ LWLockRelease(LWLockId lockid)
 	for (; i < num_held_lwlocks; i++)
 		held_lwlocks[i] = held_lwlocks[i + 1];
 
-	/* Acquire mutex.  Time spent holding mutex should be short! */
-	SpinLockAcquire(&lock->mutex);
+	PRINT_LWDEBUG("LWLockRelease", lockid, lock, mode);
 
-	/* Release my hold on lock */
-	if (lock->exclusive > 0)
-		lock->exclusive--;
+	pg_read_barrier();
+
+	/* Release my hold on the lock; both operations are full barriers */
+	if (mode == LW_EXCLUSIVE)
+		lockcount = pg_atomic_sub_fetch_u32(&lock->lockcount, EXCLUSIVE_LOCK);
 	else
-	{
-		Assert(lock->shared > 0);
-		lock->shared--;
-	}
+		lockcount = pg_atomic_sub_fetch_u32(&lock->lockcount, 1);
+
+	/* nobody else can have that kind of lock */
+	Assert(lockcount < EXCLUSIVE_LOCK);
 
 	/*
-	 * See if I need to awaken any waiters.  If I released a non-last shared
-	 * hold, there cannot be anything to do.  Also, do not awaken any waiters
-	 * if someone has already awakened waiters that haven't yet acquired the
-	 * lock.
+	 * Anybody we need to wake up needs to have started queueing before we
+	 * released the lock above, and the atomic operations above are full
+	 * barriers, so we are guaranteed to see their nwaiters increment.
 	 */
-	if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
-	{
-		/*
-		 * Remove the to-be-awakened PGPROCs from the queue.
-		 */
-		bool		releaseOK = true;
-		bool		wokeup_somebody = false;
-
-		dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
-		{
-			PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
-
-			if (wokeup_somebody && waiter->lwWaitMode == LW_EXCLUSIVE)
-				continue;
-
-			dlist_delete(&waiter->lwWaitLink);
-			dlist_push_tail(&wakeup, &waiter->lwWaitLink);
 
-			/*
-			 * Prevent additional wakeups until retryer gets to
-			 * run. Backends that are just waiting for the lock to become
-			 * free don't retry automatically.
-			 */
-			if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
-			{
-				releaseOK = false;
-				wokeup_somebody = true;
-			}
+	if (pg_atomic_read_u32(&lock->nwaiters) > 0)
+		have_waiters = true;
+
+	/* we're still waiting for backends to get scheduled, don't release again */
+	if (!lock->releaseOK)
+		check_waiters = false;
+	/* grant permission to run, even if a spurious share lock increases lockcount */
+	else if (mode == LW_EXCLUSIVE && have_waiters)
+		check_waiters = true;
+	/* nobody has this locked anymore, potential exclusive lockers get a chance */
+	else if (lockcount == 0 && have_waiters)
+		check_waiters = true;
+	/* nobody queued or not free */
+	else
+		check_waiters = false;
 
-			if(waiter->lwWaitMode == LW_EXCLUSIVE)
-				break;
-		}
-		lock->releaseOK = releaseOK;
+	if (check_waiters)
+	{
+		PRINT_LWDEBUG("LWLockRelease releasing", lockid, lock, mode);
+		LWLockWakeup(lock, lockid, mode);
 	}
 
-	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
-
 	TRACE_POSTGRESQL_LWLOCK_RELEASE(lockid);
 
 	/*
-	 * Awaken any waiters I removed from the queue.
-	 */
-	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
-	{
-		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
-		LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
-		dlist_delete(&waiter->lwWaitLink);
-		pg_write_barrier();
-		waiter->lwWaiting = false;
-		PGSemaphoreUnlock(&waiter->sem);
-	}
-
-	/*
 	 * Now okay to allow cancel/die interrupts.
 	 */
 	RESUME_INTERRUPTS();
@@ -834,7 +1127,7 @@ LWLockReleaseAll(void)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
 
-		LWLockRelease(held_lwlocks[num_held_lwlocks - 1]);
+		LWLockRelease(held_lwlocks[num_held_lwlocks - 1].lock);
 	}
 }
 
@@ -842,8 +1135,7 @@ LWLockReleaseAll(void)
 /*
  * LWLockHeldByMe - test whether my process currently holds a lock
  *
- * This is meant as debug support only.  We do not distinguish whether the
- * lock is held shared or exclusive.
+ * This is meant as debug support only.
  */
 bool
 LWLockHeldByMe(LWLockId lockid)
@@ -852,7 +1144,7 @@ LWLockHeldByMe(LWLockId lockid)
 
 	for (i = 0; i < num_held_lwlocks; i++)
 	{
-		if (held_lwlocks[i] == lockid)
+		if (held_lwlocks[i].lock == lockid)
 			return true;
 	}
 	return false;
-- 
1.8.3.251.g1462b67
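
As a side note on the protocol the lwlock patch above implements, since it is
spread over several functions: lockcount packs the entire lock state into one
32 bit value, with the high bit (EXCLUSIVE_LOCK) marking an exclusive holder
and the lower bits counting shared holders. Exclusive acquisition is a single
compare-and-exchange from zero; shared acquisition is an unconditional
fetch-add that is backed out again if an exclusive holder turns out to exist.
The following is only an illustrative sketch of that fast path, not part of
the patch -- it uses C11 atomics instead of the pg_atomic_* wrappers so it is
self-contained, and it leaves out the wait queue, releaseOK and the
potentially_spurious reporting:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define EXCLUSIVE_LOCK	(UINT32_C(1) << 31)

typedef struct
{
	_Atomic uint32_t lockcount;	/* high bit: exclusive, low bits: shared */
} sketch_lwlock;

/* exclusive: only succeeds if nobody holds the lock in any mode */
static bool
sketch_try_exclusive(sketch_lwlock *lock)
{
	uint32_t	expected = 0;

	return atomic_compare_exchange_strong(&lock->lockcount, &expected,
										  EXCLUSIVE_LOCK);
}

/* shared: increment unconditionally, back out if an exclusive holder exists */
static bool
sketch_try_shared(sketch_lwlock *lock)
{
	uint32_t	old = atomic_fetch_add(&lock->lockcount, 1);

	if (old >= EXCLUSIVE_LOCK)
	{
		/* our transient increment may have been seen by somebody else */
		atomic_fetch_sub(&lock->lockcount, 1);
		return false;
	}
	return true;
}

When this fast path fails, the patch queues the backend (LWLockQueueSelf),
retries the same attempt once, and only then sleeps; if the retry succeeds it
removes itself from the queue again (LWLockDequeueSelf). The comments in
LWLockAcquire explain why that is safe: whoever beat us to the lock will see
our queue entry when releasing, because it was added before the second
attempt.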

Attachment: 0001-Try-to-make-HP-UX-s-ac-not-cause-warnings-about-unus.patch (text/x-patch)
From f6927019352052f8b3d9cbde3d0084b666449426 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 18 Nov 2013 14:45:11 +0100
Subject: [PATCH 1/4] Try to make HP-UX's ac++ not cause warnings about unused
 static inlines.

---
 src/template/hpux | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/template/hpux b/src/template/hpux
index ce4d93c..6349ae5 100644
--- a/src/template/hpux
+++ b/src/template/hpux
@@ -3,7 +3,9 @@
 CPPFLAGS="$CPPFLAGS -D_XOPEN_SOURCE_EXTENDED"
 
 if test "$GCC" != yes ; then
-  CC="$CC -Ae"
+  # -Ae enables C89/C99
+  # +W2177 should disable warnings about unused static inlines
+  CC="$CC -Ae +W2177"
   CFLAGS="+O2"
 fi
 
-- 
1.8.3.251.g1462b67

Attachment: 0002-Very-basic-atomic-ops-implementation.patch (text/x-patch)
From 17f35a602a315fb7793e56e2905e0d09ad774524 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 18 Nov 2013 14:45:11 +0100
Subject: [PATCH 2/4] Very basic atomic ops implementation

Only gcc has been tested; stub implementations for
* msvc
* HP-UX acc
* IBM xlc
* Sun/Oracle Sunpro compiler
are included.

All implementations only provide the bare minimum of operations for
now and rely on emulating more complex operations via
compare_and_swap().

Most of s_lock.h is removed and replaced by usage of the atomics ops.
---
 config/c-compiler.m4                             |  77 ++
 configure                                        | 321 +++++++--
 configure.in                                     |  17 +-
 src/backend/storage/lmgr/Makefile                |   2 +-
 src/backend/storage/lmgr/atomics.c               |  26 +
 src/backend/storage/lmgr/spin.c                  |   8 -
 src/include/c.h                                  |   7 +
 src/include/pg_config.h.in                       |  11 +-
 src/include/storage/atomics-arch-alpha.h         |  25 +
 src/include/storage/atomics-arch-amd64.h         |  56 ++
 src/include/storage/atomics-arch-arm.h           |  29 +
 src/include/storage/atomics-arch-hppa.h          |  17 +
 src/include/storage/atomics-arch-i386.h          |  96 +++
 src/include/storage/atomics-arch-ia64.h          |  27 +
 src/include/storage/atomics-arch-ppc.h           |  25 +
 src/include/storage/atomics-generic-acc.h        |  99 +++
 src/include/storage/atomics-generic-gcc.h        | 174 +++++
 src/include/storage/atomics-generic-msvc.h       |  83 +++
 src/include/storage/atomics-generic-sunpro.h     |  66 ++
 src/include/storage/atomics-generic-xlc.h        |  84 +++
 src/include/storage/atomics-generic.h            | 402 +++++++++++
 src/include/storage/atomics.h                    | 596 +++++++++++++++
 src/include/storage/barrier.h                    | 137 +---
 src/include/storage/s_lock.h                     | 880 +----------------------
 src/test/regress/expected/lock.out               |   8 +
 src/test/regress/input/create_function_1.source  |   5 +
 src/test/regress/output/create_function_1.source |   4 +
 src/test/regress/regress.c                       | 213 ++++++
 src/test/regress/sql/lock.sql                    |   5 +
 29 files changed, 2412 insertions(+), 1088 deletions(-)
 create mode 100644 src/backend/storage/lmgr/atomics.c
 create mode 100644 src/include/storage/atomics-arch-alpha.h
 create mode 100644 src/include/storage/atomics-arch-amd64.h
 create mode 100644 src/include/storage/atomics-arch-arm.h
 create mode 100644 src/include/storage/atomics-arch-hppa.h
 create mode 100644 src/include/storage/atomics-arch-i386.h
 create mode 100644 src/include/storage/atomics-arch-ia64.h
 create mode 100644 src/include/storage/atomics-arch-ppc.h
 create mode 100644 src/include/storage/atomics-generic-acc.h
 create mode 100644 src/include/storage/atomics-generic-gcc.h
 create mode 100644 src/include/storage/atomics-generic-msvc.h
 create mode 100644 src/include/storage/atomics-generic-sunpro.h
 create mode 100644 src/include/storage/atomics-generic-xlc.h
 create mode 100644 src/include/storage/atomics-generic.h
 create mode 100644 src/include/storage/atomics.h

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 4ba3236..c24cb59 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -289,3 +289,80 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_PROG_CC_LDFLAGS_OPT
+
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_ATOMICS],
+[AC_CACHE_CHECK(for __builtin_constant_p, pgac_cv__builtin_constant_p,
+[AC_TRY_COMPILE([static int x; static int y[__builtin_constant_p(x) ? x : 1];],
+[],
+[pgac_cv__builtin_constant_p=yes],
+[pgac_cv__builtin_constant_p=no])])
+if test x"$pgac_cv__builtin_constant_p" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CONSTANT_P, 1,
+          [Define to 1 if your compiler understands __builtin_constant_p.])
+fi])# PGAC_C_BUILTIN_CONSTANT_P
+
+# PGAC_HAVE_GCC__SYNC_INT32_TAS
+# -------------------------
+# Check if the C compiler understands __sync_lock_test_and_set(),
+# and define HAVE_GCC__SYNC_INT32_TAS
+#
+# NB: There are platforms where test_and_set is available but compare_and_swap
+# is not, so test this separately.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_TAS],
+[AC_CACHE_CHECK(for builtin locking functions, pgac_cv_gcc_sync_int32_tas,
+[AC_TRY_LINK([],
+  [int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);],
+  [pgac_cv_gcc_sync_int32_tas="yes"],
+  [pgac_cv_gcc_sync_int32_tas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_TAS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_TAS
+
+# PGAC_HAVE_GCC__SYNC_INT32_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 32bit
+# types, and define HAVE_GCC__SYNC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __sync int32 atomic operations, pgac_cv_gcc_sync_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);],
+  [pgac_cv_gcc_sync_int32_cas="yes"],
+  [pgac_cv_gcc_sync_int32_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int *, int, int).])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_CAS
+
+# PGAC_HAVE_GCC__SYNC_INT64_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 64bit
+# types, and define HAVE_GCC__SYNC_INT64_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT64_CAS],
+[AC_CACHE_CHECK(for builtin __sync int64 atomic operations, pgac_cv_gcc_sync_int64_cas,
+[AC_TRY_LINK([],
+  [PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);],
+  [pgac_cv_gcc_sync_int64_cas="yes"],
+  [pgac_cv_gcc_sync_int64_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT64_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64).])
+fi])# PGAC_HAVE_GCC__SYNC_INT64_CAS
+
+# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+# -------------------------
+
+# Check if the C compiler understands __atomic_compare_exchange_n() for 32bit
+# types, and define HAVE_GCC__ATOMIC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __atomic int32 atomic operations, pgac_cv_gcc_atomic_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);],
+  [pgac_cv_gcc_atomic_int32_cas="yes"],
+  [pgac_cv_gcc_atomic_int32_cas="no"])])
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT32_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
diff --git a/configure b/configure
index 74e34eb..04cd0ec 100755
--- a/configure
+++ b/configure
@@ -22984,71 +22984,6 @@ fi
 done
 
 
-{ $as_echo "$as_me:$LINENO: checking for builtin locking functions" >&5
-$as_echo_n "checking for builtin locking functions... " >&6; }
-if test "${pgac_cv_gcc_int_atomics+set}" = set; then
-  $as_echo_n "(cached) " >&6
-else
-  cat >conftest.$ac_ext <<_ACEOF
-/* confdefs.h.  */
-_ACEOF
-cat confdefs.h >>conftest.$ac_ext
-cat >>conftest.$ac_ext <<_ACEOF
-/* end confdefs.h.  */
-
-int
-main ()
-{
-int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);
-  ;
-  return 0;
-}
-_ACEOF
-rm -f conftest.$ac_objext conftest$ac_exeext
-if { (ac_try="$ac_link"
-case "(($ac_try" in
-  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
-  *) ac_try_echo=$ac_try;;
-esac
-eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
-$as_echo "$ac_try_echo") >&5
-  (eval "$ac_link") 2>conftest.er1
-  ac_status=$?
-  grep -v '^ *+' conftest.er1 >conftest.err
-  rm -f conftest.er1
-  cat conftest.err >&5
-  $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
-  (exit $ac_status); } && {
-	 test -z "$ac_c_werror_flag" ||
-	 test ! -s conftest.err
-       } && test -s conftest$ac_exeext && {
-	 test "$cross_compiling" = yes ||
-	 $as_test_x conftest$ac_exeext
-       }; then
-  pgac_cv_gcc_int_atomics="yes"
-else
-  $as_echo "$as_me: failed program was:" >&5
-sed 's/^/| /' conftest.$ac_ext >&5
-
-	pgac_cv_gcc_int_atomics="no"
-fi
-
-rm -rf conftest.dSYM
-rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
-      conftest$ac_exeext conftest.$ac_ext
-fi
-{ $as_echo "$as_me:$LINENO: result: $pgac_cv_gcc_int_atomics" >&5
-$as_echo "$pgac_cv_gcc_int_atomics" >&6; }
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-
-cat >>confdefs.h <<\_ACEOF
-#define HAVE_GCC_INT_ATOMICS 1
-_ACEOF
-
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -28815,6 +28750,262 @@ _ACEOF
 fi
 
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+{ $as_echo "$as_me:$LINENO: checking for builtin locking functions" >&5
+$as_echo_n "checking for builtin locking functions... " >&6; }
+if test "${pgac_cv_gcc_sync_int32_tas+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h.  */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);
+  ;
+  return 0;
+}
+_ACEOF
+rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+  (eval "$ac_link") 2>conftest.er1
+  ac_status=$?
+  grep -v '^ *+' conftest.er1 >conftest.err
+  rm -f conftest.er1
+  cat conftest.err >&5
+  $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+  (exit $ac_status); } && {
+	 test -z "$ac_c_werror_flag" ||
+	 test ! -s conftest.err
+       } && test -s conftest$ac_exeext && {
+	 test "$cross_compiling" = yes ||
+	 $as_test_x conftest$ac_exeext
+       }; then
+  pgac_cv_gcc_sync_int32_tas="yes"
+else
+  $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+	pgac_cv_gcc_sync_int32_tas="no"
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+      conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:$LINENO: result: $pgac_cv_gcc_sync_int32_tas" >&5
+$as_echo "$pgac_cv_gcc_sync_int32_tas" >&6; }
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+
+cat >>confdefs.h <<\_ACEOF
+#define HAVE_GCC__SYNC_INT32_TAS 1
+_ACEOF
+
+fi
+{ $as_echo "$as_me:$LINENO: checking for builtin __sync int32 atomic operations" >&5
+$as_echo_n "checking for builtin __sync int32 atomic operations... " >&6; }
+if test "${pgac_cv_gcc_sync_int32_cas+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h.  */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);
+  ;
+  return 0;
+}
+_ACEOF
+rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+  (eval "$ac_link") 2>conftest.er1
+  ac_status=$?
+  grep -v '^ *+' conftest.er1 >conftest.err
+  rm -f conftest.er1
+  cat conftest.err >&5
+  $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+  (exit $ac_status); } && {
+	 test -z "$ac_c_werror_flag" ||
+	 test ! -s conftest.err
+       } && test -s conftest$ac_exeext && {
+	 test "$cross_compiling" = yes ||
+	 $as_test_x conftest$ac_exeext
+       }; then
+  pgac_cv_gcc_sync_int32_cas="yes"
+else
+  $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+	pgac_cv_gcc_sync_int32_cas="no"
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+      conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:$LINENO: result: $pgac_cv_gcc_sync_int32_cas" >&5
+$as_echo "$pgac_cv_gcc_sync_int32_cas" >&6; }
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+
+cat >>confdefs.h <<\_ACEOF
+#define HAVE_GCC__SYNC_INT32_CAS 1
+_ACEOF
+
+fi
+{ $as_echo "$as_me:$LINENO: checking for builtin __sync int64 atomic operations" >&5
+$as_echo_n "checking for builtin __sync int64 atomic operations... " >&6; }
+if test "${pgac_cv_gcc_sync_int64_cas+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h.  */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h.  */
+
+int
+main ()
+{
+PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);
+  ;
+  return 0;
+}
+_ACEOF
+rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+  (eval "$ac_link") 2>conftest.er1
+  ac_status=$?
+  grep -v '^ *+' conftest.er1 >conftest.err
+  rm -f conftest.er1
+  cat conftest.err >&5
+  $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+  (exit $ac_status); } && {
+	 test -z "$ac_c_werror_flag" ||
+	 test ! -s conftest.err
+       } && test -s conftest$ac_exeext && {
+	 test "$cross_compiling" = yes ||
+	 $as_test_x conftest$ac_exeext
+       }; then
+  pgac_cv_gcc_sync_int64_cas="yes"
+else
+  $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+	pgac_cv_gcc_sync_int64_cas="no"
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+      conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:$LINENO: result: $pgac_cv_gcc_sync_int64_cas" >&5
+$as_echo "$pgac_cv_gcc_sync_int64_cas" >&6; }
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+
+cat >>confdefs.h <<\_ACEOF
+#define HAVE_GCC__SYNC_INT64_CAS 1
+_ACEOF
+
+fi
+{ $as_echo "$as_me:$LINENO: checking for builtin __atomic int32 atomic operations" >&5
+$as_echo_n "checking for builtin __atomic int32 atomic operations... " >&6; }
+if test "${pgac_cv_gcc_atomic_int32_cas+set}" = set; then
+  $as_echo_n "(cached) " >&6
+else
+  cat >conftest.$ac_ext <<_ACEOF
+/* confdefs.h.  */
+_ACEOF
+cat confdefs.h >>conftest.$ac_ext
+cat >>conftest.$ac_ext <<_ACEOF
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
+  ;
+  return 0;
+}
+_ACEOF
+rm -f conftest.$ac_objext conftest$ac_exeext
+if { (ac_try="$ac_link"
+case "(($ac_try" in
+  *\"* | *\`* | *\\*) ac_try_echo=\$ac_try;;
+  *) ac_try_echo=$ac_try;;
+esac
+eval ac_try_echo="\"\$as_me:$LINENO: $ac_try_echo\""
+$as_echo "$ac_try_echo") >&5
+  (eval "$ac_link") 2>conftest.er1
+  ac_status=$?
+  grep -v '^ *+' conftest.er1 >conftest.err
+  rm -f conftest.er1
+  cat conftest.err >&5
+  $as_echo "$as_me:$LINENO: \$? = $ac_status" >&5
+  (exit $ac_status); } && {
+	 test -z "$ac_c_werror_flag" ||
+	 test ! -s conftest.err
+       } && test -s conftest$ac_exeext && {
+	 test "$cross_compiling" = yes ||
+	 $as_test_x conftest$ac_exeext
+       }; then
+  pgac_cv_gcc_atomic_int32_cas="yes"
+else
+  $as_echo "$as_me: failed program was:" >&5
+sed 's/^/| /' conftest.$ac_ext >&5
+
+	pgac_cv_gcc_atomic_int32_cas="no"
+fi
+
+rm -rf conftest.dSYM
+rm -f core conftest.err conftest.$ac_objext conftest_ipa8_conftest.oo \
+      conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:$LINENO: result: $pgac_cv_gcc_atomic_int32_cas" >&5
+$as_echo "$pgac_cv_gcc_atomic_int32_cas" >&6; }
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+
+cat >>confdefs.h <<\_ACEOF
+#define HAVE_GCC__ATOMIC_INT32_CAS 1
+_ACEOF
+
+fi
 
 if test "$PORTNAME" != "win32"
 then
diff --git a/configure.in b/configure.in
index 304399e..e9d8309 100644
--- a/configure.in
+++ b/configure.in
@@ -1453,17 +1453,6 @@ fi
 AC_CHECK_FUNCS([strtoll strtoq], [break])
 AC_CHECK_FUNCS([strtoull strtouq], [break])
 
-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
-  [int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);],
-  [pgac_cv_gcc_int_atomics="yes"],
-  [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -1738,6 +1727,12 @@ AC_CHECK_TYPES([int8, uint8, int64, uint64], [], [],
 # C, but is missing on some old platforms.
 AC_CHECK_TYPES(sig_atomic_t, [], [], [#include <signal.h>])
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+PGAC_HAVE_GCC__SYNC_INT32_TAS
+PGAC_HAVE_GCC__SYNC_INT32_CAS
+PGAC_HAVE_GCC__SYNC_INT64_CAS
+PGAC_HAVE_GCC__ATOMIC_INT32_CAS
 
 if test "$PORTNAME" != "win32"
 then
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e12a854..be95d50 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/storage/lmgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o spin.o s_lock.o predicate.o
+OBJS = atomics.o lmgr.o lock.o proc.o deadlock.o lwlock.o spin.o s_lock.o predicate.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/storage/lmgr/atomics.c b/src/backend/storage/lmgr/atomics.c
new file mode 100644
index 0000000..af95153
--- /dev/null
+++ b/src/backend/storage/lmgr/atomics.c
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.c
+ *	   Non-Inline parts of the atomics implementation
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/lmgr/atomics.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/*
+ * We want the functions below to be inline; but if the compiler doesn't
+ * support that, fall back on providing them as regular functions.	See
+ * STATIC_IF_INLINE in c.h.
+ */
+#define ATOMICS_INCLUDE_DEFINITIONS
+
+#include "storage/atomics.h"
+
+#ifdef PG_USE_ATOMICS_EMULATION
+#endif
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index f054be8..469dd84 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -86,14 +86,6 @@ s_unlock_sema(volatile slock_t *lock)
 	PGSemaphoreUnlock((PGSemaphore) lock);
 }
 
-bool
-s_lock_free_sema(volatile slock_t *lock)
-{
-	/* We don't currently use S_LOCK_FREE anyway */
-	elog(ERROR, "spin.c does not support S_LOCK_FREE()");
-	return false;
-}
-
 int
 tas_sema(volatile slock_t *lock)
 {
diff --git a/src/include/c.h b/src/include/c.h
index 6e19c6d..347f394 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -866,6 +866,13 @@ typedef NameData *Name;
 #define STATIC_IF_INLINE
 #endif   /* PG_USE_INLINE */
 
+#ifdef PG_USE_INLINE
+#define STATIC_IF_INLINE_DECLARE static inline
+#else
+#define STATIC_IF_INLINE_DECLARE extern
+#endif   /* PG_USE_INLINE */
+
+
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
  * ----------------------------------------------------------------
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5eac52d..3dd76a2 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -173,8 +173,17 @@
 /* Define to 1 if your compiler understands __FUNCTION__. */
 #undef HAVE_FUNCNAME__FUNCTION
 
+/* Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int). */
+#undef HAVE_GCC__ATOMIC_INT32_CAS
+
+/* Define to 1 if you have __sync_compare_and_swap(int *, int, int). */
+#undef HAVE_GCC__SYNC_INT32_CAS
+
 /* Define to 1 if you have __sync_lock_test_and_set(int *) and friends. */
-#undef HAVE_GCC_INT_ATOMICS
+#undef HAVE_GCC__SYNC_INT32_TAS
+
+/* Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64). */
+#undef HAVE_GCC__SYNC_INT64_CAS
 
 /* Define to 1 if you have the `getaddrinfo' function. */
 #undef HAVE_GETADDRINFO
diff --git a/src/include/storage/atomics-arch-alpha.h b/src/include/storage/atomics-arch-alpha.h
new file mode 100644
index 0000000..06b7125
--- /dev/null
+++ b/src/include/storage/atomics-arch-alpha.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-alpha.h
+ *	  Atomic operations considerations specific to Alpha
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-alpha.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(__GNUC__)
+/*
+ * Unlike all other known architectures, Alpha allows dependent reads to be
+ * reordered, but we don't currently find it necessary to provide a conditional
+ * read barrier to cover that case.  We might need to add that later.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
+#endif
diff --git a/src/include/storage/atomics-arch-amd64.h b/src/include/storage/atomics-arch-amd64.h
new file mode 100644
index 0000000..90f4abc
--- /dev/null
+++ b/src/include/storage/atomics-arch-amd64.h
@@ -0,0 +1,56 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-amd64.h
+ *	  Atomic operations considerations specific to Intel/AMD x86-64
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-amd64.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * x86_64 has similar ordering characteristics to i386.
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered.  In those cases we'd need
+ * the read and write barriers to be lfence and sfence.  But since we don't
+ * do those things, a compiler barrier should be enough.
+ */
+#if defined(__INTEL_COMPILER)
+#	define pg_memory_barrier_impl()		_mm_mfence()
+#elif defined(__GNUC__)
+#	define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
+#endif
+
+#define pg_read_barrier_impl()		pg_compiler_barrier()
+#define pg_write_barrier_impl()		pg_compiler_barrier()
+
+#if defined(__GNUC__)
+
+#ifndef PG_HAS_SPIN_DELAY
+#define PG_HAS_SPIN_DELAY
+static __inline__ void
+pg_spin_delay_impl(void)
+{
+	/*
+	 * Adding a PAUSE in the spin delay loop is demonstrably a no-op on
+	 * Opteron, but it may be of some use on EM64T, so we keep it.
+	 */
+	__asm__ __volatile__(
+		" rep; nop			\n");
+}
+#endif
+
+#elif defined(WIN32_ONLY_COMPILER)
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	_mm_pause();
+}
+#endif
diff --git a/src/include/storage/atomics-arch-arm.h b/src/include/storage/atomics-arch-arm.h
new file mode 100644
index 0000000..a00f8f5
--- /dev/null
+++ b/src/include/storage/atomics-arch-arm.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-arm.h
+ *	  Atomic operations considerations specific to ARM
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-arm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+
+/*
+ * FIXME: Disable 32bit atomics for ARMs before v6 or error out
+ */
+
+/*
+ * 64 bit atomics on arm are implemented using kernel fallbacks and might be
+ * slow, so disable entirely for now.
+ */
+#define PG_DISABLE_64_BIT_ATOMICS
diff --git a/src/include/storage/atomics-arch-hppa.h b/src/include/storage/atomics-arch-hppa.h
new file mode 100644
index 0000000..8913801
--- /dev/null
+++ b/src/include/storage/atomics-arch-hppa.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-hppa.h
+ *	  Atomic operations considerations specific to HPPA
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-hppa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* HPPA doesn't do either read or write reordering */
+#define pg_memory_barrier_impl()		pg_compiler_barrier_impl()
diff --git a/src/include/storage/atomics-arch-i386.h b/src/include/storage/atomics-arch-i386.h
new file mode 100644
index 0000000..11d1b1e
--- /dev/null
+++ b/src/include/storage/atomics-arch-i386.h
@@ -0,0 +1,96 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-i386.h
+ *	  Atomic operations considerations specific to Intel x86
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-i386.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * i386 does not allow loads to be reordered with other loads, or stores to be
+ * reordered with other stores, but a load can be performed before a subsequent
+ * store.
+ *
+ * "lock; addl" has worked for longer than "mfence".
+ */
+
+#if defined(__INTEL_COMPILER)
+#define pg_memory_barrier_impl()		_mm_mfence()
+#elif defined(__GNUC__)
+#define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#endif
+
+#define pg_read_barrier_impl()		pg_compiler_barrier()
+#define pg_write_barrier_impl()		pg_compiler_barrier()
+
+#if defined(__GNUC__)
+
+/*
+ * Implement TAS if not available as a __sync builtin on very old gcc
+ * versions.
+ */
+#ifndef HAVE_GCC__SYNC_INT32_TAS
+static __inline__ int tas(volatile slock_t *lock)
+{
+	register slock_t _res = 1;
+
+	__asm__ __volatile__(
+		"	lock			\n"
+		"	xchgb	%0,%1	\n"
+:		"+q"(_res), "+m"(*lock)
+:
+:		"memory", "cc");
+	return (int) _res;
+}
+#endif
+
+
+#ifndef PG_HAS_SPIN_DELAY
+#define PG_HAS_SPIN_DELAY
+static __inline__ void
+pg_spin_delay_impl(void)
+{
+	/*
+	 * This sequence is equivalent to the PAUSE instruction ("rep" is
+	 * ignored by old IA32 processors if the following instruction is
+	 * not a string operation); the IA-32 Architecture Software
+	 * Developer's Manual, Vol. 3, Section 7.7.2 describes why using
+	 * PAUSE in the inner loop of a spin lock is necessary for good
+	 * performance:
+	 *
+	 *     The PAUSE instruction improves the performance of IA-32
+	 *     processors supporting Hyper-Threading Technology when
+	 *     executing spin-wait loops and other routines where one
+	 *     thread is accessing a shared lock or semaphore in a tight
+	 *     polling loop. When executing a spin-wait loop, the
+	 *     processor can suffer a severe performance penalty when
+	 *     exiting the loop because it detects a possible memory order
+	 *     violation and flushes the core processor's pipeline. The
+	 *     PAUSE instruction provides a hint to the processor that the
+	 *     code sequence is a spin-wait loop. The processor uses this
+	 *     hint to avoid the memory order violation and prevent the
+	 *     pipeline flush. In addition, the PAUSE instruction
+	 *     de-pipelines the spin-wait loop to prevent it from
+	 *     consuming execution resources excessively.
+	 */
+	__asm__ __volatile__(
+		" rep; nop			\n");
+}
+#endif
+
+#elif defined(WIN32_ONLY_COMPILER)
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	/* See comment for gcc code. Same code, MASM syntax */
+	__asm rep nop;
+}
+#endif
diff --git a/src/include/storage/atomics-arch-ia64.h b/src/include/storage/atomics-arch-ia64.h
new file mode 100644
index 0000000..2863431
--- /dev/null
+++ b/src/include/storage/atomics-arch-ia64.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-ia64.h
+ *	  Atomic operations considerations specific to Intel Itanium
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-ia64.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#if defined(__INTEL_COMPILER)
+#	define pg_memory_barrier_impl()		__mf()
+#elif defined(__GNUC__)
+#	define pg_memory_barrier_impl()		__asm__ __volatile__ ("mf" : : : "memory")
+#elif defined(__hpux)
+#	define pg_compiler_barrier_impl()	_Asm_sched_fence()
+#	define pg_memory_barrier_impl()		_Asm_mf()
+#endif
diff --git a/src/include/storage/atomics-arch-ppc.h b/src/include/storage/atomics-arch-ppc.h
new file mode 100644
index 0000000..f785c99
--- /dev/null
+++ b/src/include/storage/atomics-arch-ppc.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-ppc.h
+ *	  Atomic operations considerations specific to PowerPC
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-ppc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(__GNUC__)
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
+#define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#endif
diff --git a/src/include/storage/atomics-generic-acc.h b/src/include/storage/atomics-generic-acc.h
new file mode 100644
index 0000000..b1c9200
--- /dev/null
+++ b/src/include/storage/atomics-generic-acc.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-acc.h
+ *	  Atomic operations support when using HP's acc on HP-UX
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * inline assembly for Itanium-based HP-UX:
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf
+ * * Implementing Spinlocks on the Intel Itanium Architecture and PA-RISC
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf
+ *
+ * Itanium only supports a small set of numbers (-16, -8, -4, -1, 1, 4, 8, 16)
+ * for atomic add/sub, so we just implement everything but compare_exchange via
+ * the compare_exchange fallbacks in atomics-generic.h.
+ *
+ * src/include/storage/atomics-generic-acc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <machine/sys/inline.h>
+
+/* IA64 always has 32/64 bit atomics */
+
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+typedef pg_atomic_uint32 pg_atomic_flag;
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define MINOR_FENCE (_Asm_fence) (_UP_CALL_FENCE | _UP_SYS_FENCE | \
+								 _DOWN_CALL_FENCE | _DOWN_SYS_FENCE )
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	/*
+	 * We want a barrier, not just release/acquire semantics.
+	 */
+	_Asm_mf();
+	/*
+	 * Notes:
+	 * _DOWN_MEM_FENCE | _UP_MEM_FENCE prevents reordering by the compiler
+	 */
+	current =  _Asm_cmpxchg(_SZ_W, /* word */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	_Asm_mf();
+	current =  _Asm_cmpxchg(_SZ_D, /* doubleword */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#undef MINOR_FENCE
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-gcc.h b/src/include/storage/atomics-generic-gcc.h
new file mode 100644
index 0000000..3675b9d
--- /dev/null
+++ b/src/include/storage/atomics-generic-gcc.h
@@ -0,0 +1,174 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-gcc.h
+ *	  Atomic operations, implemented using gcc (or compatible) intrinsics.
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Legacy __sync Built-in Functions for Atomic Memory Access
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fsync-Builtins.html
+ * * Built-in functions for memory model aware atomic operations
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fatomic-Builtins.html
+ *
+ * src/include/storage/atomics-generic-gcc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+#define pg_compiler_barrier_impl()	__asm__ __volatile__("" ::: "memory");
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory
+ * barrier out of this compiler built-in.  But we prefer to rely on our
+ * own definitions where possible, and use this only as a fallback.
+ */
+#if !defined(pg_memory_barrier_impl)
+#	if defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#		define pg_memory_barrier_impl()		__atomic_thread_fence(__ATOMIC_SEQ_CST)
+#	elif (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
+#		define pg_memory_barrier_impl()		__sync_synchronize()
+#	endif
+#endif /* !defined(pg_memory_barrier_impl) */
+
+#if !defined(pg_read_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#		define pg_read_barrier_impl()		__atomic_thread_fence(__ATOMIC_ACQUIRE)
+#endif
+
+#if !defined(pg_write_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#		define pg_write_barrier_impl()		__atomic_thread_fence(__ATOMIC_RELEASE)
+#endif
+
+#ifdef HAVE_GCC__SYNC_INT32_TAS
+typedef struct pg_atomic_flag
+{
+	volatile uint32 value;
+} pg_atomic_flag;
+
+#define PG_ATOMIC_INIT_FLAG() {0}
+#endif
+
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#ifdef HAVE_GCC__ATOMIC_INT64_CAS
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#	define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+#endif /* HAVE_GCC__ATOMIC_INT64_CAS */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(HAVE_GCC__SYNC_INT32_TAS)
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	uint32 ret;
+	/* XXX: only an acquire barrier, make it a full one for now */
+	pg_memory_barrier_impl();
+	/* some platforms only support storing 1 here */
+	ret = __sync_lock_test_and_set(&ptr->value, 1);
+	return ret == 0;
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return ptr->value;
+}
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+STATIC_IF_INLINE void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: only a release barrier, make it a full one for now */
+	pg_memory_barrier_impl();
+	__sync_lock_release(&ptr->value);
+}
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+STATIC_IF_INLINE void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_clear_flag_impl(ptr);
+}
+
+#endif /* !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(HAVE_GCC__SYNC_INT32_TAS) */
+
+#if defined(HAVE_GCC__ATOMIC_INT32_CAS)
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+
+#elif defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#else
+#	error "strange configuration"
+#endif
+
+#ifdef HAVE_GCC__ATOMIC_INT64_CAS
+
+/* if __atomic is supported for 32 bit, it's also supported for 64 bit */
+#if defined(HAVE_GCC__ATOMIC_INT32_CAS)
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+
+#elif defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#else
+#	error "strange configuration"
+#endif
+
+#endif /* HAVE_GCC__ATOMIC_INT64_CAS */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-msvc.h b/src/include/storage/atomics-generic-msvc.h
new file mode 100644
index 0000000..baead62
--- /dev/null
+++ b/src/include/storage/atomics-generic-msvc.h
@@ -0,0 +1,83 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-msvc.h
+ *	  Atomic operations support when using MSVC
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Interlocked Variable Access
+ *   http://msdn.microsoft.com/en-us/library/ms684122%28VS.85%29.aspx
+ *
+ * src/include/storage/atomics-generic-msvc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <intrin.h>
+#include <windows.h>
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/* Should work on both MSVC and Borland. */
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier_impl()	_ReadWriteBarrier()
+
+#ifndef pg_memory_barrier_impl
+#define pg_memory_barrier_impl()	MemoryBarrier()
+#endif
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = InterlockedCompareExchange(&ptr->value, newval, *expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+/*
+ * The function (non-intrinsic) version is only available on Vista and up, so
+ * use the intrinsic version. It is only supported on processors newer than
+ * the 486, but since we require Windows XP as a minimum baseline, which does
+ * not run on a 486, we don't need to add checks for that case.
+ */
+#pragma intrinsic(_InterlockedCompareExchange64)
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = _InterlockedCompareExchange64(&ptr->value, newval, *expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+#endif /* PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64 */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-sunpro.h b/src/include/storage/atomics-generic-sunpro.h
new file mode 100644
index 0000000..d512409
--- /dev/null
+++ b/src/include/storage/atomics-generic-sunpro.h
@@ -0,0 +1,66 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-sunpro.h
+ *	  Atomic operations for Solaris' Sun Studio compiler
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * manpage for atomic_cas(3C)
+ *   http://www.unix.com/man-page/opensolaris/3c/atomic_cas/
+ *   http://docs.oracle.com/cd/E23824_01/html/821-1465/atomic-cas-3c.html
+ *
+ * src/include/storage/atomics-generic-sunpro.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+typedef pg_atomic_uint32 pg_atomic_flag;
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	current = atomic_cas_32(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	current = atomic_cas_64(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#endif
diff --git a/src/include/storage/atomics-generic-xlc.h b/src/include/storage/atomics-generic-xlc.h
new file mode 100644
index 0000000..6a47fe2
--- /dev/null
+++ b/src/include/storage/atomics-generic-xlc.h
@@ -0,0 +1,84 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-xlc.h
+ *	  Atomic operations for IBM's xlc compiler
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Synchronization and atomic built-in functions
+ *   http://publib.boulder.ibm.com/infocenter/lnxpcomp/v8v101/topic/com.ibm.xlcpp8l.doc/compiler/ref/bif_sync.htm
+ *
+ * src/include/storage/atomics-generic-xlc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+typedef pg_atomic_uint32 pg_atomic_flag;
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+/* declare outside the defined(PG_USE_INLINE) for visibility */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+
+/* FIXME: check for 64bit mode, only supported there */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+
+	/*
+	 * xlc's documentation tells us:
+	 * "If __compare_and_swap is used as a locking primitive, insert a call to
+	 * the __isync built-in function at the start of any critical sections."
+	 */
+	__isync();
+
+	/*
+	 * XXX: __compare_and_swap is defined to take signed parameters, but that
+	 * shouldn't matter since we don't perform any arithmetic operations. It
+	 * returns whether the swap took place and always copies the old contents
+	 * of *ptr into *expected.
+	 */
+	ret = __compare_and_swap((volatile int*) &ptr->value,
+							 (int *) expected, (int) newval);
+	return ret;
+}
+
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+
+	__isync();
+
+	/* see the notes for the 32 bit variant above */
+	ret = __compare_and_swaplp((volatile long*) &ptr->value,
+							   (long *) expected, (long) newval);
+	return ret;
+}
+#endif /* defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64) */
+
+#endif
diff --git a/src/include/storage/atomics-generic.h b/src/include/storage/atomics-generic.h
new file mode 100644
index 0000000..6c0858d
--- /dev/null
+++ b/src/include/storage/atomics-generic.h
@@ -0,0 +1,402 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic.h
+ *	  Atomic operations.
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * src/include/storage/atomics-generic.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#	error "should be included via atomics.h"
+#endif
+
+/* Need a way to implement spinlocks */
+#if !defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) && \
+	!defined(PG_HAVE_ATOMIC_EXCHANGE_U32) && \
+	!defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+#	error "postgres requires atomic ops for locking to be provided"
+#endif
+
+#ifndef pg_memory_barrier_impl
+#	error "postgres requires a memory barrier implementation"
+#endif
+
+#ifndef pg_compiler_barrier_impl
+#	error "postgres requires a compiler barrier implementation"
+#endif
+
+/*
+ * If read or write barriers are undefined, we upgrade them to full memory
+ * barriers.
+ */
+#if !defined(pg_read_barrier_impl)
+#	define pg_read_barrier_impl pg_memory_barrier_impl
+#endif
+#if !defined(pg_write_barrier_impl)
+#	define pg_write_barrier_impl pg_memory_barrier_impl
+#endif
+
+#ifndef PG_HAS_SPIN_DELAY
+#define PG_HAS_SPIN_DELAY
+#define pg_spin_delay_impl()	((void)0)
+#endif
+
+
+/* provide fallback */
+#ifndef PG_HAS_ATOMIC_TEST_SET_FLAG
+typedef pg_atomic_uint32 pg_atomic_flag;
+#define PG_ATOMIC_INIT_FLAG() {0}
+#endif
+
+#if !defined(PG_DISABLE_64_BIT_ATOMICS) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64)
+#	define PG_HAVE_64_BIT_ATOMICS
+#endif
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#ifndef PG_HAS_ATOMIC_READ_U32
+#define PG_HAS_ATOMIC_READ_U32
+STATIC_IF_INLINE uint32
+pg_atomic_read_u32_impl(volatile pg_atomic_uint32 *ptr)
+{
+	return ptr->value;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_WRITE_U32
+#define PG_HAS_ATOMIC_WRITE_U32
+STATIC_IF_INLINE void
+pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	ptr->value = val;
+}
+#endif
+
+/*
+ * provide fallback for test_and_set using atomic_exchange if available
+ */
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+STATIC_IF_INLINE void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_exchange_u32_impl(ptr, 1) == 0;
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return (bool) pg_atomic_read_u32_impl(ptr);
+}
+
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+STATIC_IF_INLINE void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier_impl();
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+/*
+ * provide fallback for test_and_set using atomic_compare_exchange if
+ * available.
+ */
+#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+STATIC_IF_INLINE void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	uint32 value = 0;
+	return pg_atomic_compare_exchange_u32_impl(ptr, &value, 1);
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+STATIC_IF_INLINE bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return (bool) pg_atomic_read_u32_impl(ptr);
+}
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+STATIC_IF_INLINE void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier_impl();
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG)
+#	error "No pg_atomic_test_and_set provided"
+#endif /* !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) */
+
+
+#ifndef PG_HAS_ATOMIC_INIT_U32
+#define PG_HAS_ATOMIC_INIT_U32
+STATIC_IF_INLINE void
+pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
+{
+	pg_atomic_write_u32_impl(ptr, val_);
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_EXCHANGE_U32
+#define PG_HAS_ATOMIC_EXCHANGE_U32
+STATIC_IF_INLINE uint32
+pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_ADD_U32
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 add_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_SUB_U32
+#define PG_HAS_ATOMIC_FETCH_SUB_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_sub_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 sub_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old - sub_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_AND_U32
+#define PG_HAS_ATOMIC_FETCH_AND_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_OR_U32
+#define PG_HAS_ATOMIC_FETCH_OR_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_ADD_FETCH_U32
+#define PG_HAS_ATOMIC_ADD_FETCH_U32
+STATIC_IF_INLINE uint32
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 add_)
+{
+	return pg_atomic_fetch_add_u32_impl(ptr, add_) + add_;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_SUB_FETCH_U32
+#define PG_HAS_ATOMIC_SUB_FETCH_U32
+STATIC_IF_INLINE uint32
+pg_atomic_sub_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 sub_)
+{
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_ADD_UNTIL_U32
+#define PG_HAS_ATOMIC_FETCH_ADD_UNTIL_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 add_, uint32 until)
+{
+	uint32 old;
+	while (true)
+	{
+		uint32 new_val;
+		old = pg_atomic_read_u32_impl(ptr);
+		if (old >= until)
+			break;
+
+		new_val = old + add_;
+		if (new_val > until)
+			new_val = until;
+
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, new_val))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+
+#ifndef PG_HAS_ATOMIC_INIT_U64
+#define PG_HAS_ATOMIC_INIT_U64
+STATIC_IF_INLINE void
+pg_atomic_init_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val_)
+{
+	pg_atomic_write_u64_impl(ptr, val_);
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_READ_U64
+#define PG_HAS_ATOMIC_READ_U64
+STATIC_IF_INLINE uint64
+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
+	return ptr->value;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_WRITE_U64
+#define PG_HAS_ATOMIC_WRITE_U64
+STATIC_IF_INLINE void
+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	ptr->value = val;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_ADD_U64
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 add_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_SUB_U64
+#define PG_HAS_ATOMIC_FETCH_SUB_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_sub_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 sub_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old - sub_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_AND_U64
+#define PG_HAS_ATOMIC_FETCH_AND_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_FETCH_OR_U64
+#define PG_HAS_ATOMIC_FETCH_OR_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_ADD_FETCH_U64
+#define PG_HAS_ATOMIC_ADD_FETCH_U64
+STATIC_IF_INLINE uint64
+pg_atomic_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 add_)
+{
+	return pg_atomic_fetch_add_u64_impl(ptr, add_) + add_;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_SUB_FETCH_U64
+#define PG_HAS_ATOMIC_SUB_FETCH_U64
+STATIC_IF_INLINE uint64
+pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 sub_)
+{
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#endif /* PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64 */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics.h b/src/include/storage/atomics.h
new file mode 100644
index 0000000..d16d6eb
--- /dev/null
+++ b/src/include/storage/atomics.h
@@ -0,0 +1,596 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.h
+ *	  Generic atomic operations support.
+ *
+ * Hardware and compiler dependent functions for manipulating memory
+ * atomically and dealing with cache coherency. Used to implement locking
+ * facilities and other concurrency safe constructs.
+ *
+ * There are three data types which can be manipulated atomically:
+ *
+ * * pg_atomic_flag - supports atomic test/clear operations, useful for
+ *      implementing spinlocks and similar constructs.
+ *
+ * * pg_atomic_uint32 - unsigned 32bit integer that can be manipulated correctly
+ *      by several processes at the same time.
+ *
+ * * pg_atomic_uint64 - optional 64bit variant of pg_atomic_uint32. Support
+ *      can be tested by checking for PG_HAVE_64_BIT_ATOMICS.
+ *
+ * The values stored in these types can only be accessed via the API provided
+ * in this file, i.e. it is not possible to write statements like `var +=
+ * 10`. Instead, for this example, one would have to use
+ * `pg_atomic_fetch_add_u32(&var, 10)`.
+ *
+ * This restriction exists for several reasons: Primarily, it prevents
+ * non-atomic math, which would not be concurrency safe, from being performed
+ * on these values. Secondly, it allows these types to be implemented on top
+ * of facilities - like C11's atomics - that don't provide direct access.
+ * Lastly, it serves as a useful hint that the surrounding code should be
+ * looked at more carefully because of concurrency concerns.
+ *
+ * To be useful, they usually need to be placed in memory shared between
+ * processes or threads, most frequently by embedding them in structs. Be
+ * careful to align atomic variables to their own size!
+ *
+ * Before variables of one of these types can be used, they need to be
+ * initialized using the type's initialization function (pg_atomic_init_flag,
+ * pg_atomic_init_u32 and pg_atomic_init_u64) or in case of global
+ * pg_atomic_flag variables using the PG_ATOMIC_INIT_FLAG macro. These
+ * initializations have to be performed before any (possibly concurrent)
+ * access is possible, otherwise the results are undefined.
+ *
+ * For several mathematical operations two variants exist: One that returns
+ * the old value before the operation was performed, and one that returns
+ * the new value. *_fetch_<op> variants will return the old value,
+ * *_<op>_fetch the new one.
+ *
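+ * For example (illustrative only; "counter" is a hypothetical variable, not
+ * part of this patch), starting from an initialized value of 0:
+ *
+ *     pg_atomic_uint32 counter;
+ *     pg_atomic_init_u32(&counter, 0);
+ *     pg_atomic_fetch_add_u32(&counter, 1);   returns 0, counter is now 1
+ *     pg_atomic_add_fetch_u32(&counter, 1);   returns 2, counter is now 2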
+ *
+ * To bring up postgres on a platform/compiler, at the very least
+ * one of
+ * * pg_atomic_test_set_flag(), pg_atomic_init_flag(), pg_atomic_clear_flag()
+ * * pg_atomic_compare_exchange_u32()
+ * * pg_atomic_exchange_u32()
+ * and all of
+ * * pg_compiler_barrier(), pg_memory_barrier()
+ * need to be implemented. There exist generic, hardware-independent
+ * implementations for several compilers which might be sufficient, although
+ * possibly not optimal, for a new platform.
+ *
+ * To add support for a new architecture, first review whether the compilers
+ * used on that platform already provide adequate support via the files
+ * included in the "Compiler specific" section below. If not, either add an
+ * include for a new file providing generic support for your compiler there,
+ * and/or add or update an architecture specific include in the "Architecture
+ * specific" section below.
+ *
+ * Operations that aren't implemented by either architecture or compiler
+ * specific code are implemented on top of compare_exchange() loops or
+ * spinlocks.
+ *
+ * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
+ * whenever possible. Writing correct code using these facilities is hard.
+ *
+ * For an introduction to using memory barriers within the PostgreSQL backend,
+ * see src/backend/storage/lmgr/README.barrier
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/atomics.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ATOMICS_H
+#define ATOMICS_H
+
+#define INSIDE_ATOMICS_H
+
+/*
+ * Architecture specific implementations/considerations.
+ */
+#if defined(__arm__) || defined(__arm) || \
+	defined(__aarch64__) || defined(__aarch64)
+#	include "storage/atomics-arch-arm.h"
+#elif defined(__i386__) || defined(__i386)
+#	include "storage/atomics-arch-i386.h"
+#elif defined(__x86_64__)
+#	include "storage/atomics-arch-amd64.h"
+#elif defined(__ia64__) || defined(__ia64)
+#	include "storage/atomics-arch-ia64.h"
+#elif defined(__ppc__) || defined(__powerpc__) || \
+	defined(__ppc64__) || defined(__powerpc64__)
+#	include "storage/atomics-arch-ppc.h"
+#elif defined(__alpha) || defined(__alpha__)
+#	include "storage/atomics-arch-alpha.h"
+#elif defined(__hppa) || defined(__hppa__)
+#	include "storage/atomics-arch-hppa.h"
+#endif
+
+/*
+ * Compiler specific, but architecture independent implementations
+ */
+
+/* gcc or compatible, including icc */
+#if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
+#	include "storage/atomics-generic-gcc.h"
+/* MS Visual Studio */
+#elif defined(WIN32_ONLY_COMPILER)
+#	include "storage/atomics-generic-msvc.h"
+/* presumably only ac++ */
+#elif defined(__hpux) && defined(__ia64) && !defined(__GNUC__)
+#	include "storage/atomics-generic-acc.h"
+/* Sun Studio */
+#elif defined(__SUNPRO_C) && !defined(__GNUC__)
+#	include "storage/atomics-generic-sunpro.h"
+/* IBM's xlc */
+#elif (defined(__IBMC__) || defined(__IBMCPP__)) && !defined(__GNUC__)
+#	include "storage/atomics-generic-xlc.h"
+#else
+#	error "unsupported compiler"
+#endif
+
+/* if we have spinlocks, but not atomic ops, emulate them */
+#if defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) && \
+	!defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
+#	define PG_USE_ATOMICS_EMULATION
+#	include "storage/atomics-generic-spinlock.h"
+#endif
+
+
+/*
+ * Provide fallback implementations for operations that aren't provided if
+ * possible.
+ */
+#include "storage/atomics-generic.h"
+
+/*
+ * pg_compiler_barrier - prevent the compiler from moving code
+ *
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ */
+#define pg_compiler_barrier()	pg_compiler_barrier_impl()
+
+/*
+ * pg_memory_barrier - prevent the CPU from reordering memory access
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ */
+#define pg_memory_barrier()	pg_memory_barrier_impl()
+
+/*
+ * pg_(read|write)_barrier - prevent the CPU from reordering memory access
+ *
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier.	Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores.  Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier.  In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ */
+#define pg_read_barrier()	pg_read_barrier_impl()
+#define pg_write_barrier()	pg_write_barrier_impl()
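+
+/*
+ * Illustrative only (not part of this patch): a typical read/write barrier
+ * pairing publishes data guarded by a flag, where "data" and "flag" are
+ * hypothetical shared variables:
+ *
+ *		writer:		data = 42;   pg_write_barrier();   flag = true;
+ *		reader:		if (flag) { pg_read_barrier(); Assert(data == 42); }
+ */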
+
+/*
+ * Spinloop delay -
+ */
+#define pg_spin_delay()	pg_spin_delay_impl()
+
+
+/*
+ * pg_atomic_init_flag - initialize lock
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_test_set_flag - TAS()
+ *
+ * Full barrier semantics. (XXX? Only acquire?)
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_unlocked_test_flag - test whether the flag is set, without locking
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_clear_flag - release lock set by TAS()
+ *
+ * Full barrier semantics. (XXX? Only release?)
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
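+
+/*
+ * Illustrative only (not part of this patch): a pg_atomic_flag can be used
+ * like a TAS spinlock; "mutex" is a hypothetical shared variable:
+ *
+ *		static pg_atomic_flag mutex = PG_ATOMIC_INIT_FLAG();
+ *
+ *		while (!pg_atomic_test_set_flag(&mutex))
+ *			pg_spin_delay();
+ *		... critical section ...
+ *		pg_atomic_clear_flag(&mutex);
+ */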
+
+/*
+ * pg_atomic_init_u32 - initialize atomic variable
+ *
+ * Has to be done before any usage, except when the atomic variable is
+ * declared statically, in which case it is initialized to 0.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+
+/*
+ * pg_atomic_read_u32 - read from atomic variable
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr);
+
+/*
+ * pg_atomic_write_u32 - write to atomic variable
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+
+/*
+ * pg_atomic_exchange_u32 - exchange newval with current value
+ *
+ * Returns the old value of 'ptr' before the swap.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval);
+
+/*
+ * pg_atomic_compare_exchange_u32 - CAS operation
+ *
+ * Atomically compare the current value of *ptr with *expected and store
+ * newval iff they are equal. The current value of *ptr will always be
+ * stored in *expected.
+ *
+ * Return whether the values have been exchanged.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+							   uint32 *expected, uint32 newval);
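+
+/*
+ * Illustrative only ("val" is a hypothetical pg_atomic_uint32, not part of
+ * this patch): a typical compare-exchange retry loop that atomically doubles
+ * a value:
+ *
+ *		uint32 old = pg_atomic_read_u32(&val);
+ *		while (!pg_atomic_compare_exchange_u32(&val, &old, old * 2))
+ *			;	on failure 'old' already contains the current value, so retry
+ */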
+
+/*
+ * pg_atomic_fetch_add_u32 - atomically add to variable
+ *
+ * Returns the value of *ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, uint32 add_);
+
+/*
+ * pg_atomic_fetch_sub_u32 - atomically subtract from variable
+ *
+ * Returns the value of *ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, uint32 sub_);
+
+/*
+ * pg_atomic_fetch_and_u32 - atomically bit-and and_ with variable
+ *
+ * Returns the value of *ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_);
+
+/*
+ * pg_atomic_fetch_or_u32 - atomically bit-or or_ with variable
+ *
+ * Returns the value of *ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_);
+
+/*
+ * pg_atomic_add_fetch_u32 - atomically add to variable
+ *
+ * Returns the value of *ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 add_);
+
+/*
+ * pg_atomic_sub_fetch_u32 - atomically subtract from variable
+ *
+ * Returns the value of *ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 sub_);
+
+/*
+ * pg_atomic_fetch_add_until_u32 - saturated addition to variable
+ *
+ * Returns the value of *ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, uint32 add_,
+							  uint32 until);
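+
+/*
+ * Illustrative only: with *ptr currently 8, pg_atomic_fetch_add_until_u32(ptr, 5, 10)
+ * stores 10 (clamped to 'until') and returns 8; with *ptr already 12, nothing
+ * is stored and 12 is returned.
+ */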
+
+/* ----
+ * The 64 bit operations have the same semantics as their 32bit counterparts
+ * if they are available. Test for their existence using the
+ * PG_HAVE_64_BIT_ATOMICS define.
+ * ----
+ */
+#ifdef PG_HAVE_64_BIT_ATOMICS
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr);
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval);
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+							   uint64 *expected, uint64 newval);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, uint64 add_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, uint64 sub_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_add_fetch_u64(volatile pg_atomic_uint64 *ptr, uint64 add_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, uint64 sub_);
+
+#endif /* PG_HAVE_64_BIT_ATOMICS */
+
+/*
+ * The following functions are wrapper functions around the platform specific
+ * implementation of the atomic operations performing common checks.
+ */
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \
+	Assert(TYPEALIGN((bndr), (uintptr_t)(ptr)) == (uintptr_t)(ptr))
+
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	pg_atomic_clear_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	return pg_atomic_test_set_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	return pg_atomic_unlocked_test_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_clear_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	pg_atomic_clear_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	pg_atomic_init_u32_impl(ptr, val);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	pg_atomic_write_u32_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_read_u32_impl(ptr);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	return pg_atomic_exchange_u32_impl(ptr, newval);
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+							   uint32 *expected, uint32 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	CHECK_POINTER_ALIGNMENT(expected, 4);
+
+	return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, uint32 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_u32_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, uint32 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_and_u32_impl(ptr, and_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_or_u32_impl(ptr, or_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_add_fetch_u32_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_sub_fetch_u32_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, uint32 add_,
+							  uint32 until)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
+}
+
+#ifdef PG_HAVE_64_BIT_ATOMICS
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+
+	pg_atomic_init_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_read_u64_impl(ptr);
+}
+
+STATIC_IF_INLINE void
+pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	pg_atomic_write_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+
+	return pg_atomic_exchange_u64_impl(ptr, newval);
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+							   uint64 *expected, uint64 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	CHECK_POINTER_ALIGNMENT(expected, 8);
+	return pg_atomic_compare_exchange_u64_impl(ptr, expected, newval);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, uint64 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_add_u64_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, uint64 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_and_u64_impl(ptr, and_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_or_u64_impl(ptr, or_);
+}
+#endif /* PG_HAVE_64_BIT_ATOMICS */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
+
+#undef INSIDE_ATOMICS_H
+
+#endif /* ATOMICS_H */
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index 16a9a0c..607a3c9 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -13,137 +13,12 @@
 #ifndef BARRIER_H
 #define BARRIER_H
 
-#include "storage/s_lock.h"
-
-extern slock_t dummy_spinlock;
-
-/*
- * A compiler barrier need not (and preferably should not) emit any actual
- * machine code, but must act as an optimization fence: the compiler must not
- * reorder loads or stores to main memory around the barrier.  However, the
- * CPU may still reorder loads or stores at runtime, if the architecture's
- * memory model permits this.
- *
- * A memory barrier must act as a compiler barrier, and in addition must
- * guarantee that all loads and stores issued prior to the barrier are
- * completed before any loads or stores issued after the barrier.  Unless
- * loads and stores are totally ordered (which is not the case on most
- * architectures) this requires issuing some sort of memory fencing
- * instruction.
- *
- * A read barrier must act as a compiler barrier, and in addition must
- * guarantee that any loads issued prior to the barrier are completed before
- * any loads issued after the barrier.	Similarly, a write barrier acts
- * as a compiler barrier, and also orders stores.  Read and write barriers
- * are thus weaker than a full memory barrier, but stronger than a compiler
- * barrier.  In practice, on machines with strong memory ordering, read and
- * write barriers may require nothing more than a compiler barrier.
- *
- * For an introduction to using memory barriers within the PostgreSQL backend,
- * see src/backend/storage/lmgr/README.barrier
- */
-
-#if defined(DISABLE_BARRIERS)
-
-/*
- * Fall through to the spinlock-based implementation.
- */
-#elif defined(__INTEL_COMPILER)
-
-/*
- * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
- */
-#if defined(__ia64__) || defined(__ia64)
-#define pg_memory_barrier()		__mf()
-#elif defined(__i386__) || defined(__x86_64__)
-#define pg_memory_barrier()		_mm_mfence()
-#endif
-
-#define pg_compiler_barrier()	__memory_barrier()
-#elif defined(__GNUC__)
-
-/* This works on any architecture, since it's only talking to GCC itself. */
-#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
-
-#if defined(__i386__)
-
 /*
- * i386 does not allow loads to be reordered with other loads, or stores to be
- * reordered with other stores, but a load can be performed before a subsequent
- * store.
- *
- * "lock; addl" has worked for longer than "mfence".
+ * The barrier implementations used to live in this file, full of compiler and
+ * architecture dependent defines, but they are now part of the atomics.h
+ * infrastructure; this header is kept only for backward compatibility.
  */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__x86_64__)		/* 64 bit x86 */
-
-/*
- * x86_64 has similar ordering characteristics to i386.
- *
- * Technically, some x86-ish chips support uncached memory access and/or
- * special instructions that are weakly ordered.  In those cases we'd need
- * the read and write barriers to be lfence and sfence.  But since we don't
- * do those things, a compiler barrier should be enough.
- */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__ia64__) || defined(__ia64)
-
-/*
- * Itanium is weakly ordered, so read and write barriers require a full
- * fence.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")
-#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
-
-/*
- * lwsync orders loads with respect to each other, and similarly with stores.
- * But a load can be performed before a subsequent store, so sync must be used
- * for a full memory barrier.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
-#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-#elif defined(__alpha) || defined(__alpha__)	/* Alpha */
-
-/*
- * Unlike all other known architectures, Alpha allows dependent reads to be
- * reordered, but we don't currently find it necessary to provide a conditional
- * read barrier to cover that case.  We might need to add that later.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
-#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
-#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
-#elif defined(__hppa) || defined(__hppa__)		/* HP PA-RISC */
-
-/* HPPA doesn't do either read or write reordering */
-#define pg_memory_barrier()		pg_compiler_barrier()
-#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
-
-/*
- * If we're on GCC 4.1.0 or higher, we should be able to get a memory
- * barrier out of this compiler built-in.  But we prefer to rely on our
- * own definitions where possible, and use this only as a fallback.
- */
-#define pg_memory_barrier()		__sync_synchronize()
-#endif
-#elif defined(__ia64__) || defined(__ia64)
-
-#define pg_compiler_barrier()	_Asm_sched_fence()
-#define pg_memory_barrier()		_Asm_mf()
-#elif defined(WIN32_ONLY_COMPILER)
-
-/* Should work on both MSVC and Borland. */
-#include <intrin.h>
-#pragma intrinsic(_ReadWriteBarrier)
-#define pg_compiler_barrier()	_ReadWriteBarrier()
-#define pg_memory_barrier()		MemoryBarrier()
-#endif
+#include "storage/atomics.h"
 
 /*
  * If we have no memory barrier implementation for this architecture, we
@@ -157,6 +32,10 @@ extern slock_t dummy_spinlock;
  * fence.  But all of our actual implementations seem OK in this regard.
  */
 #if !defined(pg_memory_barrier)
+#include "storage/s_lock.h"
+
+extern slock_t dummy_spinlock;
+
 #define pg_memory_barrier() \
 	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
 #endif
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 7dcd5d9..73aebfa 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -21,10 +21,6 @@
  *	void S_UNLOCK(slock_t *lock)
  *		Unlock a previously acquired lock.
  *
- *	bool S_LOCK_FREE(slock_t *lock)
- *		Tests if the lock is free. Returns TRUE if free, FALSE if locked.
- *		This does *not* change the state of the lock.
- *
  *	void SPIN_DELAY(void)
  *		Delay operation to occur inside spinlock wait loop.
  *
@@ -94,879 +90,17 @@
 #ifndef S_LOCK_H
 #define S_LOCK_H
 
+#include "storage/atomics.h"
 #include "storage/pg_sema.h"
 
-#ifdef HAVE_SPINLOCKS	/* skip spinlocks if requested */
-
-
-#if defined(__GNUC__) || defined(__INTEL_COMPILER)
-/*************************************************************************
- * All the gcc inlines
- * Gcc consistently defines the CPU as __cpu__.
- * Other compilers use __cpu or __cpu__ so we test for both in those cases.
- */
-
-/*----------
- * Standard gcc asm format (assuming "volatile slock_t *lock"):
-
-	__asm__ __volatile__(
-		"	instruction	\n"
-		"	instruction	\n"
-		"	instruction	\n"
-:		"=r"(_res), "+m"(*lock)		// return register, in/out lock value
-:		"r"(lock)					// lock pointer, in input register
-:		"memory", "cc");			// show clobbered registers here
-
- * The output-operands list (after first colon) should always include
- * "+m"(*lock), whether or not the asm code actually refers to this
- * operand directly.  This ensures that gcc believes the value in the
- * lock variable is used and set by the asm code.  Also, the clobbers
- * list (after third colon) should always include "memory"; this prevents
- * gcc from thinking it can cache the values of shared-memory fields
- * across the asm code.  Add "cc" if your asm code changes the condition
- * code register, and also list any temp registers the code uses.
- *----------
- */
-
-
-#ifdef __i386__		/* 32-bit i386 */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res = 1;
-
-	/*
-	 * Use a non-locking test before asserting the bus lock.  Note that the
-	 * extra test appears to be a small loss on some x86 platforms and a small
-	 * win on others; it's by no means clear that we should keep it.
-	 *
-	 * When this was last tested, we didn't have separate TAS() and TAS_SPIN()
-	 * macros.  Nowadays it probably would be better to do a non-locking test
-	 * in TAS_SPIN() but not in TAS(), like on x86_64, but no-one's done the
-	 * testing to verify that.  Without some empirical evidence, better to
-	 * leave it alone.
-	 */
-	__asm__ __volatile__(
-		"	cmpb	$0,%1	\n"
-		"	jne		1f		\n"
-		"	lock			\n"
-		"	xchgb	%0,%1	\n"
-		"1: \n"
-:		"+q"(_res), "+m"(*lock)
-:
-:		"memory", "cc");
-	return (int) _res;
-}
-
-#define SPIN_DELAY() spin_delay()
-
-static __inline__ void
-spin_delay(void)
-{
-	/*
-	 * This sequence is equivalent to the PAUSE instruction ("rep" is
-	 * ignored by old IA32 processors if the following instruction is
-	 * not a string operation); the IA-32 Architecture Software
-	 * Developer's Manual, Vol. 3, Section 7.7.2 describes why using
-	 * PAUSE in the inner loop of a spin lock is necessary for good
-	 * performance:
-	 *
-	 *     The PAUSE instruction improves the performance of IA-32
-	 *     processors supporting Hyper-Threading Technology when
-	 *     executing spin-wait loops and other routines where one
-	 *     thread is accessing a shared lock or semaphore in a tight
-	 *     polling loop. When executing a spin-wait loop, the
-	 *     processor can suffer a severe performance penalty when
-	 *     exiting the loop because it detects a possible memory order
-	 *     violation and flushes the core processor's pipeline. The
-	 *     PAUSE instruction provides a hint to the processor that the
-	 *     code sequence is a spin-wait loop. The processor uses this
-	 *     hint to avoid the memory order violation and prevent the
-	 *     pipeline flush. In addition, the PAUSE instruction
-	 *     de-pipelines the spin-wait loop to prevent it from
-	 *     consuming execution resources excessively.
-	 */
-	__asm__ __volatile__(
-		" rep; nop			\n");
-}
-
-#endif	 /* __i386__ */
-
-
-#ifdef __x86_64__		/* AMD Opteron, Intel EM64T */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-/*
- * On Intel EM64T, it's a win to use a non-locking test before the xchg proper,
- * but only when spinning.
- *
- * See also Implementing Scalable Atomic Locks for Multi-Core Intel(tm) EM64T
- * and IA32, by Michael Chynoweth and Mary R. Lee. As of this writing, it is
- * available at:
- * http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures
- */
-#define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res = 1;
-
-	__asm__ __volatile__(
-		"	lock			\n"
-		"	xchgb	%0,%1	\n"
-:		"+q"(_res), "+m"(*lock)
-:
-:		"memory", "cc");
-	return (int) _res;
-}
-
-#define SPIN_DELAY() spin_delay()
-
-static __inline__ void
-spin_delay(void)
-{
-	/*
-	 * Adding a PAUSE in the spin delay loop is demonstrably a no-op on
-	 * Opteron, but it may be of some use on EM64T, so we keep it.
-	 */
-	__asm__ __volatile__(
-		" rep; nop			\n");
-}
-
-#endif	 /* __x86_64__ */
-
-
-#if defined(__ia64__) || defined(__ia64)
-/*
- * Intel Itanium, gcc or Intel's compiler.
- *
- * Itanium has weak memory ordering, but we rely on the compiler to enforce
- * strict ordering of accesses to volatile data.  In particular, while the
- * xchg instruction implicitly acts as a memory barrier with 'acquire'
- * semantics, we do not have an explicit memory fence instruction in the
- * S_UNLOCK macro.  We use a regular assignment to clear the spinlock, and
- * trust that the compiler marks the generated store instruction with the
- * ".rel" opcode.
- *
- * Testing shows that assumption to hold on gcc, although I could not find
- * any explicit statement on that in the gcc manual.  In Intel's compiler,
- * the -m[no-]serialize-volatile option controls that, and testing shows that
- * it is enabled by default.
- */
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#define TAS(lock) tas(lock)
-
-/* On IA64, it's a win to use a non-locking test before the xchg proper */
-#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
-
-#ifndef __INTEL_COMPILER
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	long int	ret;
-
-	__asm__ __volatile__(
-		"	xchg4 	%0=%1,%2	\n"
-:		"=r"(ret), "+m"(*lock)
-:		"r"(1)
-:		"memory");
-	return (int) ret;
-}
-
-#else /* __INTEL_COMPILER */
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	int		ret;
-
-	ret = _InterlockedExchange(lock,1);	/* this is a xchg asm macro */
-
-	return ret;
-}
-
-#endif /* __INTEL_COMPILER */
-#endif	 /* __ia64__ || __ia64 */
-
-
-/*
- * On ARM, we use __sync_lock_test_and_set(int *, int) if available, and if
- * not fall back on the SWPB instruction.  SWPB does not work on ARMv6 or
- * later, so the compiler builtin is preferred if available.  Note also that
- * the int-width variant of the builtin works on more chips than other widths.
- */
-#if defined(__arm__) || defined(__arm)
-#define HAS_TEST_AND_SET
-
-#define TAS(lock) tas(lock)
-
-#ifdef HAVE_GCC_INT_ATOMICS
-
-typedef int slock_t;
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	return __sync_lock_test_and_set(lock, 1);
-}
-
-#define S_UNLOCK(lock) __sync_lock_release(lock)
-
-#else /* !HAVE_GCC_INT_ATOMICS */
-
-typedef unsigned char slock_t;
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res = 1;
-
-	__asm__ __volatile__(
-		"	swpb 	%0, %0, [%2]	\n"
-:		"+r"(_res), "+m"(*lock)
-:		"r"(lock)
-:		"memory");
-	return (int) _res;
-}
-
-#endif	 /* HAVE_GCC_INT_ATOMICS */
-#endif	 /* __arm__ */
-
-
-/*
- * On ARM64, we use __sync_lock_test_and_set(int *, int) if available.
- */
-#if defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC_INT_ATOMICS
-#define HAS_TEST_AND_SET
-
-#define TAS(lock) tas(lock)
-
-typedef int slock_t;
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	return __sync_lock_test_and_set(lock, 1);
-}
-
-#define S_UNLOCK(lock) __sync_lock_release(lock)
-
-#endif	 /* HAVE_GCC_INT_ATOMICS */
-#endif	 /* __aarch64__ */
-
-
-/* S/390 and S/390x Linux (32- and 64-bit zSeries) */
-#if defined(__s390__) || defined(__s390x__)
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#define TAS(lock)	   tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	int			_res = 0;
-
-	__asm__	__volatile__(
-		"	cs 	%0,%3,0(%2)		\n"
-:		"+d"(_res), "+m"(*lock)
-:		"a"(lock), "d"(1)
-:		"memory", "cc");
-	return _res;
-}
-
-#endif	 /* __s390__ || __s390x__ */
-
-
-#if defined(__sparc__)		/* Sparc */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res;
-
-	/*
-	 *	See comment in /pg/backend/port/tas/solaris_sparc.s for why this
-	 *	uses "ldstub", and that file uses "cas".  gcc currently generates
-	 *	sparcv7-targeted binaries, so "cas" use isn't possible.
-	 */
-	__asm__ __volatile__(
-		"	ldstub	[%2], %0	\n"
-:		"=r"(_res), "+m"(*lock)
-:		"r"(lock)
-:		"memory");
-	return (int) _res;
-}
-
-#endif	 /* __sparc__ */
-
-
-/* PowerPC */
-#if defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#define TAS(lock) tas(lock)
-
-/* On PPC, it's a win to use a non-locking test before the lwarx */
-#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
-
-/*
- * NOTE: per the Enhanced PowerPC Architecture manual, v1.0 dated 7-May-2002,
- * an isync is a sufficient synchronization barrier after a lwarx/stwcx loop.
- * On newer machines, we can use lwsync instead for better performance.
- */
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	slock_t _t;
-	int _res;
-
-	__asm__ __volatile__(
-#ifdef USE_PPC_LWARX_MUTEX_HINT
-"	lwarx   %0,0,%3,1	\n"
-#else
-"	lwarx   %0,0,%3		\n"
-#endif
-"	cmpwi   %0,0		\n"
-"	bne     1f			\n"
-"	addi    %0,%0,1		\n"
-"	stwcx.  %0,0,%3		\n"
-"	beq     2f         	\n"
-"1:	li      %1,1		\n"
-"	b		3f			\n"
-"2:						\n"
-#ifdef USE_PPC_LWSYNC
-"	lwsync				\n"
-#else
-"	isync				\n"
-#endif
-"	li      %1,0		\n"
-"3:						\n"
-
-:	"=&r"(_t), "=r"(_res), "+m"(*lock)
-:	"r"(lock)
-:	"memory", "cc");
-	return _res;
-}
-
-/*
- * PowerPC S_UNLOCK is almost standard but requires a "sync" instruction.
- * On newer machines, we can use lwsync instead for better performance.
- */
-#ifdef USE_PPC_LWSYNC
-#define S_UNLOCK(lock)	\
-do \
-{ \
-	__asm__ __volatile__ ("	lwsync \n"); \
-	*((volatile slock_t *) (lock)) = 0; \
-} while (0)
-#else
-#define S_UNLOCK(lock)	\
-do \
-{ \
-	__asm__ __volatile__ ("	sync \n"); \
-	*((volatile slock_t *) (lock)) = 0; \
-} while (0)
-#endif /* USE_PPC_LWSYNC */
-
-#endif /* powerpc */
-
-
-/* Linux Motorola 68k */
-#if (defined(__mc68000__) || defined(__m68k__)) && defined(__linux__)
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register int rv;
-
-	__asm__	__volatile__(
-		"	clrl	%0		\n"
-		"	tas		%1		\n"
-		"	sne		%0		\n"
-:		"=d"(rv), "+m"(*lock)
-:
-:		"memory", "cc");
-	return rv;
-}
-
-#endif	 /* (__mc68000__ || __m68k__) && __linux__ */
-
-
-/*
- * VAXen -- even multiprocessor ones
- * (thanks to Tom Ivar Helbekkmo)
- */
-#if defined(__vax__)
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register int	_res;
-
-	__asm__ __volatile__(
-		"	movl 	$1, %0			\n"
-		"	bbssi	$0, (%2), 1f	\n"
-		"	clrl	%0				\n"
-		"1: \n"
-:		"=&r"(_res), "+m"(*lock)
-:		"r"(lock)
-:		"memory");
-	return _res;
-}
-
-#endif	 /* __vax__ */
-
-#if defined(__alpha) || defined(__alpha__)	/* Alpha */
-/*
- * Correct multi-processor locking methods are explained in section 5.5.3
- * of the Alpha AXP Architecture Handbook, which at this writing can be
- * found at ftp://ftp.netbsd.org/pub/NetBSD/misc/dec-docs/index.html.
- * For gcc we implement the handbook's code directly with inline assembler.
- */
-#define HAS_TEST_AND_SET
-
-typedef unsigned long slock_t;
-
-#define TAS(lock)  tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register slock_t _res;
-
-	__asm__	__volatile__(
-		"	ldq		$0, %1	\n"
-		"	bne		$0, 2f	\n"
-		"	ldq_l	%0, %1	\n"
-		"	bne		%0, 2f	\n"
-		"	mov		1,  $0	\n"
-		"	stq_c	$0, %1	\n"
-		"	beq		$0, 2f	\n"
-		"	mb				\n"
-		"	br		3f		\n"
-		"2:	mov		1, %0	\n"
-		"3:					\n"
-:		"=&r"(_res), "+m"(*lock)
-:
-:		"memory", "0");
-	return (int) _res;
-}
-
-#define S_UNLOCK(lock)	\
-do \
-{\
-	__asm__ __volatile__ ("	mb \n"); \
-	*((volatile slock_t *) (lock)) = 0; \
-} while (0)
-
-#endif /* __alpha || __alpha__ */
-
-
-#if defined(__mips__) && !defined(__sgi)	/* non-SGI MIPS */
-/* Note: on SGI we use the OS' mutex ABI, see below */
-/* Note: R10000 processors require a separate SYNC */
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register volatile slock_t *_l = lock;
-	register int _res;
-	register int _tmp;
-
-	__asm__ __volatile__(
-		"       .set push           \n"
-		"       .set mips2          \n"
-		"       .set noreorder      \n"
-		"       .set nomacro        \n"
-		"       ll      %0, %2      \n"
-		"       or      %1, %0, 1   \n"
-		"       sc      %1, %2      \n"
-		"       xori    %1, 1       \n"
-		"       or      %0, %0, %1  \n"
-		"       sync                \n"
-		"       .set pop              "
-:		"=&r" (_res), "=&r" (_tmp), "+R" (*_l)
-:
-:		"memory");
-	return _res;
-}
-
-/* MIPS S_UNLOCK is almost standard but requires a "sync" instruction */
-#define S_UNLOCK(lock)	\
-do \
-{ \
-	__asm__ __volatile__( \
-		"       .set push           \n" \
-		"       .set mips2          \n" \
-		"       .set noreorder      \n" \
-		"       .set nomacro        \n" \
-		"       sync                \n" \
-		"       .set pop              "); \
-	*((volatile slock_t *) (lock)) = 0; \
-} while (0)
-
-#endif /* __mips__ && !__sgi */
-
-
-#if defined(__m32r__) && defined(HAVE_SYS_TAS_H)	/* Renesas' M32R */
-#define HAS_TEST_AND_SET
-
-#include <sys/tas.h>
-
-typedef int slock_t;
-
-#define TAS(lock) tas(lock)
-
-#endif /* __m32r__ */
-
-
-#if defined(__sh__)				/* Renesas' SuperH */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock) tas(lock)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	register int _res;
-
-	/*
-	 * This asm is coded as if %0 could be any register, but actually SuperH
-	 * restricts the target of xor-immediate to be R0.  That's handled by
-	 * the "z" constraint on _res.
-	 */
-	__asm__ __volatile__(
-		"	tas.b @%2    \n"
-		"	movt  %0     \n"
-		"	xor   #1,%0  \n"
-:		"=z"(_res), "+m"(*lock)
-:		"r"(lock)
-:		"memory", "t");
-	return _res;
-}
-
-#endif	 /* __sh__ */
-
-
-/* These live in s_lock.c, but only for gcc */
-
-
-#if defined(__m68k__) && !defined(__linux__)	/* non-Linux Motorola 68k */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-#endif
-
-
-#endif	/* defined(__GNUC__) || defined(__INTEL_COMPILER) */
-
-
-
-/*
- * ---------------------------------------------------------------------
- * Platforms that use non-gcc inline assembly:
- * ---------------------------------------------------------------------
- */
-
-#if !defined(HAS_TEST_AND_SET)	/* We didn't trigger above, let's try here */
-
-
-#if defined(USE_UNIVEL_CC)		/* Unixware compiler */
-#define HAS_TEST_AND_SET
-
-typedef unsigned char slock_t;
-
-#define TAS(lock)	tas(lock)
-
-asm int
-tas(volatile slock_t *s_lock)
-{
-/* UNIVEL wants %mem in column 1, so we don't pg_indent this file */
-%mem s_lock
-	pushl %ebx
-	movl s_lock, %ebx
-	movl $255, %eax
-	lock
-	xchgb %al, (%ebx)
-	popl %ebx
-}
-
-#endif	 /* defined(USE_UNIVEL_CC) */
-
-
-#if defined(__alpha) || defined(__alpha__)	/* Tru64 Unix Alpha compiler */
-/*
- * The Tru64 compiler doesn't support gcc-style inline asm, but it does
- * have some builtin functions that accomplish much the same results.
- * For simplicity, slock_t is defined as long (ie, quadword) on Alpha
- * regardless of the compiler in use.  LOCK_LONG and UNLOCK_LONG only
- * operate on an int (ie, longword), but that's OK as long as we define
- * S_INIT_LOCK to zero out the whole quadword.
- */
-#define HAS_TEST_AND_SET
-
-typedef unsigned long slock_t;
-
-#include <alpha/builtins.h>
-#define S_INIT_LOCK(lock)  (*(lock) = 0)
-#define TAS(lock)		   (__LOCK_LONG_RETRY((lock), 1) == 0)
-#define S_UNLOCK(lock)	   __UNLOCK_LONG(lock)
-
-#endif	 /* __alpha || __alpha__ */
-
-
-#if defined(__hppa) || defined(__hppa__)	/* HP PA-RISC, GCC and HP compilers */
-/*
- * HP's PA-RISC
- *
- * See src/backend/port/hpux/tas.c.template for details about LDCWX.  Because
- * LDCWX requires a 16-byte-aligned address, we declare slock_t as a 16-byte
- * struct.  The active word in the struct is whichever has the aligned address;
- * the other three words just sit at -1.
- *
- * When using gcc, we can inline the required assembly code.
- */
-#define HAS_TEST_AND_SET
-
-typedef struct
-{
-	int			sema[4];
-} slock_t;
-
-#define TAS_ACTIVE_WORD(lock)	((volatile int *) (((uintptr_t) (lock) + 15) & ~15))
-
-#if defined(__GNUC__)
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	volatile int *lockword = TAS_ACTIVE_WORD(lock);
-	register int lockval;
-
-	__asm__ __volatile__(
-		"	ldcwx	0(0,%2),%0	\n"
-:		"=r"(lockval), "+m"(*lockword)
-:		"r"(lockword)
-:		"memory");
-	return (lockval == 0);
-}
-
-#endif /* __GNUC__ */
-
-#define S_UNLOCK(lock)	(*TAS_ACTIVE_WORD(lock) = -1)
-
-#define S_INIT_LOCK(lock) \
-	do { \
-		volatile slock_t *lock_ = (lock); \
-		lock_->sema[0] = -1; \
-		lock_->sema[1] = -1; \
-		lock_->sema[2] = -1; \
-		lock_->sema[3] = -1; \
-	} while (0)
-
-#define S_LOCK_FREE(lock)	(*TAS_ACTIVE_WORD(lock) != 0)
-
-#endif	 /* __hppa || __hppa__ */
-
-
-#if defined(__hpux) && defined(__ia64) && !defined(__GNUC__)
-/*
- * HP-UX on Itanium, non-gcc compiler
- *
- * We assume that the compiler enforces strict ordering of loads/stores on
- * volatile data (see comments on the gcc-version earlier in this file).
- * Note that this assumption does *not* hold if you use the
- * +Ovolatile=__unordered option on the HP-UX compiler, so don't do that.
- *
- * See also Implementing Spinlocks on the Intel Itanium Architecture and
- * PA-RISC, by Tor Ekqvist and David Graves, for more information.  As of
- * this writing, version 1.0 of the manual is available at:
- * http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf
- */
-#define HAS_TEST_AND_SET
-
-typedef unsigned int slock_t;
-
-#include <ia64/sys/inline.h>
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
-/* On IA64, it's a win to use a non-locking test before the xchg proper */
-#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
-
-#endif	/* HPUX on IA64, non gcc */
-
-#if defined(_AIX)	/* AIX */
-/*
- * AIX (POWER)
- */
-#define HAS_TEST_AND_SET
-
-#include <sys/atomic_op.h>
-
-typedef int slock_t;
-
-#define TAS(lock)			_check_lock((slock_t *) (lock), 0, 1)
-#define S_UNLOCK(lock)		_clear_lock((slock_t *) (lock), 0)
-#endif	 /* _AIX */
-
-
-/* These are in s_lock.c */
-
-#if defined(__SUNPRO_C) && (defined(__i386) || defined(__x86_64__) || defined(__sparc__) || defined(__sparc))
-#define HAS_TEST_AND_SET
-
-#if defined(__i386) || defined(__x86_64__) || defined(__sparcv9) || defined(__sparcv8plus)
-typedef unsigned int slock_t;
-#else
-typedef unsigned char slock_t;
-#endif
-
-extern slock_t pg_atomic_cas(volatile slock_t *lock, slock_t with,
-									  slock_t cmp);
-
-#define TAS(a) (pg_atomic_cas((a), 1, 0) != 0)
-#endif
-
-
-#ifdef WIN32_ONLY_COMPILER
-typedef LONG slock_t;
-
-#define HAS_TEST_AND_SET
-#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
-
-#define SPIN_DELAY() spin_delay()
-
-/* If using Visual C++ on Win64, inline assembly is unavailable.
- * Use a _mm_pause instrinsic instead of rep nop.
- */
-#if defined(_WIN64)
-static __forceinline void
-spin_delay(void)
-{
-	_mm_pause();
-}
-#else
-static __forceinline void
-spin_delay(void)
-{
-	/* See comment for gcc code. Same code, MASM syntax */
-	__asm rep nop;
-}
-#endif
-
-#endif
-
-
-#endif	/* !defined(HAS_TEST_AND_SET) */
-
-
-/* Blow up if we didn't have any way to do spinlocks */
-#ifndef HAS_TEST_AND_SET
-#error PostgreSQL does not have native spinlock support on this platform.  To continue the compilation, rerun configure using --disable-spinlocks.  However, performance will be poor.  Please report this to pgsql-bugs@postgresql.org.
-#endif
-
-
-#else	/* !HAVE_SPINLOCKS */
-
-
-/*
- * Fake spinlock implementation using semaphores --- slow and prone
- * to fall foul of kernel limits on number of semaphores, so don't use this
- * unless you must!  The subroutines appear in spin.c.
- */
-typedef PGSemaphoreData slock_t;
-
-extern bool s_lock_free_sema(volatile slock_t *lock);
-extern void s_unlock_sema(volatile slock_t *lock);
-extern void s_init_lock_sema(volatile slock_t *lock);
-extern int	tas_sema(volatile slock_t *lock);
-
-#define S_LOCK_FREE(lock)	s_lock_free_sema(lock)
-#define S_UNLOCK(lock)	 s_unlock_sema(lock)
-#define S_INIT_LOCK(lock)	s_init_lock_sema(lock)
-#define TAS(lock)	tas_sema(lock)
-
-
-#endif	/* HAVE_SPINLOCKS */
-
-
-/*
- * Default Definitions - override these above as needed.
- */
-
-#if !defined(S_LOCK)
+typedef pg_atomic_flag slock_t;
+#define TAS(lock)			(!pg_atomic_test_set_flag(lock))
+#define TAS_SPIN(lock)		(pg_atomic_unlocked_test_flag(lock) ? 1 : TAS(lock))
+#define S_UNLOCK(lock)		pg_atomic_clear_flag(lock)
+#define S_INIT_LOCK(lock)	pg_atomic_init_flag(lock)
+#define SPIN_DELAY()		pg_spin_delay()
 #define S_LOCK(lock) \
 	(TAS(lock) ? s_lock((lock), __FILE__, __LINE__) : 0)
-#endif	 /* S_LOCK */
-
-#if !defined(S_LOCK_FREE)
-#define S_LOCK_FREE(lock)	(*(lock) == 0)
-#endif	 /* S_LOCK_FREE */
-
-#if !defined(S_UNLOCK)
-#define S_UNLOCK(lock)		(*((volatile slock_t *) (lock)) = 0)
-#endif	 /* S_UNLOCK */
-
-#if !defined(S_INIT_LOCK)
-#define S_INIT_LOCK(lock)	S_UNLOCK(lock)
-#endif	 /* S_INIT_LOCK */
-
-#if !defined(SPIN_DELAY)
-#define SPIN_DELAY()	((void) 0)
-#endif	 /* SPIN_DELAY */
-
-#if !defined(TAS)
-extern int	tas(volatile slock_t *lock);		/* in port/.../tas.s, or
-												 * s_lock.c */
-
-#define TAS(lock)		tas(lock)
-#endif	 /* TAS */
-
-#if !defined(TAS_SPIN)
-#define TAS_SPIN(lock)	TAS(lock)
-#endif	 /* TAS_SPIN */
-
 
 /*
  * Platform-independent out-of-line support routines
diff --git a/src/test/regress/expected/lock.out b/src/test/regress/expected/lock.out
index 0d7c3ba..fd27344 100644
--- a/src/test/regress/expected/lock.out
+++ b/src/test/regress/expected/lock.out
@@ -60,3 +60,11 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
+ test_atomic_ops 
+-----------------
+ t
+(1 row)
+
diff --git a/src/test/regress/input/create_function_1.source b/src/test/regress/input/create_function_1.source
index aef1518..1fded84 100644
--- a/src/test/regress/input/create_function_1.source
+++ b/src/test/regress/input/create_function_1.source
@@ -57,6 +57,11 @@ CREATE FUNCTION make_tuple_indirect (record)
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
 
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
+
 -- Things that shouldn't work:
 
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
diff --git a/src/test/regress/output/create_function_1.source b/src/test/regress/output/create_function_1.source
index 9761d12..9926c90 100644
--- a/src/test/regress/output/create_function_1.source
+++ b/src/test/regress/output/create_function_1.source
@@ -51,6 +51,10 @@ CREATE FUNCTION make_tuple_indirect (record)
         RETURNS record
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
 -- Things that shouldn't work:
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
     AS 'SELECT ''not an integer'';';
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 3bd8a15..1db996a 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -16,6 +16,7 @@
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/spi.h"
+#include "storage/atomics.h"
 #include "utils/builtins.h"
 #include "utils/geo_decls.h"
 #include "utils/rel.h"
@@ -40,6 +41,7 @@ extern int	oldstyle_length(int n, text *t);
 extern Datum int44in(PG_FUNCTION_ARGS);
 extern Datum int44out(PG_FUNCTION_ARGS);
 extern Datum make_tuple_indirect(PG_FUNCTION_ARGS);
+extern Datum test_atomic_ops(PG_FUNCTION_ARGS);
 
 #ifdef PG_MODULE_MAGIC
 PG_MODULE_MAGIC;
@@ -829,3 +831,214 @@ make_tuple_indirect(PG_FUNCTION_ARGS)
 
 	PG_RETURN_HEAPTUPLEHEADER(newtup->t_data);
 }
+
+
+static pg_atomic_flag global_flag = PG_ATOMIC_INIT_FLAG();
+
+static void
+test_atomic_flag(void)
+{
+	pg_atomic_flag flag;
+
+	/* check the global flag */
+	if (pg_atomic_unlocked_test_flag(&global_flag))
+		elog(ERROR, "global_flag: unduly set");
+
+	if (!pg_atomic_test_set_flag(&global_flag))
+		elog(ERROR, "global_flag: couldn't set");
+
+	if (pg_atomic_test_set_flag(&global_flag))
+		elog(ERROR, "global_flag: set spuriously #1");
+
+	pg_atomic_clear_flag(&global_flag);
+
+	if (!pg_atomic_test_set_flag(&global_flag))
+		elog(ERROR, "global_flag: couldn't set");
+
+	/* check the local flag */
+	pg_atomic_init_flag(&flag);
+
+	if (pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unduly set");
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+
+	if (pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: set spuriously #2");
+
+	pg_atomic_clear_flag(&flag);
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+}
+
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+static void
+test_atomic_uint32(void)
+{
+	pg_atomic_uint32 var;
+	uint32 expected;
+	int i;
+
+	pg_atomic_init_u32(&var, 0);
+
+	if (pg_atomic_read_u32(&var) != 0)
+		elog(ERROR, "atomic_read_u32() #1 wrong");
+
+	pg_atomic_write_u32(&var, 3);
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "atomic_read_u32() #2 wrong");
+
+	if (pg_atomic_fetch_add_u32(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u32() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u32(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u32() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u32(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u32() #1 wrong");
+
+	if (pg_atomic_add_fetch_u32(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u32() #0 wrong");
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u32(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u32() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 1000; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u32(&var, &expected, 1))
+			break;
+	}
+	if (i == 1000)
+		elog(ERROR, "atomic_compare_exchange_u32() never succeeded");
+	if (pg_atomic_read_u32(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u32() didn't set value properly");
+
+	pg_atomic_write_u32(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u32(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u32() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u32(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u32() #2 wrong");
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u32()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u32(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #1 wrong");
+
+	if (pg_atomic_fetch_and_u32(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #2 wrong: is %u",
+			 pg_atomic_read_u32(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u32(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #3 wrong");
+}
+#endif /* PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32 */
+
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+static void
+test_atomic_uint64(void)
+{
+	pg_atomic_uint64 var;
+	uint64 expected;
+	int i;
+
+	pg_atomic_init_u64(&var, 0);
+
+	if (pg_atomic_read_u64(&var) != 0)
+		elog(ERROR, "atomic_read_u64() #1 wrong");
+
+	pg_atomic_write_u64(&var, 3);
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "atomic_read_u64() #2 wrong");
+
+	if (pg_atomic_fetch_add_u64(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u64() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u64(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u64() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u64(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u64() #1 wrong");
+
+	if (pg_atomic_add_fetch_u64(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u64() #0 wrong");
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u64(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u64() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 100; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u64(&var, &expected, 1))
+			break;
+	}
+	if (i == 100)
+		elog(ERROR, "atomic_compare_exchange_u64() never succeeded");
+	if (pg_atomic_read_u64(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u64() didn't set value properly");
+
+	pg_atomic_write_u64(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u64(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u64() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u64(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u64() #2 wrong");
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u64()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u64(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #1 wrong");
+
+	if (pg_atomic_fetch_and_u64(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #2 wrong: is %u",
+			 pg_atomic_read_u64(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u64(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #3 wrong");
+}
+#endif /* PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64 */
+
+
+extern Datum test_atomic_ops(PG_FUNCTION_ARGS)
+{
+	test_atomic_flag();
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+	test_atomic_uint32();
+#endif
+#ifdef PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+	test_atomic_uint64();
+#endif
+
+	PG_RETURN_BOOL(true);
+}
diff --git a/src/test/regress/sql/lock.sql b/src/test/regress/sql/lock.sql
index dda212f..567e8bc 100644
--- a/src/test/regress/sql/lock.sql
+++ b/src/test/regress/sql/lock.sql
@@ -64,3 +64,8 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+
+
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
-- 
1.8.3.251.g1462b67

#21Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#20)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 11:38 AM, Andres Freund <andres@2ndquadrant.com> wrote:

/*-------------------------------------------------------------------------
*
* atomics.h
* Generic atomic operations support.
*
* Hardware and compiler dependent functions for manipulating memory
* atomically and dealing with cache coherency. Used to implement locking
* facilities and other concurrency safe constructs.
*
* There are three data types which can be manipulated atomically:
*
* * pg_atomic_flag - supports atomic test/clear operations, useful for
* implementing spinlocks and similar constructs.
*
* * pg_atomic_uint32 - unsigned 32bit integer that can be correctly manipulated
* by several processes at the same time.
*
* * pg_atomic_uint64 - optional 64bit variant of pg_atomic_uint32. Support
* can be tested by checking for PG_HAVE_64_BIT_ATOMICS.
*
* The values stored in these can only be accessed through the API provided in
* this file, i.e. it is not possible to write statements like `var +=
* 10`. Instead, for this example, one would have to use
* `pg_atomic_fetch_add_u32(&var, 10)`.
*
* This restriction has several reasons: Primarily, it prevents non-atomic
* math from being performed on these values, which wouldn't be concurrency
* safe. Secondly, it allows these types to be implemented on top of facilities -
* like C11's atomics - that don't provide direct access. Lastly, it serves as
* a useful hint as to which code should be looked at more carefully because of
* concurrency concerns.
*
* To be useful they usually will need to be placed in memory shared between
* processes or threads, most frequently by embedding them in structs. Be
* careful to align atomic variables to their own size!

What does that mean exactly?

* Before variables of one of these types can be used they need to be
* initialized using the type's initialization function (pg_atomic_init_flag,
* pg_atomic_init_u32 and pg_atomic_init_u64) or in case of global
* pg_atomic_flag variables using the PG_ATOMIC_INIT_FLAG macro. These
* initializations have to be performed before any (possibly concurrent)
* access is possible, otherwise the results are undefined.
*
* For several mathematical operations two variants exist: One that returns
* the old value before the operation was performed, and one that returns
* the new value. *_fetch_<op> variants will return the old value,
* *_<op>_fetch the new one.

Ugh. Do we really need this much complexity?

* Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
* whenever possible. Writing correct code using these facilities is hard.
*
* For an introduction to using memory barriers within the PostgreSQL backend,
* see src/backend/storage/lmgr/README.barrier

The extent to which these primitives themselves provide ordering
guarantees should be documented.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#21)
Re: better atomics - v0.2

On 2013-11-19 12:43:44 -0500, Robert Haas wrote:

* To be useful they usually will need to be placed in memory shared between
* processes or threads, most frequently by embedding them in structs. Be
* careful to align atomic variables to their own size!

What does that mean exactly?

The alignment part? A pg_atomic_flag needs to be aligned to
sizeof(pg_atomic_flag), and pg_atomic_uint32/pg_atomic_uint64 to their
respective sizeof(). When embedding them inside a struct, the whole struct
needs to be allocated with at least that alignment.
Not sure how to describe that concisely.

I've added wrapper functions around the implementation that test for
alignment to make sure it's easy to spot such mistakes which could end
up having ugly results on some platforms.
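
For illustration, here is a minimal standalone sketch of the embedding rule
(the struct and field names are invented for the example, and
pg_atomic_uint32 is modelled as a plain wrapper so the snippet compiles on
its own; it is not the patch's definition):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* stand-in for the real pg_atomic_uint32, only so the sketch compiles */
typedef struct pg_atomic_uint32
{
	volatile uint32_t value;
} pg_atomic_uint32;

/* a struct living in shared memory that embeds an atomic counter */
typedef struct SharedCounters
{
	int				id;
	pg_atomic_uint32 nprocessed;	/* must end up 4-byte aligned */
} SharedCounters;

int
main(void)
{
	/*
	 * The member's offset must be a multiple of the atomic's size, and the
	 * allocation of the whole struct needs at least that alignment too.
	 */
	assert(offsetof(SharedCounters, nprocessed) % sizeof(uint32_t) == 0);
	return 0;
}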

* For several mathematical operations two variants exist: One that returns
* the old value before the operation was performed, and one that that returns
* the new value. *_fetch_<op> variants will return the old value,
* *_<op>_fetch the new one.

Ugh. Do we really need this much complexity?

Well, based on my experience it's really what you need when writing code
using it. The _<op>_fetch variants are implemented generically using the
_fetch_<op> operations, so leaving them out wouldn't win us much. What
would you like to remove?
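
The generic layering looks roughly like this (modelled here on the gcc
__sync builtin so the sketch stands alone; it is not the patch's exact
code):

#include <stdint.h>

typedef struct pg_atomic_uint32
{
	volatile uint32_t value;
} pg_atomic_uint32;

/* old-value variant, modelled directly on a gcc builtin */
static inline uint32_t
pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, int32_t add_)
{
	return __sync_fetch_and_add(&ptr->value, add_);
}

/* new-value variant, implemented generically on top of the old-value one */
static inline uint32_t
pg_atomic_add_fetch_u32(pg_atomic_uint32 *ptr, int32_t add_)
{
	return pg_atomic_fetch_add_u32(ptr, add_) + add_;
}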

* Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
* whenever possible. Writing correct code using these facilities is hard.
*
* For an introduction to using memory barriers within the PostgreSQL backend,
* see src/backend/storage/lmgr/README.barrier

The extent to which these primitives themselves provide ordering
guarantees should be documented.

Every function should have a comment to that effect, although two have
an XXX where I am not sure what we want to mandate, specifically whether
pg_atomic_test_set_flag() should be a full or acquire barrier and
whether pg_atomic_clear_flag should be full or release.
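
For reference, the way those flag operations would back a spinlock is
roughly the following (sketched with gcc's __atomic builtins, not the patch
itself); acquire on the set and release on the clear is enough to keep the
critical section contained, a full barrier would simply be stronger:

#include <stdbool.h>

static bool locked;				/* false means the lock is free */

static void
flag_lock(void)
{
	/* acquire: the critical section cannot be reordered before the set */
	while (__atomic_test_and_set(&locked, __ATOMIC_ACQUIRE))
		;						/* spin */
}

static void
flag_unlock(void)
{
	/* release: the critical section cannot be reordered after the clear */
	__atomic_clear(&locked, __ATOMIC_RELEASE);
}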

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#17)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:

On 2013-11-19 10:30:24 -0500, Tom Lane wrote:

I don't have an informed opinion about requiring inline support
(although it would surely be nice).

inline is C99, and we've generally resisted requiring C99 features.
Maybe it's time to move that goalpost, and maybe not.

But it's a part of C99 that was very widely implemented before, so even
if we don't want to rely on C99 in its entirety, relying on inline
support is realistic.

I think, independent from atomics, the readability & maintainability win
by relying on inline functions instead of long macros, potentially with
multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
significant.

Oh, man, my fastgetattr() macro is going to be simplified. All my good
work gets rewritten. ;-)

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Andres Freund
andres@2ndquadrant.com
In reply to: Bruce Momjian (#23)
Re: better atomics - v0.2

On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote:

On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:

On 2013-11-19 10:30:24 -0500, Tom Lane wrote:

I don't have an informed opinion about requiring inline support
(although it would surely be nice).

inline is C99, and we've generally resisted requiring C99 features.
Maybe it's time to move that goalpost, and maybe not.

But it's a part of C99 that was very widely implemented before, so even
if we don't want to rely on C99 in its entirety, relying on inline
support is realistic.

I think, independent from atomics, the readability & maintainability win
by relying on inline functions instead of long macros, potentially with
multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
significant.

Oh, man, my fastgetattr() macro is going to be simplified. All my good
work gets rewritten. ;-)

That and HeapKeyTest() alone are sufficient reason for this ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#24)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote:

On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote:

On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:

On 2013-11-19 10:30:24 -0500, Tom Lane wrote:

I don't have an informed opinion about requiring inline support
(although it would surely be nice).

inline is C99, and we've generally resisted requiring C99 features.
Maybe it's time to move that goalpost, and maybe not.

But it's a part of C99 that was very widely implemented before, so even
if we don't want to rely on C99 in its entirety, relying on inline
support is realistic.

I think, independent from atomics, the readability & maintainability win
by relying on inline functions instead of long macros, potentially with
multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
significant.

Oh, man, my fastgetattr() macro is going to be simplified. All my good
work gets rewritten. ;-)

That and HeapKeyTest() alone are sufficient reason for this ;)

Has there been any performance testing on this rewrite to use atomics?
If so, I missed it.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Andres Freund
andres@2ndquadrant.com
In reply to: Bruce Momjian (#25)
Re: better atomics - v0.2

On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote:

On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote:

On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote:

On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:

On 2013-11-19 10:30:24 -0500, Tom Lane wrote:

I don't have an informed opinion about requiring inline support
(although it would surely be nice).

inline is C99, and we've generally resisted requiring C99 features.
Maybe it's time to move that goalpost, and maybe not.

But it's a part of C99 that was very widely implemented before, so even
if we don't want to rely on C99 in its entirety, relying on inline
support is realistic.

I think, independent from atomics, the readability & maintainability win
by relying on inline functions instead of long macros, potentially with
multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
significant.

Oh, man, my fastgetattr() macro is going to be simplified. All my good
work gets rewritten. ;-)

That and HeapKeyTest() alone are sufficient reason for this ;)

Has there been any performance testing on this rewrite to use atomics?
If so, I missed it.

Do you mean inline? Or atomics? If the former no, if the latter
yes. I've started on it because of
http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#26)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 11:21:06PM +0100, Andres Freund wrote:

On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote:

On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote:

On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote:

On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:

On 2013-11-19 10:30:24 -0500, Tom Lane wrote:

I don't have an informed opinion about requiring inline support
(although it would surely be nice).

inline is C99, and we've generally resisted requiring C99 features.
Maybe it's time to move that goalpost, and maybe not.

But it's a part of C99 that was very widely implemented before, so even
if we don't want to rely on C99 in its entirety, relying on inline
support is realistic.

I think, independent from atomics, the readability & maintainability win
by relying on inline functions instead of long macros, potentially with
multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
significant.

Oh, man, my fastgetattr() macro is going to be simplified. All my good
work gets rewritten. ;-)

That and HeapKeyTest() alone are sufficient reason for this ;)

Has there been any performance testing on this rewrite to use atomics?
If so, I missed it.

Do you mean inline? Or atomics? If the former no, if the latter
yes. I've started on it because of
http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de

Yes, I was wondering about atomics. I think we know the performance
characteristics of inlining.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28Andres Freund
andres@2ndquadrant.com
In reply to: Bruce Momjian (#27)
Re: better atomics - v0.2

On 2013-11-19 17:25:21 -0500, Bruce Momjian wrote:

On Tue, Nov 19, 2013 at 11:21:06PM +0100, Andres Freund wrote:

On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote:

On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote:

Do you mean inline? Or atomics? If the former no, if the latter
yes. I've started on it because of
http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de

Yes, I was wondering about atomics. I think we know the performance
characteristics of inlining.

In that case it really depends on what we do with the atomics, providing
the abstraction itself shouldn't change performance at all, but the
possible scalability wins of using them in some critical paths are
pretty huge.

From a prototype I have, on top of the numbers in the above post, removing
the spinlock from PinBuffer() and UnpinBuffer() (only the
!BM_PIN_COUNT_WAITER path) gives another factor of two on the machine
used, although that required limiting MAX_BACKENDS to 0xfffff instead
of the current 0x7fffff.
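
The rough shape of that prototype is a compare-and-swap loop over a single
packed state word; this is a from-memory sketch with invented bit positions
(20 bits of refcount is what would push MAX_BACKENDS down to 0xfffff), not
the actual prototype code:

#include <stdbool.h>
#include <stdint.h>

#define BUF_REFCOUNT_MASK	0x000fffffU		/* low 20 bits: refcount (illustrative) */
#define BM_PIN_COUNT_WAITER	0x00100000U		/* flag bit, position invented here */

typedef struct BufferStateLike
{
	volatile uint32_t state;	/* refcount and flag bits packed together */
} BufferStateLike;

/*
 * Try to pin without taking the buffer spinlock; returns false when the
 * waiter flag is set, i.e. when the caller has to fall back to the old,
 * spinlock-protected path.
 */
static bool
pin_buffer_sketch(BufferStateLike *buf)
{
	uint32_t	old_state = buf->state;

	for (;;)
	{
		uint32_t	new_state;
		uint32_t	seen;

		if (old_state & BM_PIN_COUNT_WAITER)
			return false;

		/* the refcount lives in the low BUF_REFCOUNT_MASK bits */
		new_state = old_state + 1;

		/* the __sync CAS returns the value it found; retry until we win */
		seen = __sync_val_compare_and_swap(&buf->state, old_state, new_state);
		if (seen == old_state)
			return true;
		old_state = seen;
	}
}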

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: better atomics - v0.2

On 2013-11-19 10:37:35 -0500, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

The only animal we have that doesn't support quiet inlines today is
HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
to simply suppress the warning there.

Or just not worry about it, if it's only a warning?

So, my suggestion on that end is that we remove the requirement for
quiet inline now and see if it that has any negative consequences on the
buildfarm for a week or so. Imo that's a good idea regardless of us
relying on inlining support.
Does anyone have anything against that plan? If not, I'll prepare a
patch.

Or does the warning
mean code bloat (lots of useless copies of the inline function)?

After thinking on that for a bit, yes that's a possible consequence, but
it's quite possible that it happens in cases where we don't get the
warning too, so I don't think that argument has too much bearing as I
don't recall a complaint about it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#5)
Re: better atomics - v0.2

This patch didn't make it out of the 2013-11 commit fest. You should
move it to the next commit fest (probably with an updated patch)
before January 15th.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#29)
Re: better atomics - v0.2

On Thu, Dec 5, 2013 at 6:39 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-11-19 10:37:35 -0500, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

The only animal we have that doesn't support quiet inlines today is
HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
to simply suppress the warning there.

Or just not worry about it, if it's only a warning?

So, my suggestion on that end is that we remove the requirement for
quiet inline now and see if it that has any negative consequences on the
buildfarm for a week or so. Imo that's a good idea regardless of us
relying on inlining support.
Does anyone have anything against that plan? If not, I'll prepare a
patch.

Or does the warning
mean code bloat (lots of useless copies of the inline function)?

After thinking on that for a bit, yes that's a possible consequence, but
it's quite possible that it happens in cases where we don't get the
warning too, so I don't think that argument has too much bearing as I
don't recall a complaint about it.

But I bet the warning we're getting there is complaining about exactly
that thing. It's possible that it means "warning: i've optimized away
your unused function into nothing" but I think it's more likely that
it means "warning: i just emitted dead code". Indeed, I suspect that
this is why that test is written that way in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#31)
Re: better atomics - v0.2

On 2013-12-23 15:04:08 -0500, Robert Haas wrote:

On Thu, Dec 5, 2013 at 6:39 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-11-19 10:37:35 -0500, Tom Lane wrote:

Or does the warning
mean code bloat (lots of useless copies of the inline function)?

After thinking on that for a bit, yes that's a possible consequence, but
it's quite possible that it happens in cases where we don't get the
warning too, so I don't think that argument has too much bearing as I
don't recall a complaint about it.

But I bet the warning we're getting there is complaining about exactly
that thing. It's possible that it means "warning: i've optimized away
your unused function into nothing" but I think it's more likely that
it means "warning: i just emitted dead code". Indeed, I suspect that
this is why that test is written that way in the first place.

I don't think that's particularly likely - remember, we're talking about
static inline functions here. So the compiler would have to detect that
there's an inline function which won't be used and then explicitly emit
a function body. That seems like an odd thing to do.
So, if there are compilers like that around, IMO they have to deal with their
own problems. It's not like it will prohibit using postgres.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33Marti Raudsepp
marti@juffo.org
In reply to: Andres Freund (#20)
Re: better atomics - v0.2

On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:

The attached patches compile and make check successfully on linux x86,
amd64 and msvc x86 and amd64. This time and updated configure is
included. The majority of operations still rely on CAS emulation.

Note that the atomics-generic-acc.h file has ® characters in the
Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird
characters, I think you should make sure they're UTF-8

Regards,
Marti

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34Robert Haas
robertmhaas@gmail.com
In reply to: Marti Raudsepp (#33)
Re: better atomics - v0.2

On Sun, Jan 19, 2014 at 2:43 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:

The attached patches compile and make check successfully on linux x86,
amd64 and msvc x86 and amd64. This time and updated configure is
included. The majority of operations still rely on CAS emulation.

Note that the atomics-generic-acc.h file has ® characters in the
Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird
characters, I think you should make sure they're UTF-8

I think we're better off avoiding them altogether. What's wrong with (R)?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#34)
Re: better atomics - v0.2

On 2014-01-21 10:20:35 -0500, Robert Haas wrote:

On Sun, Jan 19, 2014 at 2:43 PM, Marti Raudsepp <marti@juffo.org> wrote:

On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:

The attached patches compile and make check successfully on linux x86,
amd64 and msvc x86 and amd64. This time and updated configure is
included. The majority of operations still rely on CAS emulation.

Note that the atomics-generic-acc.h file has ® characters in the
Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird
characters, I think you should make sure they're UTF-8

I think we're better off avoiding them altogether. What's wrong with (R)?

Nothing at all. That was just the copied title from the pdf, it's gone
now.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#1)
1 attachment(s)
better atomics - v0.5

Hi,

[sorry for the second copy Robert]

Attached is a new version of the atomic operations patch. Lots has
changed since the last post:

* gcc, msvc work. acc, xlc, sunpro have blindly written support which
should be relatively easy to fix up. All try to implement TAS, 32 bit
atomic add, 32 bit compare exchange; some do it for 64bit as well.

* x86 gcc inline assembly means that we support every gcc version on x86
>= i486; even when the __sync APIs aren't available yet.

* 'inline' support isn't required anymore. We fall back to
ATOMICS_INCLUDE_DEFINITIONS/STATIC_IF_INLINE etc. trickery; a rough sketch
of that pattern follows after this list.

* When the current platform doesn't have support for atomic operations a
spinlock backed implementation is used. This can be forced using
--disable-atomics.

That even works when semaphores are used to implement spinlocks (a
separate array is used there to avoid nesting problems). In contrast
to an earlier implementation, this even works when atomics are mapped
to different addresses in individual processes (think dsm).

* s_lock.h falls back to the atomics.h provided APIs iff it doesn't have
native support for the current platform. This can be forced by
defining USE_ATOMICS_BASED_SPINLOCKS. Thanks to the generic compiler
intrinsics based implementations, that should make it easier to bring
up postgres on different platforms.

* I tried to improve the documentation of the facilities in
src/include/storage/atomics.h, including documentation of the various
barrier semantics.

* There are tests in regress.c that get called via a SQL function from the
regression tests.

* Lots of other details changed, but those are the major pieces.
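
The STATIC_IF_INLINE fallback mentioned above works roughly like this (the
exact macro spellings in the patch may differ, and PG_USE_INLINE is assumed
here to be the configure-set symbol; the snippet uses a plain uint32_t so it
stands alone):

#include <stdint.h>

/* PG_USE_INLINE: assumed to be set by configure when quiet "static inline" works */

#ifdef PG_USE_INLINE
#define STATIC_IF_INLINE static inline
#else
#define STATIC_IF_INLINE
/* the header only declares the function; the body is compiled exactly once */
extern uint32_t atomic_read_sketch(const volatile uint32_t *ptr);
#endif

#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
/*
 * Inlined into every caller, or emitted once by the .c file that defines
 * ATOMICS_INCLUDE_DEFINITIONS before including the header.
 */
STATIC_IF_INLINE uint32_t
atomic_read_sketch(const volatile uint32_t *ptr)
{
	return *ptr;
}
#endif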

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Basic-atomic-ops-implementation.patchtext/x-patch; charset=iso-8859-1Download
>From aae0351036bf8e66957387cca0cc91164e1ced53 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 31 Jan 2014 10:53:20 +0100
Subject: [PATCH] Basic atomic ops implementation.

Only gcc and msvc have been tested; stub implementations for
* HP-UX acc
* IBM xlc
* Sun/Oracle Sunpro compiler
are included.
---
 config/c-compiler.m4                             | 111 +++++
 configure                                        | 239 ++++++++--
 configure.in                                     |  31 +-
 src/backend/main/main.c                          |   6 -
 src/backend/storage/lmgr/Makefile                |   2 +-
 src/backend/storage/lmgr/atomics.c               | 127 +++++
 src/backend/storage/lmgr/spin.c                  |  22 +-
 src/include/c.h                                  |   7 +
 src/include/pg_config.h.in                       |  18 +-
 src/include/pg_config.h.win32                    |   3 +
 src/include/pg_config_manual.h                   |  15 +
 src/include/storage/atomics-arch-alpha.h         |  25 +
 src/include/storage/atomics-arch-amd64.h         | 184 ++++++++
 src/include/storage/atomics-arch-arm.h           |  29 ++
 src/include/storage/atomics-arch-hppa.h          |  17 +
 src/include/storage/atomics-arch-i386.h          | 162 +++++++
 src/include/storage/atomics-arch-ia64.h          |  27 ++
 src/include/storage/atomics-arch-ppc.h           |  25 +
 src/include/storage/atomics-fallback.h           | 129 ++++++
 src/include/storage/atomics-generic-acc.h        |  97 ++++
 src/include/storage/atomics-generic-gcc.h        | 218 +++++++++
 src/include/storage/atomics-generic-msvc.h       |  90 ++++
 src/include/storage/atomics-generic-sunpro.h     |  87 ++++
 src/include/storage/atomics-generic-xlc.h        |  97 ++++
 src/include/storage/atomics-generic.h            | 414 +++++++++++++++++
 src/include/storage/atomics.h                    | 564 +++++++++++++++++++++++
 src/include/storage/barrier.h                    | 164 +------
 src/include/storage/lwlock.h                     |   1 +
 src/include/storage/s_lock.h                     |  25 +-
 src/test/regress/expected/lock.out               |   8 +
 src/test/regress/input/create_function_1.source  |   5 +
 src/test/regress/output/create_function_1.source |   4 +
 src/test/regress/regress.c                       | 225 +++++++++
 src/test/regress/sql/lock.sql                    |   5 +
 34 files changed, 2954 insertions(+), 229 deletions(-)
 create mode 100644 src/backend/storage/lmgr/atomics.c
 create mode 100644 src/include/storage/atomics-arch-alpha.h
 create mode 100644 src/include/storage/atomics-arch-amd64.h
 create mode 100644 src/include/storage/atomics-arch-arm.h
 create mode 100644 src/include/storage/atomics-arch-hppa.h
 create mode 100644 src/include/storage/atomics-arch-i386.h
 create mode 100644 src/include/storage/atomics-arch-ia64.h
 create mode 100644 src/include/storage/atomics-arch-ppc.h
 create mode 100644 src/include/storage/atomics-fallback.h
 create mode 100644 src/include/storage/atomics-generic-acc.h
 create mode 100644 src/include/storage/atomics-generic-gcc.h
 create mode 100644 src/include/storage/atomics-generic-msvc.h
 create mode 100644 src/include/storage/atomics-generic-sunpro.h
 create mode 100644 src/include/storage/atomics-generic-xlc.h
 create mode 100644 src/include/storage/atomics-generic.h
 create mode 100644 src/include/storage/atomics.h

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 802f553..1cb8979 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -300,3 +300,114 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_PROG_CC_LDFLAGS_OPT
+
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_ATOMICS],
+[AC_CACHE_CHECK(for __builtin_constant_p, pgac_cv__builtin_constant_p,
+[AC_TRY_COMPILE([static int x; static int y[__builtin_constant_p(x) ? x : 1];],
+[],
+[pgac_cv__builtin_constant_p=yes],
+[pgac_cv__builtin_constant_p=no])])
+if test x"$pgac_cv__builtin_constant_p" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_CONSTANT_P, 1,
+          [Define to 1 if your compiler understands __builtin_constant_p.])
+fi])# PGAC_C_BUILTIN_CONSTANT_P
+
+# PGAC_HAVE_GCC__SYNC_CHAR_TAS
+# -------------------------
+# Check if the C compiler understands __sync_lock_test_and_set(char),
+# and define HAVE_GCC__SYNC_CHAR_TAS
+#
+# NB: There are platforms where test_and_set is available but compare_and_swap
+# is not, so test this separately.
+# NB: Some platforms only do 32bit tas, others only do 8bit tas. Test both.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_CHAR_TAS],
+[AC_CACHE_CHECK(for builtin locking functions, pgac_cv_gcc_sync_char_tas,
+[AC_TRY_LINK([],
+  [char lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);],
+  [pgac_cv_gcc_sync_char_tas="yes"],
+  [pgac_cv_gcc_sync_char_tas="no"])])
+if test x"$pgac_cv_gcc_sync_char_tas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_CHAR_TAS, 1, [Define to 1 if you have __sync_lock_test_and_set(char *) and friends.])
+fi])# PGAC_HAVE_GCC__SYNC_CHAR_TAS
+
+# PGAC_HAVE_GCC__SYNC_INT32_TAS
+# -------------------------
+# Check if the C compiler understands __sync_lock_test_and_set(),
+# and define HAVE_GCC__SYNC_INT32_TAS
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_TAS],
+[AC_CACHE_CHECK(for builtin locking functions, pgac_cv_gcc_sync_int32_tas,
+[AC_TRY_LINK([],
+  [int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);],
+  [pgac_cv_gcc_sync_int32_tas="yes"],
+  [pgac_cv_gcc_sync_int32_tas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_TAS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_TAS
+
+# PGAC_HAVE_GCC__SYNC_INT32_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 32bit
+# types, and define HAVE_GCC__SYNC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __sync int32 atomic operations, pgac_cv_gcc_sync_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);],
+  [pgac_cv_gcc_sync_int32_cas="yes"],
+  [pgac_cv_gcc_sync_int32_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int *, int, int).])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_CAS
+
+# PGAC_HAVE_GCC__SYNC_INT64_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 64bit
+# types, and define HAVE_GCC__SYNC_INT64_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT64_CAS],
+[AC_CACHE_CHECK(for builtin __sync int64 atomic operations, pgac_cv_gcc_sync_int64_cas,
+[AC_TRY_LINK([],
+  [PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);],
+  [pgac_cv_gcc_sync_int64_cas="yes"],
+  [pgac_cv_gcc_sync_int64_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT64_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64).])
+fi])# PGAC_HAVE_GCC__SYNC_INT64_CAS
+
+# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+# -------------------------
+
+# Check if the C compiler understands __atomic_compare_exchange_n() for 32bit
+# types, and define HAVE_GCC__ATOMIC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __atomic int32 atomic operations, pgac_cv_gcc_atomic_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);],
+  [pgac_cv_gcc_atomic_int32_cas="yes"],
+  [pgac_cv_gcc_atomic_int32_cas="no"])])
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT32_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+
+# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
+# -------------------------
+
+# Check if the C compiler understands __atomic_compare_exchange_n() for 64bit
+# types, and define HAVE_GCC__ATOMIC_INT64_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT64_CAS],
+[AC_CACHE_CHECK(for builtin __atomic int64 atomic operations, pgac_cv_gcc_atomic_int64_cas,
+[AC_TRY_LINK([],
+  [PG_INT64_TYPE val = 0;
+   PG_INT64_TYPE expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);],
+  [pgac_cv_gcc_atomic_int64_cas="yes"],
+  [pgac_cv_gcc_atomic_int64_cas="no"])])
+if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT64_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int64 *, int *, int64).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
diff --git a/configure b/configure
index da89c69..6b7e278 100755
--- a/configure
+++ b/configure
@@ -802,6 +802,7 @@ enable_nls
 with_pgport
 enable_rpath
 enable_spinlocks
+enable_atomics
 enable_debug
 enable_profiling
 enable_coverage
@@ -1470,6 +1471,7 @@ Optional Features:
   --disable-rpath         do not embed shared library search path in
                           executables
   --disable-spinlocks     do not use spinlocks
+  --disable-atomics       do not use atomic operations
   --enable-debug          build with debugging symbols (-g)
   --enable-profiling      build with profiling enabled
   --enable-coverage       build with coverage testing instrumentation
@@ -3147,6 +3149,33 @@ fi
 
 
 #
+# Atomic operations
+#
+
+
+# Check whether --enable-atomics was given.
+if test "${enable_atomics+set}" = set; then :
+  enableval=$enable_atomics;
+  case $enableval in
+    yes)
+      :
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --enable-atomics option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  enable_atomics=yes
+
+fi
+
+
+
+#
 # --enable-debug adds -g to compiler flags
 #
 
@@ -8350,6 +8379,17 @@ $as_echo "$as_me: WARNING:
 *** Not using spinlocks will cause poor performance." >&2;}
 fi
 
+if test "$enable_atomics" = yes; then
+
+$as_echo "#define HAVE_ATOMICS 1" >>confdefs.h
+
+else
+  { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
+*** Not using atomic operations will cause poor performance." >&5
+$as_echo "$as_me: WARNING:
+*** Not using atomic operations will cause poor performance." >&2;}
+fi
+
 if test "$with_gssapi" = yes ; then
   if test "$PORTNAME" != "win32"; then
     { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing gss_init_sec_context" >&5
@@ -12104,40 +12144,6 @@ fi
 done
 
 
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin locking functions" >&5
-$as_echo_n "checking for builtin locking functions... " >&6; }
-if ${pgac_cv_gcc_int_atomics+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_gcc_int_atomics="yes"
-else
-  pgac_cv_gcc_int_atomics="no"
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_int_atomics" >&5
-$as_echo "$pgac_cv_gcc_int_atomics" >&6; }
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-
-$as_echo "#define HAVE_GCC_INT_ATOMICS 1" >>confdefs.h
-
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -13807,6 +13813,171 @@ _ACEOF
 fi
 
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin locking functions" >&5
+$as_echo_n "checking for builtin locking functions... " >&6; }
+if ${pgac_cv_gcc_sync_int32_tas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int32_tas="yes"
+else
+  pgac_cv_gcc_sync_int32_tas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int32_tas" >&5
+$as_echo "$pgac_cv_gcc_sync_int32_tas" >&6; }
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT32_TAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync int32 atomic operations" >&5
+$as_echo_n "checking for builtin __sync int32 atomic operations... " >&6; }
+if ${pgac_cv_gcc_sync_int32_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int32_cas="yes"
+else
+  pgac_cv_gcc_sync_int32_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int32_cas" >&5
+$as_echo "$pgac_cv_gcc_sync_int32_cas" >&6; }
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT32_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync int64 atomic operations" >&5
+$as_echo_n "checking for builtin __sync int64 atomic operations... " >&6; }
+if ${pgac_cv_gcc_sync_int64_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int64_cas="yes"
+else
+  pgac_cv_gcc_sync_int64_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int64_cas" >&5
+$as_echo "$pgac_cv_gcc_sync_int64_cas" >&6; }
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT64_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __atomic int32 atomic operations" >&5
+$as_echo_n "checking for builtin __atomic int32 atomic operations... " >&6; }
+if ${pgac_cv_gcc_atomic_int32_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_atomic_int32_cas="yes"
+else
+  pgac_cv_gcc_atomic_int32_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_atomic_int32_cas" >&5
+$as_echo "$pgac_cv_gcc_atomic_int32_cas" >&6; }
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__ATOMIC_INT32_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __atomic int64 atomic operations" >&5
+$as_echo_n "checking for builtin __atomic int64 atomic operations... " >&6; }
+if ${pgac_cv_gcc_atomic_int64_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+PG_INT64_TYPE val = 0;
+   PG_INT64_TYPE expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_atomic_int64_cas="yes"
+else
+  pgac_cv_gcc_atomic_int64_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_atomic_int64_cas" >&5
+$as_echo "$pgac_cv_gcc_atomic_int64_cas" >&6; }
+if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__ATOMIC_INT64_CAS 1" >>confdefs.h
+
+fi
 
 if test "$PORTNAME" != "win32"
 then
diff --git a/configure.in b/configure.in
index 7dfbe9f..743fad5 100644
--- a/configure.in
+++ b/configure.in
@@ -180,6 +180,12 @@ PGAC_ARG_BOOL(enable, spinlocks, yes,
               [do not use spinlocks])
 
 #
+# Atomic operations
+#
+PGAC_ARG_BOOL(enable, atomics, yes,
+              [do not use atomic operations])
+
+#
 # --enable-debug adds -g to compiler flags
 #
 PGAC_ARG_BOOL(enable, debug, no,
@@ -937,6 +943,13 @@ else
 *** Not using spinlocks will cause poor performance.])
 fi
 
+if test "$enable_atomics" = yes; then
+  AC_DEFINE(HAVE_ATOMICS, 1, [Define to 1 if you have atomics.])
+else
+  AC_MSG_WARN([
+*** Not using atomic operations will cause poor performance.])
+fi
+
 if test "$with_gssapi" = yes ; then
   if test "$PORTNAME" != "win32"; then
     AC_SEARCH_LIBS(gss_init_sec_context, [gssapi_krb5 gss 'gssapi -lkrb5 -lcrypto'], [],
@@ -1467,17 +1480,6 @@ fi
 AC_CHECK_FUNCS([strtoll strtoq], [break])
 AC_CHECK_FUNCS([strtoull strtouq], [break])
 
-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
-  [int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);],
-  [pgac_cv_gcc_int_atomics="yes"],
-  [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -1760,6 +1762,13 @@ AC_CHECK_TYPES([int8, uint8, int64, uint64], [], [],
 # C, but is missing on some old platforms.
 AC_CHECK_TYPES(sig_atomic_t, [], [], [#include <signal.h>])
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+PGAC_HAVE_GCC__SYNC_INT32_TAS
+PGAC_HAVE_GCC__SYNC_INT32_CAS
+PGAC_HAVE_GCC__SYNC_INT64_CAS
+PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+PGAC_HAVE_GCC__ATOMIC_INT64_CAS
 
 if test "$PORTNAME" != "win32"
 then
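
As a quick way to see what the new checks actually probe for: they amount to link-testing a program along the following lines, one feature at a time (a sketch; int64_t stands in for whatever PG_INT64_TYPE configure selected). If an individual test fails to link, only the corresponding HAVE_GCC__* symbol stays undefined and the headers below fall back accordingly.

#include <stdint.h>

int
main(void)
{
	int		lock = 0;
	int		val32 = 0;
	int		exp32 = 0;
	int64_t	val64 = 0;
	int64_t	exp64 = 0;

	/* HAVE_GCC__SYNC_INT32_TAS */
	__sync_lock_test_and_set(&lock, 1);
	__sync_lock_release(&lock);

	/* HAVE_GCC__SYNC_INT32_CAS */
	__sync_val_compare_and_swap(&val32, 0, 37);

	/* HAVE_GCC__SYNC_INT64_CAS */
	__sync_val_compare_and_swap(&val64, 0, (int64_t) 37);

	/* HAVE_GCC__ATOMIC_INT32_CAS */
	__atomic_compare_exchange_n(&val32, &exp32, 37, 0,
								__ATOMIC_SEQ_CST, __ATOMIC_RELAXED);

	/* HAVE_GCC__ATOMIC_INT64_CAS */
	__atomic_compare_exchange_n(&val64, &exp64, 37, 0,
								__ATOMIC_SEQ_CST, __ATOMIC_RELAXED);

	return 0;
}
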
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index c6fb8c9..a46c5b3 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -290,12 +290,6 @@ startup_hacks(const char *progname)
 		SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
 	}
 #endif   /* WIN32 */
-
-	/*
-	 * Initialize dummy_spinlock, in case we are on a platform where we have
-	 * to use the fallback implementation of pg_memory_barrier().
-	 */
-	SpinLockInit(&dummy_spinlock);
 }
 
 
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e12a854..be95d50 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/storage/lmgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = lmgr.o lock.o proc.o deadlock.o lwlock.o spin.o s_lock.o predicate.o
+OBJS = atomics.o lmgr.o lock.o proc.o deadlock.o lwlock.o spin.o s_lock.o predicate.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/storage/lmgr/atomics.c b/src/backend/storage/lmgr/atomics.c
new file mode 100644
index 0000000..7b66e87
--- /dev/null
+++ b/src/backend/storage/lmgr/atomics.c
@@ -0,0 +1,127 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.c
+ *	   Non-Inline parts of the atomics implementation
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/lmgr/atomics.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/*
+ * We want the functions below to be inline; but if the compiler doesn't
+ * support that, fall back on providing them as regular functions.	See
+ * STATIC_IF_INLINE in c.h.
+ */
+#define ATOMICS_INCLUDE_DEFINITIONS
+
+#include "storage/atomics.h"
+#include "storage/spin.h"
+
+#ifdef PG_HAVE_MEMORY_BARRIER_EMULATION
+void
+pg_spinlock_barrier(void)
+{
+	S_LOCK(&dummy_spinlock);
+	S_UNLOCK(&dummy_spinlock);
+}
+#endif
+
+
+#ifdef PG_HAVE_ATOMIC_FLAG_SIMULATION
+
+void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	StaticAssertStmt(sizeof(ptr->sema) >= sizeof(slock_t),
+					 "size mismatch of atomic_flag vs slock_t");
+
+#ifndef HAVE_SPINLOCKS
+	/*
+	 * NB: If we're using semaphore based TAS emulation, be careful to use a
+	 * separate set of semaphores. Otherwise we'd get in trouble if a atomic
+	 * separate set of semaphores. Otherwise we'd get in trouble if an atomic
+	 * variable were manipulated while a spinlock is held.
+	s_init_lock_sema((slock_t *) &ptr->sema, true);
+#else
+	SpinLockInit((slock_t *) &ptr->sema);
+#endif
+}
+
+bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return TAS((slock_t *) &ptr->sema);
+}
+
+void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	S_UNLOCK((slock_t *) &ptr->sema);
+}
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+#ifdef PG_HAVE_ATOMIC_U32_SIMULATION
+void
+pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
+{
+	StaticAssertStmt(sizeof(ptr->sema) >= sizeof(slock_t),
+					 "size mismatch of atomic_uint32 vs slock_t");
+
+	/*
+	 * If we're using semaphore based atomic flags, be careful about nested
+	 * usage of atomics while a spinlock is held.
+	 */
+#ifndef HAVE_SPINLOCKS
+	s_init_lock_sema((slock_t *) &ptr->sema, true);
+#else
+	SpinLockInit((slock_t *) &ptr->sema);
+#endif
+	ptr->value = val_;
+}
+
+bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool ret;
+	/*
+	 * Do atomic op under a spinlock. It might look like we could just skip
+	 * the cmpxchg if the lock isn't available, but that'd just emulate a
+	 * 'weak' compare and swap. I.e. one that allows spurious failures. Since
+	 * several algorithms rely on a strong variant and that is efficiently
+	 * implementable on most major architectures let's emulate it here as
+	 * well.
+	 */
+	SpinLockAcquire((slock_t *) &ptr->sema);
+
+	/* perform compare/exchange logic */
+	ret = ptr->value == *expected;
+	*expected = ptr->value;
+	if (ret)
+		ptr->value = newval;
+
+	/* and release lock */
+	SpinLockRelease((slock_t *) &ptr->sema);
+
+	return ret;
+}
+
+uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 oldval;
+	SpinLockAcquire((slock_t *) &ptr->sema);
+	oldval = ptr->value;
+	ptr->value += add_;
+	SpinLockRelease((slock_t *) &ptr->sema);
+	return oldval;
+}
+
+#endif /* PG_HAVE_ATOMIC_U32_SIMULATION */
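
To make it concrete why the emulation bothers with a strong compare-and-swap: callers typically build other read-modify-write operations as a retry loop along these lines (a sketch calling the _impl function above directly for brevity; real callers would go through the wrappers in atomics.h):

/* atomically OR bits into *ptr, returning the value seen beforehand */
static uint32
fetch_or_u32_sketch(volatile pg_atomic_uint32 *ptr, uint32 or_)
{
	uint32	old = ptr->value;	/* unlocked read as the first guess */

	/* on failure the current value is copied back into 'old', so just retry */
	while (!pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_))
		;

	return old;
}

This particular loop would survive a weak CAS as well, but algorithms that treat a CAS failure as meaningful (e.g. exactly one backend is allowed to win) would not, which is why the comment above insists on the strong variant.
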
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index 9b71744..4c0c67b 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -65,7 +65,7 @@ SpinlockSemas(void)
 int
 SpinlockSemas(void)
 {
-	return NUM_SPINLOCK_SEMAPHORES;
+	return NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES;
 }
 
 /*
@@ -76,7 +76,7 @@ SpinlockSemaInit(PGSemaphore spinsemas)
 {
 	int			i;
 
-	for (i = 0; i < NUM_SPINLOCK_SEMAPHORES; ++i)
+	for (i = 0; i < SpinlockSemas(); ++i)
 		PGSemaphoreCreate(&spinsemas[i]);
 	SpinlockSemaArray = spinsemas;
 }
@@ -86,32 +86,32 @@ SpinlockSemaInit(PGSemaphore spinsemas)
  */
 
 void
-s_init_lock_sema(volatile slock_t *lock)
+s_init_lock_sema(volatile slock_t *lock, bool nested)
 {
-	static int	counter = 0;
+   static int  counter = 0;
 
-	*lock = (++counter) % NUM_SPINLOCK_SEMAPHORES;
+   *lock = (++counter) % NUM_SPINLOCK_SEMAPHORES;
 }
 
 void
 s_unlock_sema(volatile slock_t *lock)
 {
-	PGSemaphoreUnlock(&SpinlockSemaArray[*lock]);
+   PGSemaphoreUnlock(&SpinlockSemaArray[*lock]);
 }
 
 bool
 s_lock_free_sema(volatile slock_t *lock)
 {
-	/* We don't currently use S_LOCK_FREE anyway */
-	elog(ERROR, "spin.c does not support S_LOCK_FREE()");
-	return false;
+   /* We don't currently use S_LOCK_FREE anyway */
+   elog(ERROR, "spin.c does not support S_LOCK_FREE()");
+   return false;
 }
 
 int
 tas_sema(volatile slock_t *lock)
 {
-	/* Note that TAS macros return 0 if *success* */
-	return !PGSemaphoreTryLock(&SpinlockSemaArray[*lock]);
+   /* Note that TAS macros return 0 if *success* */
+   return !PGSemaphoreTryLock(&SpinlockSemaArray[*lock]);
 }
 
 #endif   /* !HAVE_SPINLOCKS */
diff --git a/src/include/c.h b/src/include/c.h
index a48a57a..cbc1b2e 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -859,6 +859,13 @@ typedef NameData *Name;
 #define STATIC_IF_INLINE
 #endif   /* PG_USE_INLINE */
 
+#ifdef PG_USE_INLINE
+#define STATIC_IF_INLINE_DECLARE static inline
+#else
+#define STATIC_IF_INLINE_DECLARE extern
+#endif   /* PG_USE_INLINE */
+
+
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
  * ----------------------------------------------------------------
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 4fb7288..7e0099b 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -87,6 +87,9 @@
 /* Define to 1 if you have the `append_history' function. */
 #undef HAVE_APPEND_HISTORY
 
+/* Define to 1 if you have atomics. */
+#undef HAVE_ATOMICS
+
 /* Define to 1 if you have the `cbrt' function. */
 #undef HAVE_CBRT
 
@@ -173,8 +176,21 @@
 /* Define to 1 if your compiler understands __FUNCTION__. */
 #undef HAVE_FUNCNAME__FUNCTION
 
+/* Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int). */
+#undef HAVE_GCC__ATOMIC_INT32_CAS
+
+/* Define to 1 if you have __atomic_compare_exchange_n(int64 *, int64 *,
+   int64). */
+#undef HAVE_GCC__ATOMIC_INT64_CAS
+
+/* Define to 1 if you have __sync_compare_and_swap(int *, int, int). */
+#undef HAVE_GCC__SYNC_INT32_CAS
+
 /* Define to 1 if you have __sync_lock_test_and_set(int *) and friends. */
-#undef HAVE_GCC_INT_ATOMICS
+#undef HAVE_GCC__SYNC_INT32_TAS
+
+/* Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64). */
+#undef HAVE_GCC__SYNC_INT64_CAS
 
 /* Define to 1 if you have the `getaddrinfo' function. */
 #undef HAVE_GETADDRINFO
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 30a9a6a..fbdab7b 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -337,6 +337,9 @@
 /* Define to 1 if you have spinlocks. */
 #define HAVE_SPINLOCKS 1
 
+/* Define to 1 if you have atomics. */
+#define HAVE_ATOMICS 1
+
 /* Define to 1 if you have the `srandom' function. */
 /* #undef HAVE_SRANDOM */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 374345c..239cf7b 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -65,6 +65,20 @@
 #define NUM_SPINLOCK_SEMAPHORES		1024
 
 /*
+ * When we have neither spinlock nor atomic operations support, atomic
+ * operations are implemented on top of spinlocks, which in turn are built on
+ * top of semaphores. To be safe against atomic operations being used while a
+ * spinlock is held, a separate set of semaphores has to be used.
+ */
+#define NUM_ATOMICS_SEMAPHORES		64
+
+/*
+ * Force usage of atomics.h based spinlocks even if s_lock.h has its own
+ * support for the target platform.
+ */
+#undef USE_ATOMICS_BASED_SPINLOCKS
+
+/*
  * Define this if you want to allow the lo_import and lo_export SQL
  * functions to be executed by ordinary users.  By default these
  * functions are only available to the Postgres superuser.  CAUTION:
@@ -307,3 +321,4 @@
  * set by ALTER SYSTEM command.
  */
 #define PG_AUTOCONF_FILENAME		"postgresql.auto.conf"
+
diff --git a/src/include/storage/atomics-arch-alpha.h b/src/include/storage/atomics-arch-alpha.h
new file mode 100644
index 0000000..06b7125
--- /dev/null
+++ b/src/include/storage/atomics-arch-alpha.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-alpha.h
+ *	  Atomic operations considerations specific to Alpha
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-alpha.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(__GNUC__)
+/*
+ * Unlike all other known architectures, Alpha allows dependent reads to be
+ * reordered, but we don't currently find it necessary to provide a conditional
+ * read barrier to cover that case.  We might need to add that later.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
+#endif
diff --git a/src/include/storage/atomics-arch-amd64.h b/src/include/storage/atomics-arch-amd64.h
new file mode 100644
index 0000000..c849eca
--- /dev/null
+++ b/src/include/storage/atomics-arch-amd64.h
@@ -0,0 +1,184 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-amd64.h
+ *	  Atomic operations considerations specific to intel/amd x86-64
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-amd64.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * x86_64 has similar ordering characteristics to i386.
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered.  In those cases we'd need
+ * the read and write barriers to be lfence and sfence.  But since we don't
+ * do those things, a compiler barrier should be enough.
+ */
+#if defined(__INTEL_COMPILER)
+#	define pg_memory_barrier_impl()		_mm_mfence()
+#elif defined(__GNUC__)
+#	define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
+#endif
+
+#define pg_read_barrier_impl()		pg_compiler_barrier_impl()
+#define pg_write_barrier_impl()		pg_compiler_barrier_impl()
+
+#if defined(__GNUC__)
+
+#if defined(HAVE_ATOMICS)
+
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef struct pg_atomic_flag
+{
+	volatile char value;
+} pg_atomic_flag;
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#endif /* defined(HAVE_ATOMICS) */
+
+#endif /* defined(__GNUC__) */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if defined(HAVE_ATOMICS)
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	register char _res = 1;
+
+	__asm__ __volatile__(
+		"	lock			\n"
+		"	xchgb	%0,%1	\n"
+:		"+q"(_res), "+m"(ptr->value)
+:
+:		"memory");
+	return _res == 0;
+}
+
+#endif /* HAVE_ATOMICS */
+
+#if defined(__GNUC__) && !defined(PG_HAS_SPIN_DELAY)
+#define PG_HAS_SPIN_DELAY
+static __inline__ void
+pg_spin_delay_impl(void)
+{
+	/*
+	 * Adding a PAUSE in the spin delay loop is demonstrably a no-op on
+	 * Opteron, but it may be of some use on EM64T, so we keep it.
+	 */
+	__asm__ __volatile__(
+		" rep; nop			\n");
+}
+#endif
+
+#if defined(WIN32_ONLY_COMPILER) && !defined(PG_HAS_SPIN_DELAY)
+#define PG_HAS_SPIN_DELAY
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	_mm_pause();
+}
+#endif
+
+/*
+ * Use inline assembly on x86 gcc. It's nice to support older gcc's and the
+ * compare/exchange implementation here is actually more efficient than the
+ * __sync variant.
+ */
+#if defined(HAVE_ATOMICS) && defined(__GNUC__)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	char	ret;
+
+	/*
+	 * Perform cmpxchg and use the zero flag which it implicitly sets when
+	 * equal to measure the success.
+	 */
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	cmpxchgl	%4,%5	\n"
+		"   setz		%2		\n"
+:		"=a" (*expected), "=m"(ptr->value), "=r" (ret)
+:		"a" (*expected), "r" (newval), "m"(ptr->value)
+:		"memory", "cc");
+	return (bool) ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 res;
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	xaddl	%0,%1		\n"
+:		"=q"(res), "=m"(ptr->value)
+:		"0" (add_), "m"(ptr->value)
+:		"memory", "cc");
+	return res;
+}
+
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	char	ret;
+
+	/*
+	 * Perform cmpxchg and use the zero flag which it implicitly sets when
+	 * equal to measure the success.
+	 */
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	cmpxchgq	%4,%5	\n"
+		"   setz		%2		\n"
+:		"=a" (*expected), "=m"(ptr->value), "=r" (ret)
+:		"a" (*expected), "r" (newval), "m"(ptr->value)
+:		"memory", "cc");
+	return (bool) ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	uint64 res;
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	xaddq	%0,%1		\n"
+:		"=q"(res), "=m"(ptr->value)
+:		"0" (add_), "m"(ptr->value)
+:		"memory", "cc");
+	return res;
+}
+
+#endif /* defined(HAVE_ATOMICS) && defined(__GNUC__) */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-arch-arm.h b/src/include/storage/atomics-arch-arm.h
new file mode 100644
index 0000000..a00f8f5
--- /dev/null
+++ b/src/include/storage/atomics-arch-arm.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-arm.h
+ *	  Atomic operations considerations specific to ARM
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-arm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+
+/*
+ * FIXME: Disable 32bit atomics for ARMs before v6 or error out
+ */
+
+/*
+ * 64 bit atomics on arm are implemented using kernel fallbacks and might be
+ * slow, so disable entirely for now.
+ */
+#define PG_DISABLE_64_BIT_ATOMICS
diff --git a/src/include/storage/atomics-arch-hppa.h b/src/include/storage/atomics-arch-hppa.h
new file mode 100644
index 0000000..8913801
--- /dev/null
+++ b/src/include/storage/atomics-arch-hppa.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-hppa.h
+ *	  Atomic operations considerations specific to HPPA
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-hppa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* HPPA doesn't do either read or write reordering */
+#define pg_memory_barrier_impl()		pg_compiler_barrier_impl()
diff --git a/src/include/storage/atomics-arch-i386.h b/src/include/storage/atomics-arch-i386.h
new file mode 100644
index 0000000..094764f
--- /dev/null
+++ b/src/include/storage/atomics-arch-i386.h
@@ -0,0 +1,162 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-i386.h
+ *	  Atomic operations considerations specific to intel x86
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-i386.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * i386 does not allow loads to be reordered with other loads, or stores to be
+ * reordered with other stores, but a load can be performed before a subsequent
+ * store.
+ *
+ * "lock; addl" has worked for longer than "mfence".
+ */
+
+#if defined(__INTEL_COMPILER)
+#define pg_memory_barrier_impl()		_mm_mfence()
+#elif defined(__GNUC__)
+#define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#endif
+
+#define pg_read_barrier_impl()		pg_compiler_barrier_impl()
+#define pg_write_barrier_impl()		pg_compiler_barrier_impl()
+
+#if defined(__GNUC__)
+
+#if defined(HAVE_ATOMICS)
+
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef struct pg_atomic_flag
+{
+	volatile char value;
+} pg_atomic_flag;
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#endif /* defined(HAVE_ATOMICS) */
+
+#endif /* defined(__GNUC__) */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if defined(HAVE_ATOMICS)
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	register char _res = 1;
+
+	__asm__ __volatile__(
+		"	lock			\n"
+		"	xchgb	%0,%1	\n"
+:		"+q"(_res), "+m"(ptr->value)
+:
+:		"memory");
+	return _res == 0;
+}
+
+#endif /* HAVE_ATOMICS */
+
+#if defined(__GNUC__) && !defined(PG_HAS_SPIN_DELAY)
+#define PG_HAS_SPIN_DELAY
+static __inline__ void
+pg_spin_delay_impl(void)
+{
+	/*
+	 * This sequence is equivalent to the PAUSE instruction ("rep" is
+	 * ignored by old IA32 processors if the following instruction is
+	 * not a string operation); the IA-32 Architecture Software
+	 * Developer's Manual, Vol. 3, Section 7.7.2 describes why using
+	 * PAUSE in the inner loop of a spin lock is necessary for good
+	 * performance:
+	 *
+	 *     The PAUSE instruction improves the performance of IA-32
+	 *     processors supporting Hyper-Threading Technology when
+	 *     executing spin-wait loops and other routines where one
+	 *     thread is accessing a shared lock or semaphore in a tight
+	 *     polling loop. When executing a spin-wait loop, the
+	 *     processor can suffer a severe performance penalty when
+	 *     exiting the loop because it detects a possible memory order
+	 *     violation and flushes the core processor's pipeline. The
+	 *     PAUSE instruction provides a hint to the processor that the
+	 *     code sequence is a spin-wait loop. The processor uses this
+	 *     hint to avoid the memory order violation and prevent the
+	 *     pipeline flush. In addition, the PAUSE instruction
+	 *     de-pipelines the spin-wait loop to prevent it from
+	 *     consuming execution resources excessively.
+	 */
+	__asm__ __volatile__(
+		" rep; nop			\n");
+}
+#endif /* defined(__GNUC__) && !defined(PG_HAS_SPIN_DELAY) */
+
+#if defined(WIN32_ONLY_COMPILER) && !defined(PG_HAS_SPIN_DELAY)
+#define PG_HAS_SPIN_DELAY
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	/* See comment for gcc code. Same code, MASM syntax */
+	__asm rep nop;
+}
+#endif
+
+/*
+ * Use inline assembly on x86 gcc. It's nice to support older gcc's and the
+ * compare/exchange implementation here is actually more efficient than the
+ * __sync variant.
+ */
+#if defined(HAVE_ATOMICS) && defined(__GNUC__)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	char	ret;
+
+	/*
+	 * Perform cmpxchg and use the zero flag which it implicitly sets when
+	 * equal to measure the success.
+	 */
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	cmpxchgl	%4,%5	\n"
+		"   setz		%2		\n"
+:		"=a" (*expected), "=m"(ptr->value), "=r" (ret)
+:		"a" (*expected), "r" (newval), "m"(ptr->value)
+:		"memory", "cc");
+	return (bool) ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 res;
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	xaddl	%0,%1		\n"
+:		"=q"(res), "=m"(ptr->value)
+:		"0" (add_), "m"(ptr->value)
+:		"memory", "cc");
+	return (int) res;
+	return res;
+
+#endif /* defined(HAVE_ATOMICS) && defined(__GNUC__) */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-arch-ia64.h b/src/include/storage/atomics-arch-ia64.h
new file mode 100644
index 0000000..2863431
--- /dev/null
+++ b/src/include/storage/atomics-arch-ia64.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-ia64.h
+ *	  Atomic operations considerations specific to intel itanium
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-ia64.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#if defined(__INTEL_COMPILER)
+#	define pg_memory_barrier_impl()		__mf()
+#elif defined(__GNUC__)
+#	define pg_memory_barrier_impl()		__asm__ __volatile__ ("mf" : : : "memory")
+#elif defined(__hpux)
+#	define pg_compiler_barrier_impl()	_Asm_sched_fence()
+#	define pg_memory_barrier_impl()		_Asm_mf()
+#endif
diff --git a/src/include/storage/atomics-arch-ppc.h b/src/include/storage/atomics-arch-ppc.h
new file mode 100644
index 0000000..f785c99
--- /dev/null
+++ b/src/include/storage/atomics-arch-ppc.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-arch-ppc.h
+ *	  Atomic operations considerations specific to PowerPC
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/storage/atomics-arch-ppc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(__GNUC__)
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
+#define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#endif
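
As a reminder of what the weaker barriers are for: the typical pattern is publishing a payload behind a flag, roughly as in the sketch below (assuming pg_write_barrier()/pg_read_barrier() end up as the public wrappers around the _impl macros, analogous to the existing barrier.h). lwsync is enough for both sides of that pattern; only pg_memory_barrier() has to pay for a full sync.

/* both pointers reference shared memory; *flag starts out as 0 */

static void
publish(volatile uint32 *payload, volatile uint32 *flag, uint32 value)
{
	*payload = value;
	pg_write_barrier();		/* make the payload visible before the flag */
	*flag = 1;
}

static uint32
consume(volatile uint32 *payload, volatile uint32 *flag)
{
	while (*flag == 0)
		;					/* wait until the producer has published */
	pg_read_barrier();		/* don't read the payload "before" the flag */
	return *payload;
}
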
diff --git a/src/include/storage/atomics-fallback.h b/src/include/storage/atomics-fallback.h
new file mode 100644
index 0000000..cc14a95
--- /dev/null
+++ b/src/include/storage/atomics-fallback.h
@@ -0,0 +1,129 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-fallback.h
+ *    Fallback for platforms without spinlock and/or atomics support. Slow.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/atomics-fallback.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#	error "should be included via atomics.h"
+#endif
+
+#ifndef pg_memory_barrier_impl
+/*
+ * If we have no memory barrier implementation for this architecture, we
+ * fall back to acquiring and releasing a spinlock.  This might, in turn,
+ * fall back to the semaphore-based spinlock implementation, which will be
+ * amazingly slow.
+ *
+ * It's not self-evident that every possible legal implementation of a
+ * spinlock acquire-and-release would be equivalent to a full memory barrier.
+ * For example, I'm not sure that Itanium's acq and rel add up to a full
+ * fence.  But all of our actual implementations seem OK in this regard.
+ */
+#define PG_HAVE_MEMORY_BARRIER_EMULATION
+
+extern void pg_spinlock_barrier(void);
+#define pg_memory_barrier_impl()	pg_spinlock_barrier()
+#endif
+
+/*
+ * If we have no atomics implementation for this platform, fall back to
+ * providing the atomics API using a spinlock to protect the internal state.
+ * Possibly the spinlock implementation itself uses semaphores internally...
+ *
+ * We have to be a bit careful here, as it's not guaranteed that atomic
+ * variables are mapped to the same address in every process (e.g. dynamic
+ * shared memory segments). We can't just hash the address and use that to map
+ * to a spinlock. Instead, assign a spinlock to each atomic variable when it
+ * is initialized.
+ */
+#ifndef PG_HAVE_ATOMIC_FLAG_SUPPORT
+
+#define PG_HAVE_ATOMIC_FLAG_SIMULATION
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+
+typedef struct pg_atomic_flag
+{
+	/*
+	 * To avoid circular includes we can't use s_lock as a type here. Instead
+	 * just reserve enough space for all spinlock types. Some platforms would
+	 * be content with just one byte instead of 4, but that's not too much
+	 * waste.
+	 */
+#if defined(__hppa) || defined(__hppa__)	/* HP PA-RISC, GCC and HP compilers */
+	int			sema[4];
+#else
+	int			sema;
+#endif
+} pg_atomic_flag;
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SUPPORT */
+
+#if !defined(PG_HAVE_ATOMIC_U32_SUPPORT)
+
+#define PG_HAVE_ATOMIC_U32_SIMULATION
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	/* Check pg_atomic_flag's definition above for an explanation */
+#if defined(__hppa) || defined(__hppa__)	/* HP PA-RISC, GCC and HP compilers */
+	int			sema[4];
+#else
+	int			sema;
+#endif
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#endif /* PG_HAVE_ATOMIC_U32_SUPPORT */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#ifdef PG_HAVE_ATOMIC_FLAG_SIMULATION
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+extern void pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+extern bool pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+extern void pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/*
+	 * Can't do this in the semaphore based implementation - always return
+	 * true. Do this in the header so compilers can optimize the test away.
+	 */
+	return true;
+}
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+#ifdef PG_HAVE_ATOMIC_U32_SIMULATION
+
+#define PG_HAS_ATOMIC_INIT_U32
+extern void pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_);
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+extern bool pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+												uint32 *expected, uint32 newval);
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+extern uint32 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_);
+
+#endif /* PG_HAVE_ATOMIC_U32_SIMULATION */
+
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
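
One consequence of assigning the backing lock at initialization time: with the simulation in place an atomic variable is only usable once its init function has run, since that call is what ties the variable to its semaphore/spinlock. A sketch using the _impl declarations above directly:

typedef struct SharedCounter
{
	pg_atomic_uint32 nrequests;
} SharedCounter;

/* run once while setting up the shared memory segment */
static void
SharedCounterInit(SharedCounter *sc)
{
	/* picks a backing spinlock/semaphore and records it in the variable */
	pg_atomic_init_u32_impl(&sc->nrequests, 0);
}

/* safe to call from any backend afterwards */
static uint32
SharedCounterBump(SharedCounter *sc)
{
	return pg_atomic_fetch_add_u32_impl(&sc->nrequests, 1);
}
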
diff --git a/src/include/storage/atomics-generic-acc.h b/src/include/storage/atomics-generic-acc.h
new file mode 100644
index 0000000..2dd8011
--- /dev/null
+++ b/src/include/storage/atomics-generic-acc.h
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-acc.h
+ *	  Atomic operations support when using HPs acc on HPUX
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * inline assembly for Itanium-based HP-UX:
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf
+ * * Implementing Spinlocks on the Intel ® Itanium ® Architecture and PA-RISC
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf
+ *
+ * Itanium only supports a small set of numbers (-16, -8, -4, -1, 1, 4, 8, 16)
+ * for atomic add/sub, so we just implement everything but compare_exchange via
+ * the compare_exchange fallbacks in atomics-generic.h.
+ *
+ * src/include/storage/atomics-generic-acc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <machine/sys/inline.h>
+
+/* IA64 always has 32/64 bit atomics */
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define MINOR_FENCE (_Asm_fence) (_UP_CALL_FENCE | _UP_SYS_FENCE | \
+								 _DOWN_CALL_FENCE | _DOWN_SYS_FENCE )
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	/*
+	 * We want a barrier, not just release/acquire semantics.
+	 */
+	_Asm_mf();
+	/*
+	 * Notes:
+	 * DOWN_MEM_FENCE | _UP_MEM_FENCE prevents reordering by the compiler
+	 */
+	current =  _Asm_cmpxchg(_SZ_W, /* word */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	_Asm_mf();
+	current =  _Asm_cmpxchg(_SZ_D, /* doubleword */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#undef MINOR_FENCE
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-gcc.h b/src/include/storage/atomics-generic-gcc.h
new file mode 100644
index 0000000..f491234
--- /dev/null
+++ b/src/include/storage/atomics-generic-gcc.h
@@ -0,0 +1,218 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-gcc.h
+ *	  Atomic operations, implemented using gcc (or compatible) intrinsics.
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Legacy __sync Built-in Functions for Atomic Memory Access
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fsync-Builtins.html
+ * * Built-in functions for memory model aware atomic operations
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fatomic-Builtins.html
+ *
+ * src/include/storage/atomics-generic-gcc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+#define pg_compiler_barrier_impl()	__asm__ __volatile__("" ::: "memory")
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory
+ * barrier out of this compiler built-in.  But we prefer to rely on our
+ * own definitions where possible, and use this only as a fallback.
+ */
+#if !defined(pg_memory_barrier_impl)
+#	if defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#		define pg_memory_barrier_impl()		__atomic_thread_fence(__ATOMIC_SEQ_CST)
+#	elif (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
+#		define pg_memory_barrier_impl()		__sync_synchronize()
+#	endif
+#endif /* !defined(pg_memory_barrier_impl) */
+
+#if !defined(pg_read_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#		define pg_read_barrier_impl()		__atomic_thread_fence(__ATOMIC_ACQUIRE)
+#endif
+
+#if !defined(pg_write_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#		define pg_write_barrier_impl()		__atomic_thread_fence(__ATOMIC_RELEASE)
+#endif
+
+#ifdef HAVE_ATOMICS
+
+/* generic gcc based atomic flag implementation */
+#if !defined(PG_HAVE_ATOMIC_FLAG_SUPPORT) \
+	&& (defined(HAVE_GCC__SYNC_INT32_TAS) || defined(HAVE_GCC__SYNC_CHAR_TAS))
+
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef struct pg_atomic_flag
+{
+#ifdef HAVE_GCC__SYNC_CHAR_TAS
+	volatile char value;
+#else
+	/* use an int here since that works on more platforms */
+	volatile int value;
+#endif
+} pg_atomic_flag;
+
+#endif /* !ATOMIC_FLAG_SUPPORT && SYNC_INT32_TAS */
+
+/* generic gcc based atomic uint32 implementation */
+#if !defined(PG_HAVE_ATOMIC_U32_SUPPORT) \
+	&& (defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS))
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#endif /* defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS) */
+
+/* generic gcc based atomic uint64 implementation */
+#if !defined(PG_HAVE_ATOMIC_U64_SUPPORT) \
+	&& (defined(HAVE_GCC__ATOMIC_INT64_CAS) || defined(HAVE_GCC__SYNC_INT64_CAS))
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#endif /* defined(HAVE_GCC__ATOMIC_INT64_CAS) || defined(HAVE_GCC__SYNC_INT64_CAS) */
+
+/*
+ * Implementation follows. Inlined or directly included from atomics.c
+ */
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && \
+	(defined(HAVE_GCC__SYNC_CHAR_TAS) || defined(HAVE_GCC__SYNC_INT32_TAS))
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* NB: only an acquire barrier, not a full one */
+	/* some platforms only support setting the value to 1 here */
+	return __sync_lock_test_and_set(&ptr->value, 1) == 0;
+}
+#endif /* !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(HAVE_GCC__SYNC_*_TAS) */
+
+#ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return ptr->value == 0;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_CLEAR_FLAG
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/*
+	 * XXX: It would be nicer to use __sync_lock_release here, but gcc insists
+	 * on making that an atomic op, which is far too expensive and a stronger
+	 * guarantee than we actually need.
+	 */
+	pg_write_barrier_impl();
+	ptr->value = 0;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_INIT_FLAG
+#define PG_HAS_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_clear_flag_impl(ptr);
+}
+#endif
+
+/* prefer __atomic, it has a better API */
+#if !defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	/* FIXME: we can probably use a lower consistency model */
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return __sync_fetch_and_add(&ptr->value, add_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return __sync_fetch_and_add(&ptr->value, add_);
+}
+#endif
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
+
+#endif /* defined(HAVE_ATOMICS) */
+
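
The reason the __sync based compare_exchange wrappers above need the current/expected dance while the __atomic based ones are a plain passthrough: __sync_val_compare_and_swap() only hands back the old value, whereas __atomic_compare_exchange_n() both reports success and writes the observed value into *expected. A small standalone demonstration:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t	val = 7;
	uint32_t	expected = 3;	/* deliberately wrong */
	uint32_t	old;
	bool		ok;

	/* fails because val != expected; expected is overwritten with 7 */
	ok = __atomic_compare_exchange_n(&val, &expected, 42, false,
									 __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	printf("ok=%d val=%u expected=%u\n",
		   (int) ok, (unsigned) val, (unsigned) expected);	/* 0 7 7 */

	/* __sync only returns the old value; success has to be inferred */
	old = __sync_val_compare_and_swap(&val, 7, 42);
	printf("old=%u val=%u\n", (unsigned) old, (unsigned) val);	/* 7 42 */

	return 0;
}

That is the "better API" the comment above refers to.
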
diff --git a/src/include/storage/atomics-generic-msvc.h b/src/include/storage/atomics-generic-msvc.h
new file mode 100644
index 0000000..8fefc10
--- /dev/null
+++ b/src/include/storage/atomics-generic-msvc.h
@@ -0,0 +1,90 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-msvc.h
+ *	  Atomic operations support when using MSVC
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Interlocked Variable Access
+ *   http://msdn.microsoft.com/en-us/library/ms684122%28VS.85%29.aspx
+ *
+ * src/include/storage/atomics-generic-msvc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <intrin.h>
+#include <windows.h>
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/* Should work on both MSVC and Borland. */
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier_impl()	_ReadWriteBarrier()
+
+#ifndef pg_memory_barrier_impl
+#define pg_memory_barrier_impl()	MemoryBarrier()
+#endif
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = InterlockedCompareExchange(&ptr->value, newval, *expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return InterlockedExchangeAdd(&ptr->value, add_);
+}
+
+/*
+ * The non-intrinsic version is only available on Vista and up, so use the
+ * intrinsic version. It is only supported on >486, but we require XP as a
+ * minimum baseline, which doesn't support the 486, so we don't need to add a
+ * check for that case.
+ */
+#pragma intrinsic(_InterlockedCompareExchange64)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = _InterlockedCompareExchange64(&ptr->value, newval, *expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-sunpro.h b/src/include/storage/atomics-generic-sunpro.h
new file mode 100644
index 0000000..1f30be3
--- /dev/null
+++ b/src/include/storage/atomics-generic-sunpro.h
@@ -0,0 +1,87 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-sunpro.h
+ *	  Atomic operations for solaris' CC
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * manpage for atomic_cas(3C)
+ *   http://www.unix.com/man-page/opensolaris/3c/atomic_cas/
+ *   http://docs.oracle.com/cd/E23824_01/html/821-1465/atomic-cas-3c.html
+ *
+ * src/include/storage/atomics-generic-sunpro.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	current = atomic_cas_32(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return atomic_add_32(&ptr->value, add_);
+}
+
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	current = atomic_cas_64(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return atomic_add_64(&ptr->value, add_);
+}
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic-xlc.h b/src/include/storage/atomics-generic-xlc.h
new file mode 100644
index 0000000..4351d1d
--- /dev/null
+++ b/src/include/storage/atomics-generic-xlc.h
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic-xlc.h
+ *	  Atomic operations for IBM's CC
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Synchronization and atomic built-in functions
+ *   http://publib.boulder.ibm.com/infocenter/lnxpcomp/v8v101/topic/com.ibm.xlcpp8l.doc/compiler/ref/bif_sync.htm
+ *
+ * src/include/storage/atomics-generic-xlc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+/* FIXME: check for 64bit mode, only supported there */
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+
+	/*
+	 * xlc's documentation tells us:
+	 * "If __compare_and_swap is used as a locking primitive, insert a call to
+	 * the __isync built-in function at the start of any critical sections."
+	 */
+	__isync();
+
+	/*
+	 * XXX: __compare_and_swap is defined to take signed parameters, but that
+	 * shouldn't matter since we don't perform any arithmetic operations. It
+	 * returns whether the swap was performed and copies the value it found
+	 * into *old_val_addr, i.e. into *expected, which is just what we need.
+	 */
+	ret = __compare_and_swap((volatile int *) &ptr->value,
+							 (int *) expected, (int) newval);
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return __fetch_and_add(&ptr->value, add_);
+}
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+
+	__isync();
+
+	/* like __compare_and_swap above, but on 64bit values */
+	ret = __compare_and_swaplp((volatile long *) &ptr->value,
+							   (long *) expected, (long) newval);
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return __fetch_and_addlp(&ptr->value, add_);
+}
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics-generic.h b/src/include/storage/atomics-generic.h
new file mode 100644
index 0000000..265e49c
--- /dev/null
+++ b/src/include/storage/atomics-generic.h
@@ -0,0 +1,414 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics-generic.h
+ *	  Atomic operations.
+ *
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * src/include/storage/atomics-generic.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#	error "should be included via atomics.h"
+#endif
+
+#ifndef pg_memory_barrier_impl
+#	error "postgres requires a memory barrier implementation"
+#endif
+
+#ifndef pg_compiler_barrier_impl
+#	error "postgres requires a compiler barrier implementation"
+#endif
+
+/*
+ * If read or write barriers are undefined, we upgrade them to full memory
+ * barriers.
+ */
+#if !defined(pg_read_barrier_impl)
+#	define pg_read_barrier_impl pg_memory_barrier_impl
+#endif
+#if !defined(pg_write_barrier_impl)
+#	define pg_write_barrier_impl pg_memory_barrier_impl
+#endif
+
+#ifndef PG_HAS_SPIN_DELAY
+#define PG_HAS_SPIN_DELAY
+#define pg_spin_delay_impl()	((void)0)
+#endif
+
+
+/* provide fallback */
+#if !defined(PG_HAVE_ATOMIC_FLAG_SUPPORT) && defined(PG_HAVE_ATOMIC_U32_SUPPORT)
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef pg_atomic_uint32 pg_atomic_flag;
+#endif
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+/* Need a way to implement spinlocks */
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && \
+	!defined(PG_HAS_ATOMIC_EXCHANGE_U32) && \
+	!defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#	error "postgres requires atomic ops for locking to be provided"
+#endif
+
+#ifndef PG_HAS_ATOMIC_READ_U32
+#define PG_HAS_ATOMIC_READ_U32
+static inline uint32
+pg_atomic_read_u32_impl(volatile pg_atomic_uint32 *ptr)
+{
+	return *(&ptr->value);
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_WRITE_U32
+#define PG_HAS_ATOMIC_WRITE_U32
+static inline void
+pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	ptr->value = val;
+}
+#endif
+
+/*
+ * provide fallback for test_and_set using atomic_exchange if available
+ */
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_exchange_u32_impl(ptr, 1) == 0;
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_read_u32_impl(ptr) == 0;
+}
+
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier_impl();
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+/*
+ * provide fallback for test_and_set using atomic_compare_exchange if
+ * available.
+ */
+#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	uint32 value = 0;
+	return pg_atomic_compare_exchange_u32_impl(ptr, &value, 1);
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_read_u32_impl(ptr) == 0;
+}
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier_impl();
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG)
+#	error "No pg_atomic_test_and_set provided"
+#endif /* !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) */
+
+
+#ifndef PG_HAS_ATOMIC_INIT_U32
+#define PG_HAS_ATOMIC_INIT_U32
+static inline void
+pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
+{
+	pg_atomic_write_u32_impl(ptr, val_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_EXCHANGE_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_EXCHANGE_U32
+static inline uint32
+pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_SUB_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_SUB_U32
+static inline uint32
+pg_atomic_fetch_sub_u32_impl(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	return pg_atomic_fetch_add_u32_impl(ptr, -sub_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_AND_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_AND_U32
+static inline uint32
+pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_OR_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_OR_U32
+static inline uint32
+pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_ADD_FETCH_U32) && defined(PG_HAS_ATOMIC_FETCH_ADD_U32)
+#define PG_HAS_ATOMIC_ADD_FETCH_U32
+static inline uint32
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return pg_atomic_fetch_add_u32_impl(ptr, add_) + add_;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_SUB_FETCH_U32) && defined(PG_HAS_ATOMIC_FETCH_SUB_U32)
+#define PG_HAS_ATOMIC_SUB_FETCH_U32
+static inline uint32
+pg_atomic_sub_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_UNTIL_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_ADD_UNTIL_U32
+static inline uint32
+pg_atomic_fetch_add_until_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_, uint32 until)
+{
+	uint32 old;
+	while (true)
+	{
+		uint32 new_val;
+		old = pg_atomic_read_u32_impl(ptr);
+		if (old >= until)
+			break;
+
+		new_val = old + add_;
+		if (new_val > until)
+			new_val = until;
+
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, new_val))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+#if !defined(PG_HAS_ATOMIC_EXCHANGE_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_EXCHANGE_U64
+static inline uint64
+pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = ptr->value;
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_WRITE_U64
+#define PG_HAS_ATOMIC_WRITE_U64
+static inline void
+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	/*
+	 * 64 bit writes aren't safe on all platforms. In the generic
+	 * implementation, implement them as an atomic exchange and discard the
+	 * old value.
+	 */
+	pg_atomic_exchange_u64_impl(ptr, val);
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_READ_U64
+#define PG_HAS_ATOMIC_READ_U64
+static inline uint64
+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
+	uint64 old = 0;
+
+	/*
+	 * 64 bit reads aren't safe on all platforms. In the generic
+	 * implementation, implement them as a compare/exchange with 0. That'll
+	 * either fail or succeed, but always return the old value. It might
+	 * store a 0, but only if the previous value also was a 0.
+	 */
+	pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);
+
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_INIT_U64
+#define PG_HAS_ATOMIC_INIT_U64
+static inline void
+pg_atomic_init_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val_)
+{
+	pg_atomic_write_u64_impl(ptr, val_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_SUB_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_FETCH_SUB_U64
+static inline uint64
+pg_atomic_fetch_sub_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	return pg_atomic_fetch_add_u64_impl(ptr, -sub_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_AND_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_FETCH_AND_U64
+static inline uint64
+pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_OR_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_FETCH_OR_U64
+static inline uint64
+pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_ADD_FETCH_U64) && defined(PG_HAS_ATOMIC_FETCH_ADD_U64)
+#define PG_HAS_ATOMIC_ADD_FETCH_U64
+static inline uint64
+pg_atomic_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return pg_atomic_fetch_add_u64_impl(ptr, add_) + add_;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_SUB_FETCH_U64) && defined(PG_HAS_ATOMIC_FETCH_SUB_U64)
+#define PG_HAS_ATOMIC_SUB_FETCH_U64
+static inline uint64
+pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/atomics.h b/src/include/storage/atomics.h
new file mode 100644
index 0000000..8b58a72
--- /dev/null
+++ b/src/include/storage/atomics.h
@@ -0,0 +1,564 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.h
+ *	  Atomic operations.
+ *
+ * Hardware and compiler dependent functions for manipulating memory
+ * atomically and dealing with cache coherency. Used to implement
+ * locking facilities and replacements.
+ *
+ * To bring up postgres on a platform/compiler at the very least
+ * either one of
+ * * pg_atomic_test_set_flag(), pg_atomic_init_flag(), pg_atomic_clear_flag()
+ * * pg_atomic_compare_exchange_u32()
+ * * pg_atomic_exchange_u32()
+ * and
+ * * pg_compiler_barrier(), pg_write_barrier(), pg_read_barrier()
+ * should be implemented. There exist generic, hardware independent,
+ * implementations for several compilers which might be sufficient,
+ * although possibly not optimal, for a new platform. If no such generic
+ * implementation is available spinlocks (or even OS provided semaphores)
+ * will be used to implement the API. Won't be blazingly fast though.
+ *
+ * Use higher level functionality (lwlocks, spinlocks, heavyweight
+ * locks) whenever possible. Writing correct code using these
+ * facilities is hard.
+ *
+ * For an introduction to using memory barriers within the PostgreSQL backend,
+ * see src/backend/storage/lmgr/README.barrier
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/atomics.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ATOMICS_H
+#define ATOMICS_H
+
+#define INSIDE_ATOMICS_H
+
+#include <limits.h>
+
+/*
+ * Architecture specific implementations/considerations.
+ */
+#if defined(__arm__) || defined(__arm) || \
+	defined(__aarch64__) || defined(__aarch64)
+#	include "storage/atomics-arch-arm.h"
+#elif defined(__i386__) || defined(__i386)
+#	include "storage/atomics-arch-i386.h"
+#elif defined(__x86_64__)
+#	include "storage/atomics-arch-amd64.h"
+#elif defined(__ia64__) || defined(__ia64)
+#	include "storage/atomics-arch-ia64.h"
+#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+#	include "storage/atomics-arch-ppc.h"
+#elif defined(__alpha) || defined(__alpha__)
+#	include "storage/atomics-arch-alpha.h"
+#elif defined(__hppa) || defined(__hppa__)
+#	include "storage/atomics-arch-hppa.h"
+#endif
+
+/*
+ * Compiler specific, but architecture independent implementations.
+ */
+/* gcc or compatible, including icc */
+#if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
+#	include "storage/atomics-generic-gcc.h"
+#elif defined(WIN32_ONLY_COMPILER)
+#	include "storage/atomics-generic-msvc.h"
+#elif defined(__hpux) && defined(__ia64) && !defined(__GNUC__)
+#	include "storage/atomics-generic-acc.h"
+#elif defined(__SUNPRO_C) && !defined(__GNUC__)
+#	include "storage/atomics-generic-sunpro.h"
+#elif (defined(__IBMC__) || defined(__IBMCPP__)) && !defined(__GNUC__)
+#	include "storage/atomics-generic-xlc.h"
+#else
+/* unsupported compiler, we'll likely use slow fallbacks... */
+#endif
+
+/*
+ * Provide fallbacks for platforms without sufficient spinlock and/or atomics
+ * support.
+ */
+#include "storage/atomics-fallback.h"
+
+/*
+ * Provide additional operations using supported infrastructure. These are
+ * expected to be efficient.
+ */
+#include "storage/atomics-generic.h"
+
+/*
+ * pg_compiler_barrier - prevent the compiler from moving code
+ *
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ */
+#define pg_compiler_barrier()	pg_compiler_barrier_impl()
+
+/*
+ * pg_memory_barrier - prevent the CPU from reordering memory access
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ */
+#define pg_memory_barrier()	pg_memory_barrier_impl()
+
+/*
+ * pg_(read|write)_barrier - prevent the CPU from reordering memory access
+ *
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier.	Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores.  Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier.  In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ */
+#define pg_read_barrier()	pg_read_barrier_impl()
+#define pg_write_barrier()	pg_write_barrier_impl()
+
+/*
+ * Spinloop delay -
+ */
+#define pg_spin_delay()	pg_spin_delay_impl()
+
+
+/*
+ * pg_atomic_init_flag - initialize lock
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_test_set_flag - TAS()
+ *
+ * Returns true if the flag was successfully set, false if it was already set.
+ *
+ * Acquire/read barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_unlocked_test_flag - check whether the flag is currently clear
+ *
+ * Returns true if the flag is not set; does not perform a TAS().
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_clear_flag - release lock set by TAS()
+ *
+ * Release/write barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
+
+/*
+ * pg_atomic_init_u32 - initialize atomic variable
+ *
+ * Has to be done before any usage, except when the atomic variable is
+ * declared statically, in which case the variable is initialized to 0.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+
+/*
+ * pg_atomic_read_u32 - read from atomic variable
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr);
+
+/*
+ * pg_atomic_write_u32 - write to atomic variable
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+
+/*
+ * pg_atomic_exchange_u32 - exchange newval with current value
+ *
+ * Returns the old value of 'ptr' before the swap.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval);
+
+/*
+ * pg_atomic_compare_exchange_u32 - CAS operation
+ *
+ * Atomically compare the current value of ptr with *expected and store newval
+ * iff ptr and *expected have the same value. The current value of *ptr will
+ * always be stored in *expected.
+ *
+ * Return whether the values have been exchanged.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+							   uint32 *expected, uint32 newval);
+
+/*
+ * pg_atomic_fetch_add_u32 - atomically add to variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
+
+/*
+ * pg_atomic_fetch_sub_u32 - atomically subtract from variable
+ *
+ * Returns the value of ptr before the arithmetic operation. Note that
+ * sub_ may not be INT_MIN due to platform limitations.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
+
+/*
+ * pg_atomic_fetch_and_u32 - atomically bit-and and_ with variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_);
+
+/*
+ * pg_atomic_fetch_or_u32 - atomically bit-or or_ with variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_);
+
+/*
+ * pg_atomic_add_fetch_u32 - atomically add to variable
+ *
+ * Returns the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
+
+/*
+ * pg_atomic_sub_fetch_u32 - atomically subtract from variable
+ *
+ * Returns the value of ptr after the arithmetic operation. Note that sub_
+ * may not be INT_MIN due to platform limitations.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
+
+/*
+ * pg_atomic_fetch_add_until_u32 - saturated addition to variable
+ *
+ * Adds add_ to the variable, but never increments it beyond 'until'.
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
+							  uint32 until);
+
+/* ----
+ * The 64 bit operations have the same semantics as their 32bit counterparts if
+ * they are available.
+ * ---
+ */
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr);
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval);
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+							   uint64 *expected, uint64 newval);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, int64 add_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, int64 sub_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_);
+
+STATIC_IF_INLINE_DECLARE uint64
+pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_);
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+
+/*
+ * The following functions are wrapper functions around the platform specific
+ * implementation of the atomic operations performing common checks.
+ */
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \
+	Assert(TYPEALIGN(bndr, (uintptr_t)(ptr)) == (uintptr_t)(ptr))
+
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	pg_atomic_init_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	return pg_atomic_test_set_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	return pg_atomic_unlocked_test_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_clear_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	pg_atomic_clear_flag_impl(ptr);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	pg_atomic_init_u32_impl(ptr, val);
+}
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	pg_atomic_write_u32_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_read_u32_impl(ptr);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	return pg_atomic_exchange_u32_impl(ptr, newval);
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+							   uint32 *expected, uint32 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	CHECK_POINTER_ALIGNMENT(expected, 4);
+
+	return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_u32_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	Assert(sub_ != INT_MIN);
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_and_u32_impl(ptr, and_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_or_u32_impl(ptr, or_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_add_fetch_u32_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	Assert(sub_ != INT_MIN);
+	return pg_atomic_sub_fetch_u32_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
+							  uint32 until)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
+}
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+
+	pg_atomic_init_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_read_u64_impl(ptr);
+}
+
+STATIC_IF_INLINE void
+pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	pg_atomic_write_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+
+	return pg_atomic_exchange_u64_impl(ptr, newval);
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+							   uint64 *expected, uint64 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	CHECK_POINTER_ALIGNMENT(expected, 8);
+	return pg_atomic_compare_exchange_u64_impl(ptr, expected, newval);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_add_u64_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	Assert(sub_ != -INT64CONST(0x7FFFFFFFFFFFFFFF) - 1);
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_and_u64_impl(ptr, and_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_or_u64_impl(ptr, or_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_add_fetch_u64_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	Assert(sub_ != -INT64CONST(0x7FFFFFFFFFFFFFFF) - 1);
+	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
+}
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
+
+#undef INSIDE_ATOMICS_H
+
+#endif /* ATOMICS_H */
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index bc61de0..5730a29 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -13,167 +13,11 @@
 #ifndef BARRIER_H
 #define BARRIER_H
 
-#include "storage/s_lock.h"
-
-extern slock_t dummy_spinlock;
-
-/*
- * A compiler barrier need not (and preferably should not) emit any actual
- * machine code, but must act as an optimization fence: the compiler must not
- * reorder loads or stores to main memory around the barrier.  However, the
- * CPU may still reorder loads or stores at runtime, if the architecture's
- * memory model permits this.
- *
- * A memory barrier must act as a compiler barrier, and in addition must
- * guarantee that all loads and stores issued prior to the barrier are
- * completed before any loads or stores issued after the barrier.  Unless
- * loads and stores are totally ordered (which is not the case on most
- * architectures) this requires issuing some sort of memory fencing
- * instruction.
- *
- * A read barrier must act as a compiler barrier, and in addition must
- * guarantee that any loads issued prior to the barrier are completed before
- * any loads issued after the barrier.  Similarly, a write barrier acts
- * as a compiler barrier, and also orders stores.  Read and write barriers
- * are thus weaker than a full memory barrier, but stronger than a compiler
- * barrier.  In practice, on machines with strong memory ordering, read and
- * write barriers may require nothing more than a compiler barrier.
- *
- * For an introduction to using memory barriers within the PostgreSQL backend,
- * see src/backend/storage/lmgr/README.barrier
- */
-
-#if defined(DISABLE_BARRIERS)
-
-/*
- * Fall through to the spinlock-based implementation.
- */
-#elif defined(__INTEL_COMPILER)
-
-/*
- * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
- */
-#if defined(__ia64__) || defined(__ia64)
-#define pg_memory_barrier()		__mf()
-#elif defined(__i386__) || defined(__x86_64__)
-#define pg_memory_barrier()		_mm_mfence()
-#endif
-
-#define pg_compiler_barrier()	__memory_barrier()
-#elif defined(__GNUC__)
-
-/* This works on any architecture, since it's only talking to GCC itself. */
-#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
-
-#if defined(__i386__)
-
-/*
- * i386 does not allow loads to be reordered with other loads, or stores to be
- * reordered with other stores, but a load can be performed before a subsequent
- * store.
- *
- * "lock; addl" has worked for longer than "mfence".
- */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__x86_64__)		/* 64 bit x86 */
-
-/*
- * x86_64 has similar ordering characteristics to i386.
- *
- * Technically, some x86-ish chips support uncached memory access and/or
- * special instructions that are weakly ordered.  In those cases we'd need
- * the read and write barriers to be lfence and sfence.  But since we don't
- * do those things, a compiler barrier should be enough.
- */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__ia64__) || defined(__ia64)
-
-/*
- * Itanium is weakly ordered, so read and write barriers require a full
- * fence.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")
-#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
-
 /*
- * lwsync orders loads with respect to each other, and similarly with stores.
- * But a load can be performed before a subsequent store, so sync must be used
- * for a full memory barrier.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
-#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-#elif defined(__alpha) || defined(__alpha__)	/* Alpha */
-
-/*
- * Unlike all other known architectures, Alpha allows dependent reads to be
- * reordered, but we don't currently find it necessary to provide a conditional
- * read barrier to cover that case.  We might need to add that later.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
-#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
-#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
-#elif defined(__hppa) || defined(__hppa__)		/* HP PA-RISC */
-
-/* HPPA doesn't do either read or write reordering */
-#define pg_memory_barrier()		pg_compiler_barrier()
-#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
-
-/*
- * If we're on GCC 4.1.0 or higher, we should be able to get a memory
- * barrier out of this compiler built-in.  But we prefer to rely on our
- * own definitions where possible, and use this only as a fallback.
- */
-#define pg_memory_barrier()		__sync_synchronize()
-#endif
-#elif defined(__ia64__) || defined(__ia64)
-
-#define pg_compiler_barrier()	_Asm_sched_fence()
-#define pg_memory_barrier()		_Asm_mf()
-#elif defined(WIN32_ONLY_COMPILER)
-
-/* Should work on both MSVC and Borland. */
-#include <intrin.h>
-#pragma intrinsic(_ReadWriteBarrier)
-#define pg_compiler_barrier()	_ReadWriteBarrier()
-#define pg_memory_barrier()		MemoryBarrier()
-#endif
-
-/*
- * If we have no memory barrier implementation for this architecture, we
- * fall back to acquiring and releasing a spinlock.  This might, in turn,
- * fall back to the semaphore-based spinlock implementation, which will be
- * amazingly slow.
- *
- * It's not self-evident that every possible legal implementation of a
- * spinlock acquire-and-release would be equivalent to a full memory barrier.
- * For example, I'm not sure that Itanium's acq and rel add up to a full
- * fence.  But all of our actual implementations seem OK in this regard.
- */
-#if !defined(pg_memory_barrier)
-#define pg_memory_barrier() \
-	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
-#endif
-
-/*
- * If read or write barriers are undefined, we upgrade them to full memory
- * barriers.
- *
- * If a compiler barrier is unavailable, you probably don't want a full
- * memory barrier instead, so if you have a use case for a compiler barrier,
- * you'd better use #ifdef.
+ * This used to be a separate file, full of compiler/architecture
+ * dependent defines, but those now live in the atomics.h infrastructure;
+ * this file is just kept for backward compatibility.
  */
-#if !defined(pg_read_barrier)
-#define pg_read_barrier()			pg_memory_barrier()
-#endif
-#if !defined(pg_write_barrier)
-#define pg_write_barrier()			pg_memory_barrier()
-#endif
+#include "storage/atomics.h"
 
 #endif   /* BARRIER_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 3b94340..c14974d 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -16,6 +16,7 @@
 
 #include "lib/ilist.h"
 #include "storage/s_lock.h"
+#include "storage/atomics.h"
 
 struct PGPROC;
 
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index ba4dfe1..5dffac7 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -96,6 +96,8 @@
 
 #ifdef HAVE_SPINLOCKS	/* skip spinlocks if requested */
 
+#if !defined(USE_ATOMICS_BASED_SPINLOCKS)
+
 #if defined(__GNUC__) || defined(__INTEL_COMPILER)
 /*************************************************************************
  * All the gcc inlines
@@ -890,13 +892,26 @@ spin_delay(void)
 	/* See comment for gcc code. Same code, MASM syntax */
 	__asm rep nop;
 }
-#endif
-
-#endif
+#endif /* defined(_WIN64) */
 
+#endif /* WIN32_ONLY_COMPILER */
 
 #endif	/* !defined(HAS_TEST_AND_SET) */
 
+#endif /* !defined(USE_ATOMICS_BASED_SPINLOCKS) */
+
+
+#ifndef HAS_TEST_AND_SET
+#include "storage/atomics.h"
+
+#define HAS_TEST_AND_SET
+typedef pg_atomic_flag slock_t;
+#define TAS(lock)			(pg_atomic_test_set_flag(lock) ? 0 : 1)
+#define TAS_SPIN(lock)		(pg_atomic_unlocked_test_flag(lock) ? TAS(lock) : 1)
+#define S_UNLOCK(lock)		pg_atomic_clear_flag(lock)
+#define S_INIT_LOCK(lock)	pg_atomic_init_flag(lock)
+#define SPIN_DELAY()		pg_spin_delay()
+#endif
 
 /* Blow up if we didn't have any way to do spinlocks */
 #ifndef HAS_TEST_AND_SET
@@ -916,12 +931,12 @@ typedef int slock_t;
 
 extern bool s_lock_free_sema(volatile slock_t *lock);
 extern void s_unlock_sema(volatile slock_t *lock);
-extern void s_init_lock_sema(volatile slock_t *lock);
+extern void s_init_lock_sema(volatile slock_t *lock, bool nested);
 extern int	tas_sema(volatile slock_t *lock);
 
 #define S_LOCK_FREE(lock)	s_lock_free_sema(lock)
 #define S_UNLOCK(lock)	 s_unlock_sema(lock)
-#define S_INIT_LOCK(lock)	s_init_lock_sema(lock)
+#define S_INIT_LOCK(lock)	s_init_lock_sema(lock, false)
 #define TAS(lock)	tas_sema(lock)
 
 
diff --git a/src/test/regress/expected/lock.out b/src/test/regress/expected/lock.out
index 0d7c3ba..fd27344 100644
--- a/src/test/regress/expected/lock.out
+++ b/src/test/regress/expected/lock.out
@@ -60,3 +60,11 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
+ test_atomic_ops 
+-----------------
+ t
+(1 row)
+
diff --git a/src/test/regress/input/create_function_1.source b/src/test/regress/input/create_function_1.source
index aef1518..1fded84 100644
--- a/src/test/regress/input/create_function_1.source
+++ b/src/test/regress/input/create_function_1.source
@@ -57,6 +57,11 @@ CREATE FUNCTION make_tuple_indirect (record)
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
 
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
+
 -- Things that shouldn't work:
 
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
diff --git a/src/test/regress/output/create_function_1.source b/src/test/regress/output/create_function_1.source
index 9761d12..9926c90 100644
--- a/src/test/regress/output/create_function_1.source
+++ b/src/test/regress/output/create_function_1.source
@@ -51,6 +51,10 @@ CREATE FUNCTION make_tuple_indirect (record)
         RETURNS record
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
 -- Things that shouldn't work:
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
     AS 'SELECT ''not an integer'';';
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index c146109..4b889d2 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -16,6 +16,7 @@
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "executor/spi.h"
+#include "storage/atomics.h"
 #include "utils/builtins.h"
 #include "utils/geo_decls.h"
 #include "utils/rel.h"
@@ -822,3 +823,227 @@ make_tuple_indirect(PG_FUNCTION_ARGS)
 	 */
 	PG_RETURN_POINTER(newtup->t_data);
 }
+
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+static void
+test_atomic_flag(void)
+{
+	pg_atomic_flag flag;
+
+	pg_atomic_init_flag(&flag);
+
+	if (!pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unduly set");
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+
+	if (pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: set spuriously #2");
+
+	pg_atomic_clear_flag(&flag);
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+}
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+static void
+test_atomic_uint32(void)
+{
+	pg_atomic_uint32 var;
+	uint32 expected;
+	int i;
+
+	pg_atomic_init_u32(&var, 0);
+
+	if (pg_atomic_read_u32(&var) != 0)
+		elog(ERROR, "atomic_read_u32() #1 wrong");
+
+	pg_atomic_write_u32(&var, 3);
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "atomic_read_u32() #2 wrong");
+
+	if (pg_atomic_fetch_add_u32(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u32() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u32(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u32() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u32(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u32() #1 wrong");
+
+	if (pg_atomic_add_fetch_u32(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u32() #0 wrong");
+
+	/* test around numerical limits */
+	if (pg_atomic_fetch_add_u32(&var, INT_MAX) != 0)
+		elog(ERROR, "pg_atomic_fetch_add_u32() #2 wrong");
+
+	if (pg_atomic_fetch_add_u32(&var, INT_MAX) != INT_MAX)
+		elog(ERROR, "pg_atomic_fetch_add_u32() #3 wrong");
+
+	pg_atomic_fetch_add_u32(&var, 1); /* top up to UINT_MAX */
+
+	if (pg_atomic_read_u32(&var) != UINT_MAX)
+		elog(ERROR, "atomic_read_u32() #2 wrong");
+
+	if (pg_atomic_fetch_sub_u32(&var, INT_MAX) != UINT_MAX)
+		elog(ERROR, "pg_atomic_fetch_sub_u32() #2 wrong");
+
+	if (pg_atomic_read_u32(&var) != (uint32)INT_MAX + 1)
+		elog(ERROR, "atomic_read_u32() #3 wrong: %u", pg_atomic_read_u32(&var));
+
+	expected = pg_atomic_sub_fetch_u32(&var, INT_MAX);
+	if (expected != 1)
+		elog(ERROR, "pg_atomic_sub_fetch_u32() #3 wrong: %u", expected);
+
+	pg_atomic_sub_fetch_u32(&var, 1);
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u32(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u32() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 1000; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u32(&var, &expected, 1))
+			break;
+	}
+	if (i == 1000)
+		elog(ERROR, "atomic_compare_exchange_u32() never succeeded");
+	if (pg_atomic_read_u32(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u32() didn't set value properly");
+
+	pg_atomic_write_u32(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u32(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u32() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u32(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u32() #2 wrong");
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u32()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u32(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #1 wrong");
+
+	if (pg_atomic_fetch_and_u32(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #2 wrong: is %u",
+			 pg_atomic_read_u32(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u32(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #3 wrong");
+}
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+static void
+test_atomic_uint64(void)
+{
+	pg_atomic_uint64 var;
+	uint64 expected;
+	int i;
+
+	pg_atomic_init_u64(&var, 0);
+
+	if (pg_atomic_read_u64(&var) != 0)
+		elog(ERROR, "atomic_read_u64() #1 wrong");
+
+	pg_atomic_write_u64(&var, 3);
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "atomic_read_u64() #2 wrong");
+
+	if (pg_atomic_fetch_add_u64(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u64() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u64(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u64() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u64(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u64() #1 wrong");
+
+	if (pg_atomic_add_fetch_u64(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u64() #0 wrong");
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u64(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u64() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 100; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u64(&var, &expected, 1))
+			break;
+	}
+	if (i == 100)
+		elog(ERROR, "atomic_compare_exchange_u64() never succeeded");
+	if (pg_atomic_read_u64(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u64() didn't set value properly");
+
+	pg_atomic_write_u64(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u64(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u64() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u64(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u64() #2 wrong");
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u64()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u64(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #1 wrong");
+
+	if (pg_atomic_fetch_and_u64(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #2 wrong: is "UINT64_FORMAT,
+			 pg_atomic_read_u64(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u64(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #3 wrong");
+}
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+
+PG_FUNCTION_INFO_V1(test_atomic_ops);
+Datum
+test_atomic_ops(PG_FUNCTION_ARGS)
+{
+	/*
+	 * Can't run the test under the semaphore emulation, it doesn't handle
+	 * checking edge cases well.
+	 */
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+	test_atomic_flag();
+#endif
+
+	test_atomic_uint32();
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+	test_atomic_uint64();
+#endif
+
+	PG_RETURN_BOOL(true);
+}
diff --git a/src/test/regress/sql/lock.sql b/src/test/regress/sql/lock.sql
index dda212f..567e8bc 100644
--- a/src/test/regress/sql/lock.sql
+++ b/src/test/regress/sql/lock.sql
@@ -64,3 +64,8 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+
+
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
-- 
2.0.0.rc2.4.g1dc51c6.dirty

#37Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#1)
Re: better atomics - v0.5

On 06/25/2014 10:14 AM, Andres Freund wrote:

Hi,

[sorry for the second copy Robert]

Attached is a new version of the atomic operations patch. Lots has
changed since the last post:

Is this at a state where we can performance-test it yet?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#37)
Re: better atomics - v0.5

On 2014-06-25 10:39:53 -0700, Josh Berkus wrote:

On 06/25/2014 10:14 AM, Andres Freund wrote:

Hi,

[sorry for the second copy Robert]

Attached is a new version of the atomic operations patch. Lots has
changed since the last post:

Is this at a state where we can performance-test it yet?

Well, this patch itself won't have a performance impact at all. It's
about providing the infrastructure to use atomic operations in further
patches in a compiler- and platform-independent way.
I'll repost a new version of the lwlocks patch (still fighting an issue
I introduced while rebasing on top of Heikki's xlog scalability/lwlock
merger) soon. Addressing the review comments Amit has made since.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#36)
Re: better atomics - v0.5

On Wed, Jun 25, 2014 at 1:14 PM, Andres Freund <andres@2ndquadrant.com> wrote:

[sorry for the second copy Robert]

Attached is a new version of the atomic operations patch. Lots has
changed since the last post:

* gcc, msvc work. acc, xlc, sunpro have blindly written support which
should be relatively easy to fix up. All try to implement TAS, 32 bit
atomic add, 32 bit compare exchange; some do it for 64bit as well.

* x86 gcc inline assembly means that we support every gcc version on x86
  >= i486; even when the __sync APIs aren't available yet.

* 'inline' support isn't required anymore. We fall back to
ATOMICS_INCLUDE_DEFINITIONS/STATIC_IF_INLINE etc. trickery.

* When the current platform doesn't have support for atomic operations a
spinlock backed implementation is used. This can be forced using
--disable-atomics.

That even works when semaphores are used to implement spinlocks (a
separate array is used there to avoid nesting problems). In contrast
to an earlier implementation this even works when atomics are mapped
to different addresses in individual processes (think dsm).

* s_lock.h falls back to the atomics.h provided APIs iff it doesn't have
native support for the current platform. This can be forced by
defining USE_ATOMICS_BASED_SPINLOCKS. Due to generic compiler
intrinsics based implementations that should make it easier to bring
up postgres on different platforms.

* I tried to improve the documentation of the facilities in
src/include/storage/atomics.h. Including documentation of the various
barrier semantics.

* There's tests in regress.c that get call via a SQL function from the
regression tests.

* Lots of other details changed, but that's the major pieces.

Review:

- The changes to spin.c include unnecessary whitespace adjustments.
- The new argument to s_init_lock_sema() isn't used.
- "on top" is two words, not one ("ontop").
- The changes to pg_config_manual.h add a superfluous blank line.
- Are the regression tests in regress.c likely to catch anything? Really?
- The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which
seems to be testing for __builtin_constant_p. I don't see that being
used in the patch anywhere, though; and also, why doesn't the naming
match?
- s_lock.h adds some fallback code to use atomics to define TAS on
platforms where GCC supports atomics but where we do not have a
working TAS implementation. However, this supposed fallback code
defines HAS_TEST_AND_SET unconditionally, so I think it will get used
even if we don't have (or have disabled via configure) atomics.
- I completely loathe the file layout you've proposed in
src/include/storage. If we're going to have separate #include files
for each architecture (and I'd rather we didn't go that route, because
it's messy and unlike what we have now), then shouldn't that stuff go
in src/include/port or a new directory src/include/port/atomics?
Polluting src/include/storage with a whole bunch of files called
atomics-whatever.h strikes me as completely ugly, and there's a lot of
duplicate code in there.
- What's the point of defining pg_read_barrier() to be
pg_read_barrier_impl() and pg_write_barrier() to be
pg_write_barrier_impl() and so forth? That seems like unnecessary
notation.
- Should SpinlockSemaInit() be assigning SpinlockSemas() to a local
variable instead of re-executing it every time? Granted, the compiler
might inline this somehow, but...

More generally, I'm really wondering if you're making the
compare-and-swap implementation a lot more complicated than it needs
to be. How much would we lose if we supported compiler intrinsics on
versions of GCC and MSVC that have it and left everything else to
future patches? I suspect this patch could be about 20% of its
current size and give us everything that we need. I've previously
opposed discarding s_lock.h on the grounds that it's extremely well
battle-tested, and if we change it, we might introduce subtle bugs
that are dependent on GCC versions and such. But now that compiler
intrinsics are relatively common, I don't really see any reason for us
to provide our own assembler versions of *new* primitives we want to
use. As long as we have some kind of wrapper functions or macros, we
retain the *option* to add workarounds for compiler bugs or lack of
compiler support on platforms, but it seems an awful lot simpler to me
to start by assuming that the compiler will DTRT, and only roll
our own if that proves to be necessary. It's not like our hand-rolled
implementations couldn't have bugs - which is different from s_lock.h,
where we have evidence that the exact coding in that file does or did
work on those platforms.
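
To make that concrete, such a wrapper might look roughly like the
following -- just a sketch built on the HAVE_GCC__ATOMIC_INT32_CAS /
HAVE_GCC__SYNC_INT32_CAS configure symbols the patch already introduces,
not proposed code:

/*
 * Prefer the newer __atomic builtin, fall back to __sync, and keep a single
 * place where a platform-specific workaround could be slotted in later.
 */
static inline bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
									uint32 *expected, uint32 newval)
{
#if defined(HAVE_GCC__ATOMIC_INT32_CAS)
	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#elif defined(HAVE_GCC__SYNC_INT32_CAS)
	uint32	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
	bool	ret = (current == *expected);

	*expected = current;
	return ret;
#else
#error "provide a workaround or hand-rolled implementation here"
#endif
}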

Similarly, I would rip out - or separate into a separate patch - the
code that tries to use atomic operations to implement TAS if we don't
already have an implementation in s_lock.h. That might be worth
doing, if there are any common architectures that don't already have
TAS, or to simplify porting to new platforms, but it seems like a
separate thing from what this patch is mainly about, which is enabling
the use of CAS and related operations on platforms that support them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#39)
Re: better atomics - v0.5

Hi,

On 2014-06-25 15:54:37 -0400, Robert Haas wrote:

- The new argument to s_init_lock_sema() isn't used.

It's used when atomics fall back to spinlocks, which in turn fall back to
semaphores. C.f. atomics.c.

Since it had better be legal to manipulate an atomic variable while holding a
spinlock, we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.
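
To illustrate the point (hypothetical names, not the code in atomics.c):
the emulated atomic can carry an index into a small semaphore array that
is reserved for atomics only, so acquiring it never nests with a regular
spinlock's semaphore, and the index stays valid even when the containing
memory is mapped at different addresses in different backends.

#define NUM_ATOMICS_SEMAPHORES 64	/* hypothetical */

typedef struct pg_atomic_flag_sim
{
	int		sema_id;	/* index into the atomics-only semaphore array */
	bool	value;
} pg_atomic_flag_sim;

static bool
sim_test_set_flag(volatile pg_atomic_flag_sim *ptr)
{
	bool	was_set;

	atomics_sema_lock(ptr->sema_id);	/* hypothetical helper */
	was_set = ptr->value;
	ptr->value = true;
	atomics_sema_unlock(ptr->sema_id);	/* hypothetical helper */

	return !was_set;
}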

- Are the regression tests in regress.c likely to catch anything?
Really?

They already caught several bugs. If the operations were immediately
widely used they probably wouldn't be necessary.

- The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which
seems to be testing for __builtin_constant_p. I don't see that being
used in the patch anywhere, though; and also, why doesn't the naming
match?

Uh, that's a rebase screw-up. It should be entirely gone.

- s_lock.h adds some fallback code to use atomics to define TAS on
platforms where GCC supports atomics but where we do not have a
working TAS implementation. However, this supposed fallback code
defines HAS_TEST_AND_SET unconditionally, so I think it will get used
even if we don't have (or have disabled via configure) atomics.

Hm. Good point. It should protect against that.
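
One conceivable shape for that guard, reusing the
PG_HAVE_ATOMIC_FLAG_SIMULATION symbol the patch already defines for the
spinlock-backed emulation (sketch only, not the final wording):

#ifndef HAS_TEST_AND_SET
#include "storage/atomics.h"

/*
 * Only use the atomics-backed TAS if the flag operations are real atomics;
 * if they are themselves emulated with spinlocks (e.g. --disable-atomics)
 * this would be circular.
 */
#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
#define HAS_TEST_AND_SET
typedef pg_atomic_flag slock_t;
#define TAS(lock)			(pg_atomic_test_set_flag(lock) ? 0 : 1)
#define S_UNLOCK(lock)		pg_atomic_clear_flag(lock)
#define S_INIT_LOCK(lock)	pg_atomic_init_flag(lock)
#define SPIN_DELAY()		pg_spin_delay()
#endif
#endif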

- I completely loathe the file layout you've proposed in
src/include/storage. If we're going to have separate #include files
for each architecture (and I'd rather we didn't go that route, because
it's messy and unlike what we have now), then shouldn't that stuff go
in src/include/port or a new directory src/include/port/atomics?
Polluting src/include/storage with a whole bunch of files called
atomics-whatever.h strikes me as completely ugly, and there's a lot of
duplicate code in there.

I don't mind moving it somewhere else and it's easy enough to change.

- What's the point of defining pg_read_barrier() to be
pg_read_barrier_impl() and pg_write_barrier() to be
pg_write_barrier_impl() and so forth? That seems like unnecessary
notation.

We could forgo it for pg_read_barrier() et al, but for the actual
atomics it's useful because it allows putting common checks (e.g. the
alignment asserts) in the non-_impl wrapper routines.

- Should SpinlockSemaInit() be assigning SpinlockSemas() to a local
variable instead of re-executing it every time? Granted, the compiler
might inline this somehow, but...

I sure hope it will, but still a good idea.

More generally, I'm really wondering if you're making the
compare-and-swap implementation a lot more complicated than it needs
to be.

Possible.

How much would we lose if we supported compiler intrinsics on
versions of GCC and MSVC that have it and left everything else to
future patches?

The discussion so far seemed pretty clear that we can't regress somewhat
frequently used platforms. And dropping support for all but msvc and gcc
would end up doing that. We're going to have to do the legwork for the
most common platforms... Note that I didn't use assembler for those, but
relied on intrinsics...

I suspect this patch could be about 20% of its current size and give
us everything that we need.

I think the only way to do that would be to create a s_lock.h like
maze. Which imo is a utterly horrible idea. I'm pretty sure there'll be
more assembler implementations for 9.5 and unless we think ahead of how
to structure that we'll create something bad.
I also think that a year or so down the road we're going to support more
operations. E.g. optional support for a 16-byte CAS can allow for some
awesome things when you want to do lockless stuff on a 64-bit platform.
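
For instance, on x86-64 gcc built with -mcx16 a 16-byte CAS could look
roughly like this (sketch only, relies on the compiler providing the
16-byte __sync builtins; not part of this patch):

typedef struct pg_atomic_uint128
{
	volatile unsigned __int128 value;
} pg_atomic_uint128;

static inline bool
pg_atomic_compare_exchange_u128_impl(volatile pg_atomic_uint128 *ptr,
									 unsigned __int128 *expected,
									 unsigned __int128 newval)
{
	unsigned __int128 current;
	bool	ret;

	/* cmpxchg16b under the hood when -mcx16 is in effect */
	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
	ret = (current == *expected);
	*expected = current;
	return ret;
}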

I think there's chances for reducing the size by merging i386/amd64,
that looked better earlier than it does now (due to the addition of the
inline assembler).

I've previously
opposed discarding s_lock.h on the grounds that it's extremely well
battle-tested, and if we change it, we might introduce subtle bugs
that are dependent on GCC versions and such.

I think we should entirely get rid of s_lock.h. It's an unmaintainable
mess. That's what I'd done in an earlier version of the patch, but you'd
argued against it. I think you're right in that it shouldn't be done in
the same patch and possibly not even in the same cycle, but I think we
better do it sometime not too far away.
I e.g. don't see how we can guarantee sane barrier semantics with the
current implementation.

But now that compiler
intrinsics are relatively common, I don't really see any reason for us
to provide our own assembler versions of *new* primitives we want to
use. As long as we have some kind of wrapper functions or macros, we
retain the *option* to add workarounds for compiler bugs or lack of
compiler support on platforms, but it seems an awful lot simpler to me
to start by assuming that the compiler will DTRT, and only roll
our own if that proves to be necessary. It's not like our hand-rolled
implementations couldn't have bugs - which is different from s_lock.h,
where we have evidence that the exact coding in that file does or did
work on those platforms.

I added the x86 inline assembler because a fair number of buildfarm
animals use ancient gcc's and because I could easily test it. It's also
more efficient for gcc < 4.6. I'm not wedded to keeping it.

Similarly, I would rip out - or separate into a separate patch - the
code that tries to use atomic operations to implement TAS if we don't
already have an implementation in s_lock.h. That might be worth
doing, if there are any common architectures that don't already have
TAS, or to simplify porting to new platforms, but it seems like a
separate thing from what this patch is mainly about, which is enabling
the use of CAS and related operations on platforms that support them.

Happy to separate that out. It's also very useful to test code btw...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#41Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#40)
Re: better atomics - v0.5

On 06/25/2014 11:36 PM, Andres Freund wrote:

- I completely loathe the file layout you've proposed in
src/include/storage. If we're going to have separate #include files
for each architecture (and I'd rather we didn't go that route, because
it's messy and unlike what we have now), then shouldn't that stuff go
in src/include/port or a new directory src/include/port/atomics?
Polluting src/include/storage with a whole bunch of files called
atomics-whatever.h strikes me as completely ugly, and there's a lot of
duplicate code in there.

I don't mind moving it somewhere else and it's easy enough to change.

I think having a separate file for each architecture is nice. I totally
agree that they don't belong in src/include/storage, though. s_lock.h
has always been misplaced there, but we've let it be for historical
reasons, but now that we're adding a dozen new files, it's time to move
them out.

- Heikki


#42Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#41)
Re: better atomics - v0.5

Heikki Linnakangas wrote:

On 06/25/2014 11:36 PM, Andres Freund wrote:

- I completely loathe the file layout you've proposed in
src/include/storage. If we're going to have separate #include files
for each architecture (and I'd rather we didn't go that route, because
it's messy and unlike what we have now), then shouldn't that stuff go
in src/include/port or a new directory src/include/port/atomics?
Polluting src/include/storage with a whole bunch of files called
atomics-whatever.h strikes me as completely ugly, and there's a lot of
duplicate code in there.

I don't mind moving it somewhere else and it's easy enough to change.

I think having a separate file for each architecture is nice. I
totally agree that they don't belong in src/include/storage, though.
s_lock.h has always been misplaced there, but we've let it be for
historical reasons, but now that we're adding a dozen new files,
it's time to move them out.

We also have a bunch of files in src/backend/storage/lmgr, both current
and introduced by this patch, which would be good to relocate. (Really,
how did lmgr/ end up in storage/ in the first place?)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#43Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#40)
Re: better atomics - v0.5

On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Since it better be legal to manipulate an atomic variable while holding a
spinlock we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule. Most atomics, by nature, are
loop-until-it-works.

How much would we lose if we supported compiler intrinsics on
versions of GCC and MSVC that have it and left everything else to
future patches?

The discussion so far seemed pretty clear that we can't regress somewhat
frequently used platforms. And dropping support for all but msvc and gcc
would end up doing that. We're going to have to do the legwork for the
most common platforms... Note that I didn't use assembler for those, but
relied on intrinsics...

We can't *regress* somewhat-frequently used platforms, but that's
different from saying we've got to support *new* facilities on those
platforms. Essentially the entire buildfarm is running either gcc,
some Microsoft compiler, or a compiler that supports the same
atomics intrinsics as gcc, i.e. icc or clang. Some of those compilers
may be too old to support the atomics intrinsics, and there's one Sun
Studio animal, but I don't know that we need to care about those
things in this patch...

...unless of course the atomics fallbacks in this patch are
sufficiently sucky that anyone who ends up using those is going to be
sad. Then the bar is a lot higher. But if that's the case then I
wonder if we're really on the right course here.

I added the x86 inline assembler because a fair number of buildfarm
animals use ancient gcc's and because I could easily test it. It's also
more efficient for gcc < 4.6. I'm not wedded to keeping it.

Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
implementations aren't that good, that's a pretty good argument for
keeping our own implementations around. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#44Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#41)
Re: better atomics - v0.5

On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I think having a separate file for each architecture is nice. I totally
agree that they don't belong in src/include/storage, though. s_lock.h has
always been misplaced there, but we've let it be for historical reasons, but
now that we're adding a dozen new files, it's time to move them out.

I find the current organization pretty confusing, but maybe that could
be solved by better documentation of what's supposed to go in each
architecture or compiler-dependent file.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#45Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#43)
Re: better atomics - v0.5

On 2014-06-25 20:16:08 -0400, Robert Haas wrote:

On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Since it better be legal to manipulate an atomic variable while holding a
spinlock we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule. Most atomics, by nature, are
loop-until-it-works.

Well, so is TAS itself :).

More seriously, I think we're not going to have much fun if we're making
up the rule that you can't do an atomic add/sub while a spinlock is
held. That just precludes too many use cases and will make the code much
harder to understand. I don't think we're going to end up having many
problems if we allow atomic read/add/sub/write in there.

How much would we lose if we supported compiler intrinsics on
versions of GCC and MSVC that have it and left everything else to
future patches?

The discussion so far seemed pretty clear that we can't regress somewhat
frequently used platforms. And dropping support for all but msvc and gcc
would end up doing that. We're going to have to do the legwork for the
most common platforms... Note that I didn't use assembler for those, but
relied on intrinsics...

We can't *regress* somewhat-frequently used platforms, but that's
different from saying we've got to support *new* facilities on those
platforms.

Well, we want to use the new facilities somewhere. And it's not entirely
unfair to call it a regression if other os/compiler/arch combinations
speed up but yours don't.

Essentially the entire buildfarm is running either gcc,
some Microsoft compiler, or a compiler that supports the same
atomics intrinsics as gcc, i.e. icc or clang. Some of those compilers
may be too old to support the atomics intrinsics, and there's one Sun
Studio animal, but I don't know that we need to care about those
things in this patch...

I think we should support those where it's easy and where we'll see the
breakage on the buildfarm. Sun studio's intrinsics are simple enough
that I don't think my usage of them will be too hard to fix.

...unless of course the atomics fallbacks in this patch are
sufficiently sucky that anyone who ends up using those is going to be
sad. Then the bar is a lot higher. But if that's the case then I
wonder if we're really on the right course here.

I don't think you'll notice it much on mostly non-concurrent uses. But some
of those architectures/platforms do support larger NUMA-like setups. And
for those it'd probably hurt for some workloads. And e.g. Solaris is
still fairly common for such setups.

I added the x86 inline assembler because a fair number of buildfarm
animals use ancient gcc's and because I could easily test it. It's also
more efficient for gcc < 4.6. I'm not wedded to keeping it.

Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
implementations aren't that good, that's a pretty good argument for
keeping our own implementations around. :-(

The difference isn't that big. It's

    return __atomic_compare_exchange_n(&ptr->value, expected, newval,
                                       false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);

vs

    bool ret;
    uint64 current;
    current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
    ret = current == *expected;
    *expected = current;
    return ret;

On x86 that's an additional cmp, mov and such. Useless extra instructions,
because cmpxchg already computes all that... (that's returned by the setz in
my patch). MSVC only supports the second form as well.

There's a fair number of < 4.2 gccs on the buildfarm as well though.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#46Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#44)
Re: better atomics - v0.5

On 2014-06-25 20:22:31 -0400, Robert Haas wrote:

On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I think having a separate file for each architecture is nice. I totally
agree that they don't belong in src/include/storage, though. s_lock.h has
always been misplaced there, but we've let it be for historical reasons, but
now that we're adding a dozen new files, it's time to move them out.

I find the current organization pretty confusing, but maybe that could
be solved by better documentation of what's supposed to go in each
architecture or compiler-dependent file.

The idea is that first an architecture-specific file (atomics-arch-*.h)
is included. That file can provide a (partial) implementation for the
specific architecture. Or it can do pretty much nothing.

After that a compiler specific file is included
(atomics-generic-*.h). If atomics aren't yet implemented that can
provide an intrinsics based implementation if the compiler version has
support for it. At the very least a compiler barrier should be provided.

After that the spinlock based fallback implementation
(atomics-fallback.h) provides atomics and barriers if not yet
available. By here we're sure that nothing else will provide them.

Then we can provide operations (atomics-generic.h) that build on top of
the provided functions. E.g. implement _sub, _and et al.

I'll include some more of that explanation in the header.
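
A compressed sketch of that include order (the file names are the ones above;
the guard macro and the x86 instantiation are made up for illustration):

/* atomics.h -- illustrative skeleton only */
#if defined(__x86_64__) || defined(__i386__)
#include "atomics-arch-x86.h"       /* may implement some ops in assembler */
#endif

#if defined(__GNUC__)
#include "atomics-generic-gcc.h"    /* intrinsics for whatever is still missing */
#elif defined(_MSC_VER)
#include "atomics-generic-msvc.h"
#endif

#ifndef PG_HAVE_ATOMIC_U32          /* made-up guard name */
#include "atomics-fallback.h"       /* spinlock-based emulation, plus barriers */
#endif

#include "atomics-generic.h"        /* derive _sub, _and, ... from the above */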

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#47Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#45)
Re: better atomics - v0.5

On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-25 20:16:08 -0400, Robert Haas wrote:

On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Since it better be legal to manipulate an atomic variable while holding a
spinlock we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule. Most atomics, by nature, are
loop-until-it-works.

Well, so is TAS itself :).

More seriously, I think we're not going to have much fun if we're making
up the rule that you can't do an atomic add/sub while a spinlock is
held. That just precludes too many use cases and will make the code much
harder to understand. I don't think we're going to end up having many
problems if we allow atomic read/add/sub/write in there.

That rule seems reasonable -- why would you ever want to do this?
While you couldn't properly deadlock, it seems like it could lead to
unpredictable and hard-to-diagnose performance stalls.

merlin


#48Andres Freund
andres@2ndquadrant.com
In reply to: Merlin Moncure (#47)
Re: better atomics - v0.5

On 2014-06-26 07:44:14 -0500, Merlin Moncure wrote:

On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-25 20:16:08 -0400, Robert Haas wrote:

On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Since it better be legal to manipulate an atomic variable while holding a
spinlock we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule. Most atomics, by nature, are
loop-until-it-works.

Well, so is TAS itself :).

More seriously, I think we're not going to have much fun if we're making
up the rule that you can't do an atomic add/sub while a spinlock is
held. That just precludes too many use cases and will make the code much
harder to understand. I don't think we're going to end up having many
problems if we allow atomic read/add/sub/write in there.

That rule seems reasonable -- why would you ever want to do this?

Are you wondering why you'd ever manipulate an atomic op inside a
spinlock?
There are enough algorithms where a slowpath is done under a spinlock but
the fastpath is done without. You can't simply use non-atomic operations
for manipulation under the spinlock because the fastpaths might then
observe/cause bogus state.

Obviously I'm not advocating doing random stuff in spinlocks. We're
talking about single lock xadd's (or the LL/SC equivalent) here.
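
As a concrete (if simplified) sketch of that pattern -- the struct, field and
flag names are invented here; the pg_atomic_* calls follow the API proposed in
the patch, and SpinLock* is the existing spinlock API:

typedef struct
{
    slock_t          mutex;     /* protects the slow path */
    pg_atomic_uint32 state;     /* flag bits + counter, also hit by fast paths */
} MySlot;

#define MY_FLAG (1U << 31)

static void
set_flag_slowpath(MySlot *slot)
{
    SpinLockAcquire(&slot->mutex);

    /*
     * Concurrent fast paths may do pg_atomic_fetch_add_u32(&slot->state, 1)
     * without taking the spinlock, so a plain "|=" here could silently lose
     * one of their increments.  A single atomic op keeps both paths correct.
     */
    pg_atomic_fetch_or_u32(&slot->state, MY_FLAG);

    SpinLockRelease(&slot->mutex);
}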

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#49Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#45)
Re: better atomics - v0.5

On Thu, Jun 26, 2014 at 12:20:06PM +0200, Andres Freund wrote:

On 2014-06-25 20:16:08 -0400, Robert Haas wrote:

On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Since it better be legal to manipulate an atomic variable while holding a
spinlock we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule. Most atomics, by nature, are
loop-until-it-works.

Well, so is TAS itself :).

More seriously, I think we're not going to have much fun if we're making
up the rule that you can't do an atomic add/sub while a spinlock is
held. That just precludes too many use cases and will make the code much
harder to understand. I don't think we're going to end up having many
problems if we allow atomic read/add/sub/write in there.

I agree it's valuable to have a design that permits invoking an atomic
operation while holding a spinlock.

I added the x86 inline assembler because a fair number of buildfarm
animals use ancient gcc's and because I could easily test it. It's also
more efficient for gcc < 4.6. I'm not wedded to keeping it.

Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
implementations aren't that good, that's a pretty good argument for
keeping our own implementations around. :-(

GCC 4.6 was released 2011-03-25, though that doesn't change the bottom line.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


#50Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#45)
Re: better atomics - v0.5

On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-25 20:16:08 -0400, Robert Haas wrote:

On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Since it better be legal to manipulate an atomic variable while holding a
spinlock we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule. Most atomics, by nature, are
loop-until-it-works.

Well, so is TAS itself :).

Yeah, which is why we have a rule that you're not supposed to acquire
another spinlock while already holding one. You're trying to make the
spinlocks used to simulate atomic ops an exception to that rule, but
I'm not sure that's OK. An atomic op is a lot more expensive than a
regular unlocked load or store even if it doesn't loop, and if it does
loop, it's worse, potentially by a large multiple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#51Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#46)
Re: better atomics - v0.5

On Thu, Jun 26, 2014 at 7:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-25 20:22:31 -0400, Robert Haas wrote:

On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I think having a separate file for each architecture is nice. I totally
agree that they don't belong in src/include/storage, though. s_lock.h has
always been misplaced there, but we've let it be for historical reasons, but
now that we're adding a dozen new files, it's time to move them out.

I find the current organization pretty confusing, but maybe that could
be solved by better documentation of what's supposed to go in each
architecture or compiler-dependent file.

The idea is that first an architecture-specific file (atomics-arch-*.h)
is included. That file can provide a (partial) implementation for the
specific architecture. Or it can do pretty much nothing.

After that a compiler specific file is included
(atomics-generic-*.h). If atomics aren't yet implemented that can
provide an intrinsics based implementation if the compiler version has
support for it. At the very least a compiler barrier should be provided.

After that the spinlock based fallback implementation
(atomics-fallback.h) provides atomics and barriers if not yet
available. By here we're sure that nothing else will provide them.

Then we can provide operations (atomics-generic.h) that build on top of
the provided functions. E.g. implement _sub, _and et al.

I'll include some more of that explanation in the header.

I get the general principle, but I think it would be good to have one
place that says something like:

Each architecture must provide A and B, may provide both or neither of
C and D, and may also provide any or all of E, F, and G.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#50)
Re: better atomics - v0.5

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-25 20:16:08 -0400, Robert Haas wrote:

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule. Most atomics, by nature, are
loop-until-it-works.

Well, so is TAS itself :).

Yeah, which is why we have a rule that you're not supposed to acquire
another spinlock while already holding one. You're trying to make the
spinlocks used to simulate atomic ops an exception to that rule, but
I'm not sure that's OK. An atomic op is a lot more expensive than a
regular unlocked load or store even if it doesn't loop, and if it does
loop, it's worse, potentially by a large multiple.

Well, the point here is to be sure that there's a reasonably small bound
on how long we hold the spinlock. It doesn't have to be zero, just small.

Would it be practical to say that the coding rule is that you can only
use atomic ops on fields that are protected by the spinlock, ie, nobody
else is supposed to be touching those fields while you have the spinlock?
If that's the case, then the atomic op should see no contention and thus
not take very long.

On the other hand, if the atomic op is touching something not protected
by the spinlock, that seems to me to be morally equivalent to taking a
spinlock while holding another one, which as Robert says is forbidden
by our current coding rules, and for very good reasons IMO.

However, that may be safe only as long as we have real atomic ops.
If we're simulating those using spinlocks then you have to ask questions
about how many underlying spinlocks there are and whether artificial
contention might ensue (due to the same spinlock being used for multiple
things).

regards, tom lane


#53Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#52)
Re: better atomics - v0.5

On 2014-06-26 11:47:15 -0700, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-25 20:16:08 -0400, Robert Haas wrote:

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule. Most atomics, by nature, are
loop-until-it-works.

Well, so is TAS itself :).

Yeah, which is why we have a rule that you're not supposed to acquire
another spinlock while already holding one. You're trying to make the
spinlocks used to simulate atomic ops an exception to that rule, but
I'm not sure that's OK. An atomic op is a lot more expensive than a
regular unlocked load or store even if it doesn't loop, and if it does
loop, it's worse, potentially by a large multiple.

Well, the point here is to be sure that there's a reasonably small bound
on how long we hold the spinlock. It doesn't have to be zero, just small.

Would it be practical to say that the coding rule is that you can only
use atomic ops on fields that are protected by the spinlock, ie, nobody
else is supposed to be touching those fields while you have the spinlock?
If that's the case, then the atomic op should see no contention and thus
not take very long.

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.

On the other hand, if the atomic op is touching something not protected
by the spinlock, that seems to me to be morally equivalent to taking a
spinlock while holding another one, which as Robert says is forbidden
by our current coding rules, and for very good reasons IMO.

However, that may be safe only as long as we have real atomic ops.
If we're simulating those using spinlocks then you have to ask questions
about how many underlying spinlocks there are and whether artificial
contention might ensue (due to the same spinlock being used for multiple
things).

With the code as submitted it's one spinlock per atomic variable. Which
is fine until spinlocks are emulated using semaphores - in that case a
separate array of semaphores (quite small in the patch as submitted) is
used so there's no chance of deadlocks against a 'real' spinlock or
anything like that.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#54Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#52)
Re: better atomics - v0.5

On Thu, Jun 26, 2014 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:
Would it be practical to say that the coding rule is that you can only
use atomic ops on fields that are protected by the spinlock, ie, nobody
else is supposed to be touching those fields while you have the spinlock?
If that's the case, then the atomic op should see no contention and thus
not take very long.

I wonder if this is true in all cases. The address you are locking
might be logically protected but at the same time nearby to other
memory that is under contention. In other words, I don't think you
can assume an atomic op locks exactly 4 bytes of memory for the
operation.

On the other hand, if the atomic op is touching something not protected
by the spinlock, that seems to me to be morally equivalent to taking a
spinlock while holding another one, which as Robert says is forbidden
by our current coding rules, and for very good reasons IMO.

Well not quite: you don't have the possibility of deadlock.

merlin


#55Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#53)
Re: better atomics - v0.5

On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.

If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock? That seems hard to reason about, and unnecessary.
Critical sections for spinlocks should be short and contain only the
instructions that need protection, and clearly the atomic op is not
one of those.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#56Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#55)
Re: better atomics - v0.5

On 2014-06-27 13:15:25 -0400, Robert Haas wrote:

On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.

If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock? That seems hard to reason about, and unnecessary.
Critical sections for spinlocks should be short and contain only the
instructions that need protection, and clearly the atomic op is not
one of those.

Imagine the situation for the buffer header spinlock which is one of the
bigger performance issues atm. We could aim to replace all usages of
that with clever and complicated logic, but it's hard.

The IMO reasonable (and prototyped) way to do it is to make the common
paths lockless, but fall back to the spinlock for the more complicated
situations. For the buffer header that means that pin/unpin and buffer
lookup are lockless, but IO and changing the identity of a buffer still
require the spinlock. My attempts to avoid the latter basically required
a buffer header specific reimplementation of spinlocks.

To make pin/unpin et al safe without acquiring the spinlock the pincount
and the flags had to be moved into one variable so that when
incrementing the pincount you automatically get the flagbits back. If
it's currently valid all is good, otherwise the spinlock needs to be
acquired. But for that to work you need to set flags while the spinlock
is held.

Possibly you can come up with ways to avoid that, but I am pretty damn
sure that the code will be less robust by forbidding atomics inside a
critical section, not the contrary. It's a good idea to avoid it, but
not at all costs.
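
A rough sketch of the pin fast path that falls out of that scheme (the flag
bit, the refcount layout and the function are invented for illustration; only
the pg_atomic_* names follow the proposed API):

#define BUF_FLAG_VALID   (1U << 31)   /* hypothetical "identity is valid" bit */
#define BUF_REFCOUNT_ONE 1            /* pin count lives in the low bits */

static bool
PinBufferFastpath(pg_atomic_uint32 *buf_state)
{
    /* the old value -- flag bits included -- comes back atomically */
    uint32 old_state = pg_atomic_fetch_add_u32(buf_state, BUF_REFCOUNT_ONE);

    if (old_state & BUF_FLAG_VALID)
        return true;              /* pinned without touching the spinlock */

    /*
     * Not (or no longer) valid: undo the pin and let the caller fall back
     * to the slow path, which acquires the buffer header spinlock.
     */
    pg_atomic_fetch_sub_u32(buf_state, BUF_REFCOUNT_ONE);
    return false;
}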

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#57Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#56)
Re: better atomics - v0.5

On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-27 13:15:25 -0400, Robert Haas wrote:

On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.

If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock? That seems hard to reason about, and unnecessary.
Critical sections for spinlocks should be short and contain only the
instructions that need protection, and clearly the atomic op is not
one of those.

Imagine the situation for the buffer header spinlock which is one of the
bigger performance issues atm. We could aim to replace all usages of
that with clever and complicated logic, but it's hard.

The IMO reasonable (and prototyped) way to do it is to make the common
paths lockless, but fall back to the spinlock for the more complicated
situations. For the buffer header that means that pin/unpin and buffer
lookup are lockless, but IO and changing the identity of a buffer still
require the spinlock. My attempts to avoid the latter basically required
a buffer header specific reimplementation of spinlocks.

To make pin/unpin et al safe without acquiring the spinlock the pincount
and the flags had to be moved into one variable so that when
incrementing the pincount you automatically get the flagbits back. If
it's currently valid all is good, otherwise the spinlock needs to be
acquired. But for that to work you need to set flags while the spinlock
is held.

Possibly you can come up with ways to avoid that, but I am pretty damn
sure that the code will be less robust by forbidding atomics inside a
critical section, not the contrary. It's a good idea to avoid it, but
not at all costs.

You might be right, but I'm not convinced. Does the lwlock
scalability patch work this way, too? I was hoping to look at that
RSN, so if there are roughly the same issues there it might shed some
light on this point also.

What I'm basically afraid of is that this will work fine in many cases
but have really ugly failure modes. That's pretty much what happens
with spinlocks already - the overhead is insignificant at low levels
of contention but as the spinlock starts to become contended the CPUs
all fight over the cache line and performance goes to pot. ISTM that
making the spinlock critical section significantly longer by putting
atomic ops in there could appear to win in some cases while being very
bad in others.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#58Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#36)
Re: better atomics - v0.5

On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

Attached is a new version of the atomic operations patch. Lots has
changed since the last post:

* gcc, msvc work. acc, xlc, sunpro have blindly written support which
should be relatively easy to fix up. All try to implement TAS, 32 bit
atomic add, 32 bit compare exchange; some do it for 64bit as well.

I have reviewed msvc part of patch and below are my findings:

1.
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+ uint32 *expected, uint32 newval)
+{
+ bool ret;
+ uint32 current;
+ current = InterlockedCompareExchange(&ptr->value, newval, *expected);

Is there a reason why using InterlockedCompareExchange() is better
than using the InterlockedCompareExchangeAcquire() variant?
The *Acquire and *Release variants provide acquire/release memory ordering
semantics on processors which support them; otherwise they are defined
as InterlockedCompareExchange.

2.
+pg_atomic_compare_exchange_u32_impl()
{
..
+ *expected = current;
}

Can we implement this API without changing expected value?
I mean if the InterlockedCompareExchange() succeeds, then it will
store the newval in ptr->value, else *ret* will be false.
I am not sure if there is anything implementation specific in changing
*expected*.

I think this value is required for lwlock patch, but I am wondering why
can't the same be achieved if we just return the *current* value and
then let lwlock patch do the handling based on it. This will have another
advantage that our pg_* API will also have similar signature as native
API's.

3. Is there an overhead of using Add, instead of increment/decrement
version of Interlocked?

I could not find any evidence clearly indicating that one is better
than the other, except a recommendation in the link below which suggests
using Increment/Decrement to *maintain a reference count*.

Refer to *The Locking Cookbook* in the link below:
http://blogs.technet.com/b/winserverperformance/archive/2008/05/21/designing-applications-for-high-performance-part-ii.aspx

However, when I checked the instructions each function generates,
I did not find anything suggesting that the *Add variant can be slow.

Refer to the instructions for both functions:

cPublicProc _InterlockedIncrement,1
cPublicFpo 1,0
        mov     ecx, Addend             ; get pointer to addend variable
        mov     eax, 1                  ; set increment value
Lock1:
   lock xadd    [ecx], eax              ; interlocked increment
        inc     eax                     ; adjust return value
        stdRET  _InterlockedIncrement   ;

stdENDP _InterlockedIncrement

cPublicProc _InterlockedExchangeAdd, 2
cPublicFpo 2,0

        mov     ecx, [esp + 4]          ; get addend address
        mov     eax, [esp + 8]          ; get increment value
Lock4:
   lock xadd    [ecx], eax              ; exchange add

        stdRET  _InterlockedExchangeAdd

stdENDP _InterlockedExchangeAdd

For details, refer to this link:
http://read.pudn.com/downloads3/sourcecode/windows/248345/win2k/private/windows/base/client/i386/critsect.asm__.htm

I don't think there is any need to change the current implementation;
I just wanted to share my analysis, in case anyone else sees a reason
to prefer one way of implementation over the other.

4.
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier_impl() _ReadWriteBarrier()

#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

There is a Caution notice on the Microsoft site indicating that
_ReadWriteBarrier/MemoryBarrier are deprecated.

Please refer below link:
http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.110).aspx

When I checked previous versions of MSVC, I noticed that these are
supported on x86, IPF, and x64 up to VS2008, and only on x86 and x64
from VS2012 onwards.

Link for VS2010 support:
http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.100).aspx

5.
+ * atomics-generic-msvc.h
..
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group

Shouldn't the copyright notice be 2014?

6.
pg_atomic_compare_exchange_u32()

It is better to have comments above this and all other related functions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#59Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#58)
Re: better atomics - v0.5

On 2014-06-28 11:58:41 +0530, Amit Kapila wrote:

On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

Attached is a new version of the atomic operations patch. Lots has
changed since the last post:

* gcc, msvc work. acc, xlc, sunpro have blindly written support which
should be relatively easy to fix up. All try to implement TAS, 32 bit
atomic add, 32 bit compare exchange; some do it for 64bit as well.

I have reviewed msvc part of patch and below are my findings:

Thanks for looking at it - I pretty much wrote that part blindly.

1.
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+ uint32 *expected, uint32 newval)
+{
+ bool ret;
+ uint32 current;
+ current = InterlockedCompareExchange(&ptr->value, newval, *expected);

Is there a reason why using InterlockedCompareExchange() is better
than using the InterlockedCompareExchangeAcquire() variant?

Two things:
a) compare_exchange is a read/write operator and so far I've defined it
to have full barrier semantics (documented in atomics.h). I'd be
happy to discuss which semantics we want for which operation.
b) It's only supported from vista onwards. Afaik we still support XP.
c) It doesn't make any difference on x86 ;)

The *Acquire and *Release variants provide acquire/release memory ordering
semantics on processors which support them; otherwise they are defined
as InterlockedCompareExchange.

2.
+pg_atomic_compare_exchange_u32_impl()
{
..
+ *expected = current;
}

Can we implement this API without changing expected value?
I mean if the InterlockedCompareExchange() succeeds, then it will
store the newval in ptr->value, else *ret* will be false.
I am not sure if there is anything implementation specific in changing
*expected*.
I think this value is required for lwlock patch, but I am wondering why
can't the same be achieved if we just return the *current* value and
then let lwlock patch do the handling based on it. This will have another
advantage that our pg_* API will also have similar signature as native
API's.

Many usages of compare/exchange require that you'll get the old value
back in an atomic fashion. Unfortunately the Interlocked* API doesn't
provide a nice way to do that. Note that this definition of
compare/exchange both maps nicely to C11's atomics and the actual x86
cmpxchg instruction.

I've generally tried to mimick C11's API as far as it's
possible. There's some limitations because we can't do generic types and
such, but other than that.
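
For illustration, the kind of retry loop this signature is aimed at (the
variable and the update are made up; pg_atomic_read_u32 stands in for whatever
the plain unlocked read ends up being called):

static void
set_high_bit(volatile pg_atomic_uint32 *v)
{
    uint32 expected = pg_atomic_read_u32(v);

    /*
     * On failure, *expected is refreshed with the value actually found, so
     * the loop never needs a separate re-read -- that's the
     * old-value-back-atomically behaviour described above.
     */
    while (!pg_atomic_compare_exchange_u32(v, &expected, expected | (1U << 31)))
        ;
}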

3. Is there an overhead of using Add, instead of increment/decrement
version of Interlocked?

There's a theoretical difference for IA64 which has a special
increment/decrement operation which can only change the value by a
rather limited number of values. I don't think it's worth caring about.

4.
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier_impl() _ReadWriteBarrier()

#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

There is a Caution notice on the Microsoft site indicating that
_ReadWriteBarrier/MemoryBarrier are deprecated.

It seemed to be the most widely available API, and it's what barrier.h
already used.
Do you have a different suggestion?

6.
pg_atomic_compare_exchange_u32()

It is better to have comments above this and all other related functions.

Check atomics.h, there's comments above it:
/*
* pg_atomic_compare_exchange_u32 - CAS operation
*
* Atomically compare the current value of ptr with *expected and store newval
* iff ptr and *expected have the same value. The current value of *ptr will
* always be stored in *expected.
*
* Return whether the values have been exchanged.
*
* Full barrier semantics.
*/

Thanks,

Andres Freund


#60Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#57)
Re: better atomics - v0.5

On 2014-06-27 22:47:02 -0400, Robert Haas wrote:

On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-27 13:15:25 -0400, Robert Haas wrote:

On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.

If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock? That seems hard to reason about, and unnecessary.
Critical sections for spinlocks should be short and contain only the
instructions that need protection, and clearly the atomic op is not
one of those.

Imagine the situation for the buffer header spinlock which is one of the
bigger performance issues atm. We could aim to replace all usages of
that with clever and complicated logic, but it's hard.

The IMO reasonable (and prototyped) way to do it is to make the common
paths lockless, but fall back to the spinlock for the more complicated
situations. For the buffer header that means that pin/unpin and buffer
lookup are lockless, but IO and changing the identity of a buffer still
require the spinlock. My attempts to avoid the latter basically required
a buffer header specific reimplementation of spinlocks.

To make pin/unpin et al safe without acquiring the spinlock the pincount
and the flags had to be moved into one variable so that when
incrementing the pincount you automatically get the flagbits back. If
it's currently valid all is good, otherwise the spinlock needs to be
acquired. But for that to work you need to set flags while the spinlock
is held.

Possibly you can come up with ways to avoid that, but I am pretty damn
sure that the code will be less robust by forbidding atomics inside a
critical section, not the contrary. It's a good idea to avoid it, but
not at all costs.

You might be right, but I'm not convinced.

Why? Anything I can do to convince you of this? Note that other users of
atomics (e.g. the linux kernel) widely use atomics inside spinlock
protected regions.

Does the lwlock scalability patch work this way, too?

No. I've moved one add/sub inside a critical section in the current
version while debugging, but it's not required at all. I generally think
it makes sense to document the suggestion of moving atomics outside, but
not make it a requirement.

What I'm basically afraid of is that this will work fine in many cases
but have really ugly failure modes. That's pretty much what happens
with spinlocks already - the overhead is insignificant at low levels
of contention but as the spinlock starts to become contended the CPUs
all fight over the cache line and performance goes to pot. ISTM that
making the spinlock critical section significantly longer by putting
atomic ops in there could appear to win in some cases while being very
bad in others.

Well, I'm not saying it's something I suggest doing all the time. But if
using an atomic op in the slow path allows you to remove the spinlock
from 99% of the cases I don't see it having a significant impact.
In most scenarios (where atomics aren't emulated, i.e. platforms we
expect to be used in heavily concurrent cases) the spinlock and the atomic
will be on the same cacheline, making stalls much less likely.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#61Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Robert Haas (#55)
Re: better atomics - v0.5

On 06/27/2014 08:15 PM, Robert Haas wrote:

On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.

If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock?

With the WALInsertLock scaling stuff in 9.4, there are now two variables
protected by a spinlock: the current WAL insert location, and the prev
pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you
need to grab the spinlock to update both of them atomically. But to just
read the WAL insert pointer, you could do an atomic read of CurrBytePos
if the architecture supports it - now you have to grab the spinlock.

Admittedly that's just an atomic read, not an atomic compare and
exchange or fetch-and-add. Or if the architecture has an atomic 128-bit
compare & exchange op you could replace the spinlock with that. But it's
not hard to imagine similar situations where you sometimes need to lock
a larger data structure to modify it atomically, but sometimes you just
need to modify part of it and an atomic op would suffice.
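
To make that concrete, a sketch of the split being described, assuming
CurrBytePos were turned into a 64-bit atomic (the pg_atomic_*_u64 names are
hypothetical u64 variants of the proposed API; the struct and spinlock names
are merely meant to resemble xlog.c):

/* reader: only needs the insert position, no spinlock required */
static uint64
GetInsertBytePos(XLogCtlInsert *Insert)
{
    return pg_atomic_read_u64(&Insert->CurrBytePos);
}

/* writer: CurrBytePos and PrevBytePos must advance together, so it
 * still takes insertpos_lck */
static void
ReserveInsertBytes(XLogCtlInsert *Insert, uint64 size,
                   uint64 *startpos, uint64 *prevpos)
{
    SpinLockAcquire(&Insert->insertpos_lck);
    *startpos = pg_atomic_read_u64(&Insert->CurrBytePos);
    *prevpos = Insert->PrevBytePos;
    pg_atomic_write_u64(&Insert->CurrBytePos, *startpos + size);
    Insert->PrevBytePos = *startpos;
    SpinLockRelease(&Insert->insertpos_lck);
}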

I thought Andres' LWLock patch also did something like that. If the lock
is not contended, you can acquire it with an atomic compare & exchange
to increment the exclusive/shared counter. But to manipulate the wait
queue, you need the spinlock.

- Heikki


#62Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#59)
Re: better atomics - v0.5

On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-06-28 11:58:41 +0530, Amit Kapila wrote:

1.
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+ uint32 *expected, uint32 newval)
+{
+ bool ret;
+ uint32 current;
+ current = InterlockedCompareExchange(&ptr->value, newval, *expected);

Is there a reason why using InterlockedCompareExchange() is better
than using the InterlockedCompareExchangeAcquire() variant?

Two things:
a) compare_exchange is a read/write operator and so far I've defined it
to have full barrier semantics (documented in atomics.h). I'd be
happy to discuss which semantics we want for which operation.

I think it is better to have some discussion about it. Apart from this,
today I noticed atomic read/write doesn't use any barrier semantics
and just rely on volatile. I think it might not be sane in all cases,
example with Visual Studio 2003, volatile to volatile references are
ordered; the compiler will not re-order volatile variable access. However,
these operations could be re-ordered by the processor.
For detailed example, refer below link:

http://msdn.microsoft.com/en-us/library/windows/desktop/ms686355(v=vs.85).aspx

Also, if we look at the C11 spec, both load and store use
memory_order_seq_cst by default, which is the same as for compare and
exchange. I have referred to the link below:
http://en.cppreference.com/w/c/atomic/atomic_store

b) It's only supported from vista onwards. Afaik we still support XP.

#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

The MemoryBarrier() macro used here is also supported only from Vista
onwards, so I think reasoning similar to that for using MemoryBarrier()
can apply to this as well.

c) It doesn't make any difference on x86 ;)

What about processors like Itanium which care about acquire/release
memory barrier semantics?

2.
+pg_atomic_compare_exchange_u32_impl()
{
..
+ *expected = current;
}

Can we implement this API without changing expected value?

..

I think this value is required for lwlock patch, but I am wondering why
can't the same be achieved if we just return the *current* value and
then let lwlock patch do the handling based on it. This will have another
advantage that our pg_* API will also have similar signature as native
API's.

Many usages of compare/exchange require that you'll get the old value
back in an atomic fashion. Unfortunately the Interlocked* API doesn't
provide a nice way to do that.

Yes, but I think the same cannot be accomplished even by using
expected.

Note that this definition of
compare/exchange both maps nicely to C11's atomics and the actual x86
cmpxchg instruction.

I've generally tried to mimick C11's API as far as it's
possible. There's some limitations because we can't do generic types and
such, but other than that.

If I am reading C11's spec for this API correctly, then it says as below:
"Atomically compares the value pointed to by obj with the value pointed
to by expected, and if those are equal, replaces the former with desired
(performs read-modify-write operation). Otherwise, loads the actual value
pointed to by obj into *expected (performs load operation)."

So it essentially means that it loads the actual value into expected only if
the operation is not successful.

3. Is there an overhead of using Add, instead of increment/decrement
version of Interlocked?

There's a theoretical difference for IA64 which has a special
increment/decrement operation which can only change the value by a
rather limited number of values. I don't think it's worth caring about.

Okay.

4.

..

There is a Caution notice on the Microsoft site indicating that
_ReadWriteBarrier/MemoryBarrier are deprecated.

It seemed to be the most widely available API, and it's what barrier.h
already used.
Do you have a different suggestion?

I am trying to think based on suggestion given by Microsoft, but
not completely clear how to accomplish the same at this moment.

6.
pg_atomic_compare_exchange_u32()

It is better to have comments above this and all other related
functions.

Check atomics.h, there's comments above it:
/*
* pg_atomic_compare_exchange_u32 - CAS operation
*
* Atomically compare the current value of ptr with *expected and store newval
* iff ptr and *expected have the same value. The current value of *ptr will
* always be stored in *expected.
*
* Return whether the values have been exchanged.
*
* Full barrier semantics.
*/

Okay, generally code has such comments on top of the function
definition rather than with the declaration.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#63Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#62)
Re: better atomics - v0.5

On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:

On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>

Two things:
a) compare_exchange is a read/write operator and so far I've defined it
to have full barrier semantics (documented in atomics.h). I'd be
happy to discuss which semantics we want for which operation.

I think it is better to have some discussion about it. Apart from this,
today I noticed that atomic read/write doesn't use any barrier semantics
and just relies on volatile.

Yes, intentionally so. It's often important to get/set the current value
without the cost of emitting a memory barrier. It just has to be a
recent value and it actually has to come from memory. And that's actually
enforced by the current definition - I think?

b) It's only supported from vista onwards. Afaik we still support XP.

#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

The MemoryBarrier() macro used here is also supported only from Vista
onwards, so I think reasoning similar to that for using MemoryBarrier()
can apply to this as well.

I think that's odd because barrier.h has been using MemoryBarrier()
without a version test for a long time now. I guess everyone uses a new
enough Visual Studio. Those intrinsics aren't actually OS dependent, just
compiler dependent.

Otherwise we could just define it as:

FORCEINLINE
VOID
MemoryBarrier (
    VOID
    )
{
    LONG Barrier;
    __asm {
        xchg Barrier, eax
    }
}

c) It doesn't make any difference on x86 ;)

What about processors like Itanium which care about acquire/release
memory barrier semantics?

Well, it will still be correct? I don't think it makes much sense to
focus overly much on Itanium here at the price of making anything more
complicated for others.

I think this value is required for lwlock patch, but I am wondering why
can't the same be achieved if we just return the *current* value and
then let lwlock patch do the handling based on it. This will have

another

advantage that our pg_* API will also have similar signature as native
API's.

Many usages of compare/exchange require that you'll get the old value
back in an atomic fashion. Unfortunately the Interlocked* API doesn't
provide a nice way to do that.

Yes, but I think the same cannot be accomplished even by using
expected.

More complicatedly so, yes? I don't think we want those comparisons on
practically every callsite.

Note that this definition of
compare/exchange both maps nicely to C11's atomics and the actual x86
cmpxchg instruction.

I've generally tried to mimick C11's API as far as it's
possible. There's some limitations because we can't do generic types and
such, but other than that.

If I am reading C11's spec for this API correctly, then it says as below:
"Atomically compares the value pointed to by obj with the value pointed
to by expected, and if those are equal, replaces the former with desired
(performs read-modify-write operation). Otherwise, loads the actual value
pointed to by obj into *expected (performs load operation)."

So it essentially means that it loads the actual value into *expected only
if the operation is not successful.

Yes. But in the case it's successful it will already contain the right
value.
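
As an illustration of that convention, a minimal retry loop might look like
the sketch below; pg_atomic_read_u32() and the flag-setting use case are
assumptions here, only the compare-exchange signature comes from the patch:

static void
atomic_set_flags_sketch(volatile pg_atomic_uint32 *ptr, uint32 mask)
{
    uint32 old = pg_atomic_read_u32(ptr);   /* unlocked read, assumed API */

    /* a failed CAS already refreshed 'old', so no extra read is needed */
    while (!pg_atomic_compare_exchange_u32(ptr, &old, old | mask))
        ;
}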

4.

..

There is a Caution notice on the Microsoft site indicating
_ReadWriteBarrier/MemoryBarrier are deprecated.

It seemed to be the most widely available API, and it's what barrier.h
already used.
Do you have a different suggestion?

I am trying to think based on suggestion given by Microsoft, but
not completely clear how to accomplish the same at this moment.

Well, they refer to C11 stuff. But I think it'll be a while before it
makes sense to use a fallback based on that.

6.
pg_atomic_compare_exchange_u32()

It is better to have comments above this and all other related
functions.

Check atomics.h, there's comments above it:
/*
 * pg_atomic_compare_exchange_u32 - CAS operation
 *
 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.
 *
 * Return whether the values have been exchanged.
 *
 * Full barrier semantics.
 */

Okay, generally code has such comments on top of function
definition rather than with declaration.

I don't want to have it on the _impl functions because they're
duplicated for the individual platforms and will just become out of
date...

Greetings,

Andres Freund


#64Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#61)
Re: better atomics - v0.5

On 2014-06-28 19:36:39 +0300, Heikki Linnakangas wrote:

On 06/27/2014 08:15 PM, Robert Haas wrote:

On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.

If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock?

With the WALInsertLock scaling stuff in 9.4, there are now two variables
protected by a spinlock: the current WAL insert location, and the prev
pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you need
to grab the spinlock to update both of them atomically. But to just read the
WAL insert pointer, you could do an atomic read of CurrBytePos if the
architecture supports it - now you have to grab the spinlock.
Admittedly that's just an atomic read, not an atomic compare and exchange or
fetch-and-add.

There are several architectures where you can do atomic 8-byte reads, but
only by using cmpxchg or similar, *not* a plain read... So I think
it's actually a fair example.
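
To illustrate, on such a platform an atomic 8-byte read could be built from
compare-exchange alone; the u64 names below are assumed counterparts of the
u32 API, not code from the patch:

static uint64
atomic_read_u64_sketch(volatile pg_atomic_uint64 *ptr)
{
    uint64 expected = 0;

    /*
     * A CAS with newval == expected never changes memory; whether it
     * "succeeds" or not, expected ends up holding the current value.
     */
    (void) pg_atomic_compare_exchange_u64(ptr, &expected, expected);
    return expected;
}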

Or if the architecture has an atomic 128-bit compare &
exchange op you could replace the spinlock with that. But it's not hard to
imagine similar situations where you sometimes need to lock a larger data
structure to modify it atomically, but sometimes you just need to modify
part of it and an atomic op would suffice.

Yes.

I thought Andres' LWLock patch also did something like that. If the lock is
not contended, you can acquire it with an atomic compare & exchange to
increment the exclusive/shared counter. But to manipulate the wait queue,
you need the spinlock.

Right. It just happens that the algorithm requires none of the atomic
ops have to be under the spinlock... But I generally think that we'll
see more slow/fastpath like situations cropping up where we can't always
avoid the atomic op while holding the lock.

How about adding a paragraph that explains that it's better to avoid
using atomics while holding spinlocks because of the prolonged hold time,
especially due to the emulation and cache misses? But also saying it's ok
if the algorithm needs it and there is a sufficient benefit?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#65Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#63)
Re: better atomics - v0.5

On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:

On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>
I think it is better to have some discussion about it. Apart from this,
today I noticed atomic read/write doesn't use any barrier semantics
and just rely on volatile.

Yes, intentionally so. It's often important to get/set the current value
without the cost of emitting a memory barrier. It just has to be a
recent value and it actually has to come from memory.

I agree with you that we don't want to incur the cost of memory barrier
for get/set, however it should not be at the cost of correctness.

And that's actually
enforced by the current definition - I think?

Yeah, this is the only point which I am trying to ensure; that's why I sent
you a few links in the previous mail where I had some suspicion that
just doing get/set with volatile might lead to correctness issues in some
cases.

Another note I found today while reading more on it, which suggests
that my previous observation is right:

"Within a thread of execution, accesses (reads and writes) to all volatile
objects are guaranteed to not be reordered relative to each other, but this
order is not guaranteed to be observed by another thread, since volatile
access does not establish inter-thread synchronization."
http://en.cppreference.com/w/c/atomic/memory_order

b) It's only supported from vista onwards. Afaik we still support XP.

#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

The MemoryBarrier() macro used also supports only on vista onwards,
so I think for reasons similar to using MemoryBarrier() can apply for
this as well.

I think that's odd because barrier.h has been using MemoryBarrier()
without a version test for a long time now. I guess everyone uses a new
enough visual studio.

Yeah, or maybe the people who are not using a new enough version
don't have a use case that can hit a problem due to MemoryBarrier().

Those intrinsics aren't actually OS but just
compiler dependent.

Otherwise we could just define it as:

FORCEINLINE
VOID
MemoryBarrier (
    VOID
    )
{
    LONG Barrier;
    __asm {
        xchg Barrier, eax
    }
}

Yes, I think that will be better. How about defining it this way just for
the cases where the MemoryBarrier macro is not supported?

c) It doesn't make any difference on x86 ;)

What about processors like Itanium which care about acquire/release
memory barrier semantics?

Well, it still will be correct? I don't think it makes much sense to
focus overly much on itanium here with the price of making anything more
complicated for others.

Completely agreed that we should not add or change anything which
makes the patch complicated or can affect performance; however,
the reason for focusing on this is just to ensure that we don't break
any existing use case for Itanium w.r.t. performance or otherwise.

Another way is that we can ignore, or just give lower priority to,
verifying whether the patch has anything wrong for Itanium processors.

If I am reading C11's spec for this API correctly, then it says as below:

"Atomically compares the value pointed to by obj with the value pointed
to by expected, and if those are equal, replaces the former with desired
(performs read-modify-write operation). Otherwise, loads the actual value
pointed to by obj into *expected (performs load operation)."

So it essentially means that it loads the actual value into *expected only
if the operation is not successful.

Yes. But in the case it's successful it will already contain the right
value.

The point was just that we are not completely following the C11 atomic
specification, so it might be okay to tweak a bit and make things
simpler and save a LOAD instruction. However, as there is no correctness
issue here and you think that the current implementation is good and
can serve more cases, we can keep it as it is and move forward.

4.

..

There is a Caution notice on the Microsoft site indicating
_ReadWriteBarrier/MemoryBarrier are deprecated.

It seemed to be the most widely available API, and it's what barrier.h
already used.
Do you have a different suggestion?

I am trying to think based on suggestion given by Microsoft, but
not completely clear how to accomplish the same at this moment.

Well, they refer to C11 stuff. But I think it'll be a while before it
makes sense to use a fallback based on that.

In this case, I have a question for you.

Un-patched usage in barrier.h is as follows:
..
#elif defined(__ia64__) || defined(__ia64)
#define pg_compiler_barrier() _Asm_sched_fence()
#define pg_memory_barrier() _Asm_mf()

#elif defined(WIN32_ONLY_COMPILER)
/* Should work on both MSVC and Borland. */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier() _ReadWriteBarrier()
#define pg_memory_barrier() MemoryBarrier()
#endif

If I understand correctly the current define mechanism in barrier.h,
it will have different definition for Itanium processors even for windows.

However the patch defines as below:

#if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
# include "storage/atomics-generic-gcc.h"
#elif defined(WIN32_ONLY_COMPILER)
# include "storage/atomics-generic-msvc.h"

What I can understand from the above is that the defines in
storage/atomics-generic-msvc.h will override any previous defines
for compiler/memory barriers, and _ReadWriteBarrier()/MemoryBarrier()
will always be used for Windows.

6.
pg_atomic_compare_exchange_u32()

It is better to have comments above this and all other related

functions.
Okay, generally code has such comments on top of function
definition rather than with declaration.

I don't want to have it on the _impl functions because they're
duplicated for the individual platforms and will just become out of
date...

Sure, I was also not asking for _impl functions. What I was asking
in this point was to have comments on top of definition of
pg_atomic_compare_exchange_u32() in atomics.h
In particular, on top of below and similar functions, rather than
at the place where they are declared.

STATIC_IF_INLINE bool
pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
                               uint32 *expected, uint32 newval)
{
    CHECK_POINTER_ALIGNMENT(ptr, 4);
    CHECK_POINTER_ALIGNMENT(expected, 4);

    return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
}

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#66Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#65)
Re: better atomics - v0.5

On 2014-06-30 11:04:53 +0530, Amit Kapila wrote:

On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:

On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>
I think it is better to have some discussion about it. Apart from this,
today I noticed atomic read/write doesn't use any barrier semantics
and just rely on volatile.

Yes, intentionally so. It's often important to get/set the current value
without the cost of emitting a memory barrier. It just has to be a
recent value and it actually has to come from memory.

I agree with you that we don't want to incur the cost of memory barrier
for get/set, however it should not be at the cost of correctness.

I really can't follow here. A volatile read/store forces it to go to
memory without barriers. The ABIs of all platforms we work with
guarantee that 4-byte stores/reads are atomic - we've been relying on
that for a long while.
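
For illustration, the barrier-free accessors being discussed boil down to
something like the sketch below (the struct member name 'value' is an
assumption, not taken from the patch):

static inline uint32
atomic_read_u32_sketch(volatile pg_atomic_uint32 *ptr)
{
    /* volatile forces an actual load; 4-byte aligned loads are atomic */
    return *(volatile uint32 *) &ptr->value;
}

static inline void
atomic_write_u32_sketch(volatile pg_atomic_uint32 *ptr, uint32 val)
{
    /* plain store, no memory barrier is emitted */
    *(volatile uint32 *) &ptr->value = val;
}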

And that's actually
enforced by the current definition - I think?

Yeah, this is the only point which I am trying to ensure, thats why I sent
you few links in previous mail where I had got some suspicion that
just doing get/set with volatile might lead to correctness issue in some
cases.

But this isn't something this patch started doing - we've been doing
that for a long while?

Some another Note, I found today while reading more on it which suggests
that my previous observation is right:

"Within a thread of execution, accesses (reads and writes) to all�volatile
objects are guaranteed to not be reordered relative to each other, but this
order is not guaranteed to be observed by another thread, since volatile
access does not establish inter-thread synchronization."
http://en.cppreference.com/w/c/atomic/memory_order

That's just saying there's no ordering guarantees with volatile
stores/reads. I don't see a contradiction?

b) It's only supported from vista onwards. Afaik we still support XP.

#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

The MemoryBarrier() macro used also supports only on vista onwards,
so I think for reasons similar to using MemoryBarrier() can apply for
this as well.

I think that's odd because barrier.h has been using MemoryBarrier()
without a version test for a long time now. I guess everyone uses a new
enough visual studio.

Yeah, or maybe the people who are not using a new enough version
don't have a use case that can hit a problem due to MemoryBarrier().

Well, with barrier.h as it stands they'd get a compiler error if it
indeed is unsupported? But I think there's something else going on -
msft might just be removing XP from its documentation. Postgres supports
building with visual studio 2005 and MemoryBarrier() seems to be
supported by it.
I think we'd otherwise have seen problems since 9.1, when barrier.h was introduced.

In this case, I have a question for you.

Un-patched usage in barrier.h is as follows:
..
#elif defined(__ia64__) || defined(__ia64)
#define pg_compiler_barrier() _Asm_sched_fence()
#define pg_memory_barrier() _Asm_mf()

#elif defined(WIN32_ONLY_COMPILER)
/* Should work on both MSVC and Borland. */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier() _ReadWriteBarrier()
#define pg_memory_barrier() MemoryBarrier()
#endif

If I understand correctly the current define mechanism in barrier.h,
it will have different definition for Itanium processors even for windows.

Either no one has ever tested postgres on itanium windows (quite
possible), because afaik _Asm_sched_fence() doesn't work on
windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
are for HP's acc on HPUX.

However the patch defines as below:

#if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
# include "storage/atomics-generic-gcc.h"
#elif defined(WIN32_ONLY_COMPILER)
# include "storage/atomics-generic-msvc.h"

What I can understand from the above is that the defines in
storage/atomics-generic-msvc.h will override any previous defines
for compiler/memory barriers, and _ReadWriteBarrier()/MemoryBarrier()
will always be used for Windows.

Well, the memory barrier is surrounded by #ifndef
pg_memory_barrier_impl. The compiler barrier can't reasonably be defined
earlier since it's a compiler not an architecture thing.
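
Schematically (file names as in the patch, contents illustrative only) the
override mechanism looks like this:

/* atomics-generic-msvc.h: only fills in what no earlier header defined */
#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl()    MemoryBarrier()
#endif
#ifndef pg_compiler_barrier_impl
#define pg_compiler_barrier_impl()  _ReadWriteBarrier()
#endif

/* atomics.h: the public names map to whichever _impl definition won */
#define pg_memory_barrier()     pg_memory_barrier_impl()
#define pg_compiler_barrier()   pg_compiler_barrier_impl()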

6.
pg_atomic_compare_exchange_u32()

It is better to have comments above this and all other related

functions.
Okay, generally code has such comments on top of function
definition rather than with declaration.

I don't want to have it on the _impl functions because they're
duplicated for the individual platforms and will just become out of
date...

Sure, I was also not asking for _impl functions. What I was asking
in this point was to have comments on top of definition of
pg_atomic_compare_exchange_u32() in atomics.h
In particular, on top of below and similar functions, rather than
at the place where they are declared.

Hm, we can do that. Don't think it'll be clearer (because you need to go
to the header anyway), but I don't feel strongly.

I'd much rather get rid of the separated definition/declaration, but
we'd need to rely on PG_USE_INLINE for it...

Greetings,

Andres Freund


#67Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#60)
Re: better atomics - v0.5

On Sat, Jun 28, 2014 at 4:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:

To make pin/unpin et al safe without acquiring the spinlock the pincount
and the flags had to be moved into one variable so that when
incrementing the pincount you automatically get the flagbits back. If
it's currently valid all is good, otherwise the spinlock needs to be
acquired. But for that to work you need to set flags while the spinlock
is held.
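
A minimal sketch of that fastpath (the state layout, flag bit and the
pg_atomic_fetch_add_u32() name are hypothetical here, not the patch's code):

#define BUF_REFCOUNT_ONE    1
#define BUF_FLAG_VALID      (1U << 30)

static bool
pin_buffer_fastpath_sketch(volatile pg_atomic_uint32 *state)
{
    /* bump the pincount and get the flag bits back in one atomic op */
    uint32 old = pg_atomic_fetch_add_u32(state, BUF_REFCOUNT_ONE);

    if (old & BUF_FLAG_VALID)
        return true;        /* pinned without touching the spinlock */

    return false;           /* caller falls back to the spinlock path */
}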

Possibly you can come up with ways to avoid that, but I am pretty damn
sure that the code will be less robust by forbidding atomics inside a
critical section, not the contrary. It's a good idea to avoid it, but
not at all costs.

You might be right, but I'm not convinced.

Why? Anything I can do to convince you of this? Note that other users of
atomics (e.g. the linux kernel) widely use atomics inside spinlock
protected regions.

The Linux kernel isn't a great example, because they can ensure that
they aren't preempted while holding the spinlock. We can't, and
that's a principle part of my concern here.

More broadly, it doesn't seem consistent. I think that other projects
also sometimes write code that acquires a spinlock while holding
another spinlock, and we don't do that; in fact, we've elevated that
to a principle, regardless of whether it performs well in practice.
In some cases, the CPU instruction that we issue to acquire that
spinlock might be the exact same instruction we'd issue to manipulate
an atomic variable. I don't see how it can be right to say that a
spinlock-protected critical section is allowed to manipulate an atomic
variable with cmpxchg, but not allowed to acquire another spinlock
with cmpxchg. Either acquiring the second spinlock is OK, in which
case our current coding rule is wrong, or acquiring the atomic
variable is not OK, either. I don't see how we can have that both
ways.

What I'm basically afraid of is that this will work fine in many cases
but have really ugly failure modes. That's pretty much what happens
with spinlocks already - the overhead is insignificant at low levels
of contention but as the spinlock starts to become contended the CPUs
all fight over the cache line and performance goes to pot. ISTM that
making the spinlock critical section significantly longer by putting
atomic ops in there could appear to win in some cases while being very
bad in others.

Well, I'm not saying it's something I suggest doing all the time. But if
using an atomic op in the slow path allows you to remove the spinlock
from 99% of the cases I don't see it having a significant impact.
In most scenarios (where atomics aren't emulated, i.e. platforms we
expect to used in heavily concurrent cases) the spinlock and the atomic
will be on the same cacheline making stalls much less likely.

And this again is my point: why can't we make the same argument about
two spinlocks situated on the same cache line? I don't have a bit of
trouble believing that doing the same thing with a couple of spinlocks
could sometimes work out well, too, but Tom is adamantly opposed to
that. I don't want to just make an end run around that issue without
understanding whether he has a valid point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#68Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#67)
Re: better atomics - v0.5

On 2014-06-30 12:15:06 -0400, Robert Haas wrote:

More broadly, it doesn't seem consistent. I think that other projects
also sometimes write code that acquires a spinlock while holding
another spinlock, and we don't do that; in fact, we've elevated that
to a principle, regardless of whether it performs well in practice.

It's actually more than a principle: it's now required for correctness,
because otherwise the semaphore based spinlock implementation will
deadlock.

In some cases, the CPU instruction that we issue to acquire that
spinlock might be the exact same instruction we'd issue to manipulate
an atomic variable. I don't see how it can be right to say that a
spinlock-protected critical section is allowed to manipulate an atomic
variable with cmpxchg, but not allowed to acquire another spinlock
with cmpxchg. Either acquiring the second spinlock is OK, in which
case our current coding rule is wrong, or acquiring the atomic
variable is not OK, either. I don't see how we can have that both
ways.

The no nested spinlock thing used to be a guideline. One nobody had big
problems with because it didn't prohibit solutions to problems. Since
guidelines are more useful when simple it's "no nested
spinlocks!".

Also those two cmpxchg's aren't the same:

The CAS for spinlock acquisition will only succeed if the lock isn't
acquired. If the lock holder sleeps you'll have to wait for it to wake
up.

In contrast to that, CASs for lockfree (or lock-reduced) algorithms won't
be blocked if the last manipulation was done by a backend that's now
sleeping. It'll possibly loop once, get the new value, and retry the
CAS. Yes, it can still take a couple of iterations under heavy
concurrency, but you don't have the problem that the lockholder goes to
sleep.
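
Roughly, the two patterns look like this (a sketch; pg_atomic_read_u32 and
the surrounding names are assumptions, only the compare-exchange signature
comes from the patch):

static void
spinlock_acquire_sketch(volatile pg_atomic_uint32 *lockword)
{
    uint32 expected;

    do
    {
        expected = 0;   /* only succeeds while nobody holds the lock */
    } while (!pg_atomic_compare_exchange_u32(lockword, &expected, 1));
}

static void
lockfree_increment_sketch(volatile pg_atomic_uint32 *counter)
{
    uint32 cur = pg_atomic_read_u32(counter);

    /* cur is refreshed on failure; we never wait on a sleeping backend */
    while (!pg_atomic_compare_exchange_u32(counter, &cur, cur + 1))
        ;
}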

Now, you can argue that the spinlock based fallbacks make that
difference moot, but I think that's just the price for an emulation
fringe platforms will have to pay. They aren't platforms used under
heavy concurrency.

What I'm basically afraid of is that this will work fine in many cases
but have really ugly failure modes. That's pretty much what happens
with spinlocks already - the overhead is insignificant at low levels
of contention but as the spinlock starts to become contended the CPUs
all fight over the cache line and performance goes to pot. ISTM that
making the spinlock critical section significantly longer by putting
atomic ops in there could appear to win in some cases while being very
bad in others.

Well, I'm not saying it's something I suggest doing all the time. But if
using an atomic op in the slow path allows you to remove the spinlock
from 99% of the cases I don't see it having a significant impact.
In most scenarios (where atomics aren't emulated, i.e. platforms we
expect to used in heavily concurrent cases) the spinlock and the atomic
will be on the same cacheline making stalls much less likely.

And this again is my point: why can't we make the same argument about
two spinlocks situated on the same cache line?

Because it's not relevant? Where and why do we want to acquire two
spinlocks that are on the same cacheline?
As far as I know there's never been an actual need for nested
spinlocks. So the guideline can be as restrictive as is it is because
the price is low and there's no benefit in making it more complex.

I think this line of discussion is pretty backwards. The reason I (and
seemingly Heikki) want to use atomics is that it can reduce contention
significantly. Today that contention is to a good part based on
spinlocks. Your argument is basically that doing things inside a spinlock
that can take more than a fixed number of cycles can increase contention.
But the changes we've talked about only make sense
if they significantly reduce spinlock contention in the first place - the
potential for a couple of iterations while holding the spinlock had better
not be relevant for the proposed patches.

I don't think a blanket rule makes sense here.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#67)
Re: better atomics - v0.5

Robert Haas <robertmhaas@gmail.com> writes:

... this again is my point: why can't we make the same argument about
two spinlocks situated on the same cache line? I don't have a bit of
trouble believing that doing the same thing with a couple of spinlocks
could sometimes work out well, too, but Tom is adamantly opposed to
that.

I think you might be overstating my position here. What I'm concerned
about is that we be sure that spinlocks are held for a sufficiently short
time that it's very unlikely that we get pre-empted while holding one.
I don't have any particular bright line about how short a time that is,
but more than "a few instructions" worries me. As you say, the Linux
kernel is a bad example to follow because it hasn't got a risk of losing
its timeslice while holding a spinlock.

The existing coding rules discourage looping (though I might be okay
with a small constant loop count), and subroutine calls (mainly because
somebody might add $random_amount_of_work to the subroutine if they
don't realize it can be called while holding a spinlock). Both of these
rules are meant to reduce the risk that a short interval balloons
into a long one due to unrelated coding changes.

The existing coding rules also discourage spinlocking within a spinlock,
and the reason for that is that there's no very clear upper bound to the
time required to obtain a spinlock, so that there would also be no clear
upper bound to the time you're holding the original one (thus possibly
leading to cascading performance problems).

So ISTM the question we ought to be asking is whether atomic operations
have bounded execution time, or more generally what the distribution of
execution times is likely to be. I'd be OK with an answer that includes
"sometimes it can be long" so long as "sometimes" is "pretty damn seldom".
Spinlocks have a nonzero risk of taking a long time already, since we
can't entirely prevent the possibility of losing our timeslice while
holding one. The issue here is just to be sure that that happens seldom
enough that it doesn't cause performance problems. If we fail to do that
we might negate all the desired performance improvements from adopting
atomic ops in the first place.

regards, tom lane


#70Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#69)
Re: better atomics - v0.5

On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The existing coding rules also discourage spinlocking within a spinlock,
and the reason for that is that there's no very clear upper bound to the
time required to obtain a spinlock, so that there would also be no clear
upper bound to the time you're holding the original one (thus possibly
leading to cascading performance problems).

Well, as Andres points out, commit
daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad
requirement than a suggestion. If it's sometimes OK to acquire a
spinlock from within a spinlock, then that commit was badly misguided;
but the design of that patch was pretty much driven by your insistence
that that was never, ever OK. If that was wrong, then we should
reconsider that whole commit more broadly; but if it was right, then
the atomics patch shouldn't drill a hole through it to install an
exception just for emulating atomics.

I'm personally not convinced that we're approaching this topic in the
right way. I'm not convinced that it's at all reasonable to try to
emulate atomics on platforms that don't have them. I would punt the
problem into the next layer and force things like lwlock.c to have
fallback implementations at that level that don't require atomics, and
remove those fallback implementations if and when we move the
goalposts so that all supported platforms must have working atomics
implementations. People who write code that uses atomics are not
likely to think about how those algorithms will actually perform when
those atomics are merely emulated, and I suspect that means that in
practice platforms that have only emulated atomics are going to
regress significantly vs. the status quo today. If this patch goes in
the way it's written right now, then I think it basically amounts to
throwing platforms that don't have working atomics emulation under the
bus, and my vote is that if we're going to do that, we should do it
explicitly: sorry, we now only support platforms with working atomics.
You don't have any, so you can't run PostgreSQL. Have a nice day.

If we *don't* want to make that decision, then I'm not convinced that
the approach this patch takes is in any way viable, completely aside
from this particular question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#71Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#70)
Re: better atomics - v0.5

On 2014-06-30 13:15:23 -0400, Robert Haas wrote:

On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The existing coding rules also discourage spinlocking within a spinlock,
and the reason for that is that there's no very clear upper bound to the
time required to obtain a spinlock, so that there would also be no clear
upper bound to the time you're holding the original one (thus possibly
leading to cascading performance problems).

Well, as Andres points out, commit
daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad
requirement than a suggestion. If it's sometimes OK to acquire a
spinlock from within a spinlock, then that commit was badly misguided;
but the design of that patch was pretty much driven by your insistence
that that was never, ever OK. If that was wrong, then we should
reconsider that whole commit more broadly; but if it was right, then
the atomics patch shouldn't drill a hole through it to install an
exception just for emulating atomics.

Well, since nobody has a problem with the rule of not having nested
spinlocks and since the problems aren't the same I'm failing to see why
we should reconsider this.

My recollection was that we primarily discussed whether it'd be a good
idea to add code to Assert() when spinlocks are nested, to which Tom
responded that it's not necessary because critical sections should be
short enough to immediately see that that's not a problem. Then we
argued a bit back and forth around that.

I'm personally not convinced that we're approaching this topic in the
right way. I'm not convinced that it's at all reasonable to try to
emulate atomics on platforms that don't have them. I would punt the
problem into the next layer and force things like lwlock.c to have
fallback implementations at that level that don't require atomics, and
remove those fallback implementations if and when we move the
goalposts so that all supported platforms must have working atomics
implementations.

If we're requiring an atomics-less implementation for lwlock.c,
bufmgr. et al. I'll stop working on those features (and by extension on
atomics). It's an utter waste of resources and maintainability trying to
maintain parallel high level locking algorithms in several pieces of the
codebase. The nonatomic versions will stagnate and won't actually work
under load. I'd rather spend my time on something useful. The spinlock
based atomics emulation is *small*. It's easy to reason about
correctness there.
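
For reference, the emulation is conceptually just this (a sketch, not the
patch's code; slock_t and SpinLockAcquire/Release are the existing spinlock
primitives, the struct and function names are made up):

typedef struct pg_atomic_uint32_emulated
{
    slock_t         sema;       /* protects 'value' in the fallback */
    volatile uint32 value;
} pg_atomic_uint32_emulated;

static uint32
fetch_add_emulated_sketch(pg_atomic_uint32_emulated *ptr, int32 add)
{
    uint32 old;

    SpinLockAcquire(&ptr->sema);
    old = ptr->value;
    ptr->value += (uint32) add;
    SpinLockRelease(&ptr->sema);

    return old;
}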

If we're going that way, we've made a couple of absolute fringe
platforms hold postgres, as actually used in reality, hostage. I think
that's insane.

People who write code that uses atomics are not
likely to think about how those algorithms will actually perform when
those atomics are merely emulated, and I suspect that means that in
practice platforms that have only emulated atomics are going to
regress significantly vs. the status quo today.

I think you're overstating the likely performance penalty for
nonparallel platforms/workloads here quite a bit. The platforms without
chances of getting atomics implemented aren't ones where that's a
likely thing. Yes, you'll see a regression if you run a readonly pgbench
over a 4 node NUMA system - but it's not large. You won't see much of
improved performance in comparison to 9.4, but I think that's primarily
it.

Also it's not like this is something new - the semaphore based fallback
*sucks* but we still have it.

*Many* features in new major versions regress performance for fringe
platforms. That's normal.

If this patch goes in
the way it's written right now, then I think it basically amounts to
throwing platforms that don't have working atomics emulation under the
bus

Why? Nobody is going to run 9.5 under high performance workloads on such
platforms. They'll be so IO and RAM starved that the spinlocks won't be
the bottleneck.

, and my vote is that if we're going to do that, we should do it
explicitly: sorry, we now only support platforms with working atomics.
You don't have any, so you can't run PostgreSQL. Have a nice day.

I'd be fine with that. I don't think we're gaining much by those
platforms.
*But* I don't really see why it's required at this point. The fallback
works and unless you're using semaphore based spinlocks the performance
isn't bad.
The lwlock code doesn't really get slower/faster in comparison to before
when the atomics implementation is backed by semaphores. It's pretty
much the same number of syscalls using the atomics/current implementation.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#70)
Re: better atomics - v0.5

Robert Haas <robertmhaas@gmail.com> writes:

I'm personally not convinced that we're approaching this topic in the
right way. I'm not convinced that it's at all reasonable to try to
emulate atomics on platforms that don't have them. I would punt the
problem into the next layer and force things like lwlock.c to have
fallback implementations at that level that don't require atomics, and
remove those fallback implementations if and when we move the
goalposts so that all supported platforms must have working atomics
implementations. People who write code that uses atomics are not
likely to think about how those algorithms will actually perform when
those atomics are merely emulated, and I suspect that means that in
practice platforms that have only emulated atomics are going to
regress significantly vs. the status quo today.

I think this is a valid objection, and I for one am not prepared to
say that we no longer care about platforms that don't have atomic ops
(especially not if it's not a *very small* set of atomic ops).

Also, just because a platform claims to have atomic ops doesn't mean that
those ops perform well. If there's a kernel trap involved, they don't,
at least not for our purposes. We're only going to be bothering with
installing atomic-op code in places that are contention bottlenecks
for us already, so we are not going to be happy with the results for any
atomic-op implementation that's not industrial strength. This is one
reason why I'm extremely suspicious of depending on gcc's intrinsics for
this; that will not make the issue go away, only make it beyond our
power to control.

regards, tom lane


#73Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#69)
Re: better atomics - v0.5

On 2014-06-30 12:54:10 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

... this again is my point: why can't we make the same argument about
two spinlocks situated on the same cache line? I don't have a bit of
trouble believing that doing the same thing with a couple of spinlocks
could sometimes work out well, too, but Tom is adamantly opposed to
that.

I think you might be overstating my position here. What I'm concerned
about is that we be sure that spinlocks are held for a sufficiently short
time that it's very unlikely that we get pre-empted while holding one.
I don't have any particular bright line about how short a time that is,
but more than "a few instructions" worries me. As you say, the Linux
kernel is a bad example to follow because it hasn't got a risk of losing
its timeslice while holding a spinlock.

Well, they actually have preemptible and irq-interruptible versions for
some of these locks, but whatever :).

So ISTM the question we ought to be asking is whether atomic operations
have bounded execution time, or more generally what the distribution of
execution times is likely to be. I'd be OK with an answer that includes
"sometimes it can be long" so long as "sometimes" is "pretty damn
seldom".

I think it ranges from 'never' to 'sometimes, but pretty darn
seldom'. Depending on the platform and emulation.

And, leaving the emulation and badly written code aside, there's no
issue with another backend sleeping while the atomic variable needs to
be modified.

Spinlocks have a nonzero risk of taking a long time already, since we
can't entirely prevent the possibility of losing our timeslice while
holding one. The issue here is just to be sure that that happens seldom
enough that it doesn't cause performance problems. If we fail to do that
we might negate all the desired performance improvements from adopting
atomic ops in the first place.

I think individual patches will have to stand up to a test like
this. Modifying an atomic variable inside a spinlock is only going to be
worth it if it allows avoiding spinlocks altogether 95%+ of the
time.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#74Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#71)
Re: better atomics - v0.5

On 2014-06-30 19:44:47 +0200, Andres Freund wrote:

On 2014-06-30 13:15:23 -0400, Robert Haas wrote:

People who write code that uses atomics are not
likely to think about how those algorithms will actually perform when
those atomics are merely emulated, and I suspect that means that in
practice platforms that have only emulated atomics are going to
regress significantly vs. the status quo today.

I think you're overstating the likely performance penalty for
nonparallel platforms/workloads here quite a bit. The platforms without
changes of gettings atomics implemented aren't ones where that's a
likely thing. Yes, you'll see a regression if you run a readonly pgbench
over a 4 node NUMA system - but it's not large. You won't see much of
improved performance in comparison to 9.4, but I think that's primarily
it.

To quantify this, on my 2 socket xeon E5520 workstation - which is too
small to heavily show the contention problems in pgbench -S - the
numbers are:

pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeable variability)
master: 152354.294117
lwlocks-atomics: 170528.784958
lwlocks-atomics-spin: 159416.202578

pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeable variability)
master: 16273.702191
lwlocks-atomics: 16768.712433
lwlocks-atomics-spin: 16744.913083

So, there really isn't a problem with the emulation. It's not actually
that surprising - the absolute number of atomic ops is pretty much the
same. Where we earlier took a spinlock to increment shared we now still
take one.

I expect that repeating this on the 4 socket machine will show a large
gap between lwlocks-atomics and the other two. The other two will be
about the same, with lwlocks-atomics-spin remaining a bit better than
master.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#75Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#72)
Re: better atomics - v0.5

On 2014-06-30 13:45:52 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I'm personally not convinced that we're approaching this topic in the
right way. I'm not convinced that it's at all reasonable to try to
emulate atomics on platforms that don't have them. I would punt the
problem into the next layer and force things like lwlock.c to have
fallback implementations at that level that don't require atomics, and
remove those fallback implementations if and when we move the
goalposts so that all supported platforms must have working atomics
implementations. People who write code that uses atomics are not
likely to think about how those algorithms will actually perform when
those atomics are merely emulated, and I suspect that means that in
practice platforms that have only emulated atomics are going to
regress significantly vs. the status quo today.

I think this is a valid objection, and I for one am not prepared to
say that we no longer care about platforms that don't have atomic ops
(especially not if it's not a *very small* set of atomic ops).

It's still TAS(), cmpxchg(), xadd. With TAS/xadd possibly implemented
via cmpxchg (which quite possibly *is* the native implementation for
$arch).
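
E.g. a generic xadd fallback on top of cmpxchg is just a few lines (a
sketch, not the patch's code; the guard macro and the _read/_impl names
are assumptions):

#ifndef PG_HAVE_ATOMIC_FETCH_ADD_U32
static inline uint32
pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add)
{
    uint32 old = pg_atomic_read_u32_impl(ptr);

    /* a failed CAS refreshes 'old' with the current value */
    while (!pg_atomic_compare_exchange_u32_impl(ptr, &old, old + add))
        ;
    return old;
}
#endif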

Also, just because a platform claims to have atomic ops doesn't mean that
those ops perform well. If there's a kernel trap involved, they don't,
at least not for our purposes.

Yea. I think if we start to use 8byte atomics at some later point we're
going to have to prohibit e.g. older arms from using them, even if the
compiler claims they're supported. But that's just one #define.

This is one
reason why I'm extremely suspicious of depending on gcc's intrinsics for
this; that will not make the issue go away, only make it beyond our
power to control.

The patch as submitted makes it easy to provide a platform specific
implementation of the important routines if it's likely that it's better
than the intrinsics provided one.

I'm not too worried about the intrinsics though - the number of projects
relying on them these days is quite large.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#76Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#71)
Re: better atomics - v0.5

On Mon, Jun 30, 2014 at 07:44:47PM +0200, Andres Freund wrote:

On 2014-06-30 13:15:23 -0400, Robert Haas wrote:

I'm personally not convinced that we're approaching this topic in the
right way. I'm not convinced that it's at all reasonable to try to
emulate atomics on platforms that don't have them. I would punt the
problem into the next layer and force things like lwlock.c to have
fallback implementations at that level that don't require atomics, and
remove those fallback implementations if and when we move the
goalposts so that all supported platforms must have working atomics
implementations.

If we're requiring an atomics-less implementation for lwlock.c,
bufmgr. et al. I'll stop working on those features (and by extension on
atomics). It's an utter waste of resources and maintainability trying to
maintain parallel high level locking algorithms in several pieces of the
codebase. The nonatomic versions will stagnate and won't actually work
under load. I'd rather spend my time on something useful. The spinlock
based atomics emulation is *small*. It's easy to reason about
correctness there.

If we're going that way, we've made a couple of absolute fringe
platforms hold postgres, as actually used in reality, hostage. I think
that's insane.

I agree completely.

If this patch goes in
the way it's written right now, then I think it basically amounts to
throwing platforms that don't have working atomics emulation under the
bus

, and my vote is that if we're going to do that, we should do it
explicitly: sorry, we now only support platforms with working atomics.
You don't have any, so you can't run PostgreSQL. Have a nice day.

I might share that view if a platform faced a huge and easily-seen performance
regression, say a 40% regression to 5-client performance. I don't expect the
shortfall to reach that ballpark for any atomics-exploiting algorithm we'll
adopt, and your (Andres's) latest measurements corroborate that.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


#77Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#66)
Re: better atomics - v0.5

On Mon, Jun 30, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-06-30 11:04:53 +0530, Amit Kapila wrote:

On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Yes, intentionally so. It's often important to get/set the current value
without the cost of emitting a memory barrier. It just has to be a
recent value and it actually has to come from memory.

I agree with you that we don't want to incur the cost of memory barrier
for get/set, however it should not be at the cost of correctness.

I really can't follow here. A volatile read/store forces it to go to
memory without barriers. The ABIs of all platforms we work with
guarantee that 4bytes stores/reads are atomic - we've been relying on
that for a long while.

I think for such usage we need to rely on barriers wherever there is a
need for synchronisation; a recent example I have noticed is in your patch
where we have to use pg_write_barrier() during wakeup. However, if we
go by the atomic ops definition, then no such dependency is required: if
somewhere we need to use pg_atomic_compare_exchange_u32(),
after that we don't need to worry about barriers, because this atomic
API adheres to full barrier semantics.

I think defining barrier support for some atomic ops and not for others
will lead to inconsistency with specs and we need to additionally
define rules for such atomic ops usage.

I think we can still rely on volatile read/store for the ordering guarantee
within the same thread of execution, and wherever we need synchronisation
among different processes, we can directly use atomic ops
(get/set) without the need to use barriers.
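
For example, a store that needs to be ordered can pair the barrier-free
write with an explicit barrier (a sketch; pg_atomic_write_u32,
pg_atomic_read_u32 and pg_read_barrier are assumed names, only
pg_write_barrier appears in the patch discussion above):

static void
publish_value_sketch(volatile pg_atomic_uint32 *slot,
                     volatile pg_atomic_uint32 *ready, uint32 value)
{
    pg_atomic_write_u32(slot, value);   /* plain, barrier-free store */
    pg_write_barrier();                 /* order the store before the flag */
    pg_atomic_write_u32(ready, 1);
}

static uint32
consume_value_sketch(volatile pg_atomic_uint32 *slot,
                     volatile pg_atomic_uint32 *ready)
{
    while (pg_atomic_read_u32(ready) == 0)
        ;                               /* spin until published */
    pg_read_barrier();                  /* order the flag before the load */
    return pg_atomic_read_u32(slot);
}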

b) It's only supported from vista onwards. Afaik we still support XP.

Well, with barrier.h as it stands they'd get a compiler error if it
indeed is unsupported? But I think there's something else going on -
msft might just be removing XP from its documentation.

Okay, but why would they remove it for MemoryBarrier() and not
for InterlockedCompareExchange()? I think it is a bit difficult to predict
the reason, and if we want to use anything which is not as per the msft docs,
we should do that carefully and have some way to ensure that it will
work fine.

In this case, I have a question for you.

Un-patched usage in barrier.h is as follows:
..

If I understand correctly the current define mechanism in barrier.h,
it will have a different definition for Itanium processors even for windows.

Either no one has ever tested postgres on itanium windows (quite
possible), because afaik _Asm_sched_fence() doesn't work on
windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
are for HP's acc on HPUX.

Hmm.. Do you think in such a case, it would have gone in below
define:
#elif defined(__INTEL_COMPILER)
/*
* icc defines __GNUC__, but doesn't support gcc's inline asm syntax
*/
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier() __mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier() _mm_mfence()
#endif

-#define pg_compiler_barrier() __memory_barrier()

Currently this will be considered as compiler barrier for
__INTEL_COMPILER, but after patch, I don't see this define. I think
after patch, it will be compiler_impl specific, but why such a change?

However the patch defines as below:

..

What I can understand from the above is that the defines in
storage/atomics-generic-msvc.h will override any previous defines
for compiler/memory barriers, and _ReadWriteBarrier()/MemoryBarrier()
will always be used for Windows.

Well, the memory barrier is surrounded by #ifndef
pg_memory_barrier_impl. The compiler barrier can't reasonably be defined
earlier since it's a compiler not an architecture thing.

I think as per your new compiler specific implementation stuff, this holds
true.

6.

Sure, I was also not asking for _impl functions. What I was asking
in this point was to have comments on top of definition of
pg_atomic_compare_exchange_u32() in atomics.h
In particular, on top of below and similar functions, rather than
at the place where they are declared.

Hm, we can do that. Don't think it'll be clearer (because you need to go
to the header anyway), but I don't feel strongly.

I think this depends on an individual's perspective, so do it the way
you feel is better; the intention of this point was just to not deviate
from existing coding conventions.

I'd much rather get rid of the separated definition/declaration, but
we'd need to rely on PG_USE_INLINE for it...

Okay.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#78Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#77)
Re: better atomics - v0.5

On 2014-07-01 10:44:19 +0530, Amit Kapila wrote:

I think for such usage, we need to rely on barriers wherever there is a
need for synchronisation, recent example I have noticed is in your patch
where we have to use pg_write_barrier() during wakeup. However if we
go by atomic ops definition, then no such dependency is required, like
if somewhere we need to use pg_atomic_compare_exchange_u32(),
after that we don't need to ensure about barriers, because this atomic
API adheres to Full barrier semantics.

I think defining barrier support for some atomic ops and not for others
will lead to inconsistency with specs and we need to additionally
define rules for such atomic ops usage.

Meh. It's quite common that read/load do not have barrier semantics. The
majority of read/write users won't need barriers and it'll be expensive
to provide them with them.

b) It's only supported from vista onwards. Afaik we still support XP.

Well, with barrier.h as it stands they'd get a compiler error if it
indeed is unsupported? But I think there's something else going on -
msft might just be removing XP from its documentation.

Okay, but why would they remove for MemoryBarrier() and not
for InterlockedCompareExchange(). I think it is bit difficult to predict
the reason and if we want to use anything which is not as msft docs,
we shall do that carefully and have some way to ensure that it will
work fine.

Well, we have quite good evidence for it working by knowing that
postgres has in the past worked on xp? If you have a better idea, send
in a patch for today's barrier.h and I'll adopt that.

In this case, I have a question for you.

Un-patched usage in barrier.h is as follows:
..

If I understand correctly the current define mechanism in barrier.h,
it will have different definition for Itanium processors even for windows.

Either no one has ever tested postgres on itanium windows (quite
possible), because afaik _Asm_sched_fence() doesn't work on
windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
are for HP's acc on HPUX.

Hmm.. Do you think in such a case, it would have gone in below
define:

#elif defined(__INTEL_COMPILER)
/*
* icc defines __GNUC__, but doesn't support gcc's inline asm syntax
*/
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier() __mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier() _mm_mfence()
#endif

Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's
acc, I don't see why?

But they should go into atomics-generic-acc.h.

-#define pg_compiler_barrier() __memory_barrier()

Currently this will be considered as compiler barrier for
__INTEL_COMPILER, but after patch, I don't see this define. I think
after patch, it will be compiler_impl specific, but why such a change?

#if defined(__INTEL_COMPILER)
#define pg_compiler_barrier_impl() __memory_barrier()
#else
#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")
#endif

There earlier was a typo defining pg_compiler_barrier instead of
pg_compiler_barrier_impl for intel, maybe that's what you were referring to?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#79Ants Aasma
ants@cybertec.at
In reply to: Andres Freund (#78)
Re: better atomics - v0.5

On Fri, Jun 27, 2014 at 9:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Imagine the situation for the buffer header spinlock which is one of the
bigger performance issues atm. We could aim to replace all usages of
that with clever and complicated logic, but it's hard.

The IMO reasonable (and prototyped) way to do it is to make the common
paths lockless, but fall back to the spinlock for the more complicated
situations. For the buffer header that means that pin/unpin and buffer
lookup are lockless, but IO and changing the identity of a buffer still
require the spinlock. My attempts to avoid the latter basically required
a buffer header specific reimplementation of spinlocks.
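
A sketch of the kind of slow path meant here (struct, field, flag and
function names are made up; SpinLockAcquire/Release are the existing
spinlock primitives):

typedef struct BufferDescSketch
{
    slock_t          header_lock;   /* buffer header spinlock */
    pg_atomic_uint32 state;         /* flag bits | refcount */
} BufferDescSketch;

#define BUF_FLAG_VALID  (1U << 30)

static void
change_identity_slowpath_sketch(BufferDescSketch *buf)
{
    SpinLockAcquire(&buf->header_lock);

    /*
     * The pin fastpath may touch 'state' concurrently, so even under the
     * spinlock the flag bits must be cleared with an atomic op.
     */
    pg_atomic_fetch_and_u32(&buf->state, ~BUF_FLAG_VALID);

    /* ... swap the buffer tag, start IO, etc. ... */

    SpinLockRelease(&buf->header_lock);
}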

There is a 2010 paper [1] that demonstrates a fully non-blocking
approach to buffer management using the same generalized clock
algorithm that PostgreSQL has. The site also has an implementation for
Apache Derby. You may find some interesting ideas in there.

[1]: http://code.google.com/p/derby-nb/source/browse/trunk/derby-nb/ICDE10_conf_full_409.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#80Andres Freund
andres@2ndquadrant.com
In reply to: Ants Aasma (#79)
Re: better atomics - v0.5

On 2014-07-02 09:27:52 +0300, Ants Aasma wrote:

On Fri, Jun 27, 2014 at 9:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Imagine the situation for the buffer header spinlock which is one of the
bigger performance issues atm. We could aim to replace all usages of
that with clever and complicated logic, but it's hard.

The IMO reasonable (and prototyped) way to do it is to make the common
paths lockless, but fall back to the spinlock for the more complicated
situations. For the buffer header that means that pin/unpin and buffer
lookup are lockless, but IO and changing the identity of a buffer still
require the spinlock. My attempts to avoid the latter basically required
a buffer header specific reimplementation of spinlocks.

There is a 2010 paper [1] that demonstrates a fully non-blocking
approach to buffer management using the same generalized clock
algorithm that PostgreSQL has. The site also has an implementation for
Apache Derby. You may find some interesting ideas in there.

[1] http://code.google.com/p/derby-nb/source/browse/trunk/derby-nb/ICDE10_conf_full_409.pdf

Interesting. Thanks for the link. I think I had pretty much all the
parts they described lockless as well (excluding the buffer mapping
hashtable itself, which they didn't focus on either), it was just
operations like replacing a dirty victim buffer where I fell back to the
spinlock.
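To make the shape of that split concrete, here is a minimal sketch using the
proposed pg_atomic API; the buffer state layout and the flag name are
invented for illustration, this is not code from the prototype:

typedef struct SketchBufferDesc
{
	pg_atomic_uint32 state;		/* refcount in the low bits, flags above */
	slock_t		spinlock;		/* still taken for IO / identity changes */
} SketchBufferDesc;

#define SKETCH_CHANGING_FLAG	0x00010000

static void
sketch_pin_buffer(SketchBufferDesc *buf)
{
	while (true)
	{
		uint32		old = pg_atomic_read_u32(&buf->state);

		if (old & SKETCH_CHANGING_FLAG)
		{
			/*
			 * Rare path: the buffer's identity is being changed.  Wait by
			 * going through the spinlock the changing backend holds, then
			 * retry the lockless path.
			 */
			SpinLockAcquire(&buf->spinlock);
			SpinLockRelease(&buf->spinlock);
			continue;
		}

		/* Common path: a single CAS bumps the refcount, no spinlock. */
		if (pg_atomic_compare_exchange_u32(&buf->state, &old, old + 1))
			break;

		/* Lost a race against another pin/unpin; just retry. */
	}
}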

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#81Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#74)
Re: better atomics - v0.5

Hi,

On 2014-06-30 20:28:52 +0200, Andres Freund wrote:

To quantify this, on my 2 socket xeon E5520 workstation - which is too
small to heavily show the contention problems in pgbench -S - the
numbers are:

pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeable variability)
master: 152354.294117
lwlocks-atomics: 170528.784958
lwlocks-atomics-spin: 159416.202578

pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeable variability)
master: 16273.702191
lwlocks-atomics: 16768.712433
lwlocks-atomics-spin: 16744.913083

So, there really isn't a problem with the emulation. It's not actually
that surprising - the absolute number of atomic ops is pretty much the
same. Where we earlier took a spinlock to increment shared we now still
take one.

Tom, in a nearby thread you've voiced concern that using atomics will
regress performance on platforms that don't have native atomics support.
Can these numbers at least ease that a bit?
I've compared native atomics against emulated atomics on 32bit x86, 64bit
x86 and POWER7; that's all I have access to at the moment. In all cases the
performance of lwlocks on top of emulated atomics was the same as the old
spinlock based algorithm, or even a bit better. I'm sure it's possible to
design atomics based algorithms where that's not the case, but I don't
think we need to solve those before we meet them.

I think for most contended places that can potentially be improved, two
things matter: the number of instructions that need to work atomically
(i.e. TAS/CAS/xchg for spinlocks, xadd/cmpxchg for atomics) and the
duration for which locks are held. When converting to atomics, even if
emulated, the hot paths shouldn't need more locked instructions than
before, and often enough the duration for which spinlocks are held will be
smaller.
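
As a concrete illustration of "no more locked instructions than before",
here is what bumping a shared counter looks like in both styles (a sketch,
the struct is made up):

typedef struct SketchCounter
{
	slock_t		mutex;			/* old style */
	uint32		count_old;
	pg_atomic_uint32 count_new;	/* new style */
} SketchCounter;

static void
bump_old_style(SketchCounter *c)
{
	/* one TAS inside SpinLockAcquire(), plus the spinlock hold time */
	SpinLockAcquire(&c->mutex);
	c->count_old++;
	SpinLockRelease(&c->mutex);
}

static void
bump_new_style(SketchCounter *c)
{
	/*
	 * One locked xadd - or, with the emulation, one TAS inside the
	 * per-variable fallback spinlock.  Either way the hot path executes
	 * the same number of atomic instructions as before, and the time the
	 * "lock" is held shrinks to the single addition.
	 */
	pg_atomic_fetch_add_u32(&c->count_new, 1);
}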

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#82Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#41)
Re: better atomics - v0.5

On 2014-06-26 00:42:37 +0300, Heikki Linnakangas wrote:

On 06/25/2014 11:36 PM, Andres Freund wrote:

- I completely loathe the file layout you've proposed in
src/include/storage. If we're going to have separate #include files
for each architecture (and I'd rather we didn't go that route, because
it's messy and unlike what we have now), then shouldn't that stuff go
in src/include/port or a new directory src/include/port/atomics?
Polluting src/include/storage with a whole bunch of files called
atomics-whatever.h strikes me as completely ugly, and there's a lot of
duplicate code in there.

I don't mind moving it somewhere else and it's easy enough to change.

I think having a separate file for each architecture is nice. I totally
agree that they don't belong in src/include/storage, though. s_lock.h has
always been misplaced there, but we've let it be for historical reasons, but
now that we're adding a dozen new files, it's time to move them out.

So,
src/include/port/atomics.h,
src/include/port/atomics/generic-$compiler.h,
src/include/port/atomics/arch-$arch.h,
src/include/port/atomics/fallback.h,
src/include/port/atomics/generic.h,
src/backend/port/atomics.c
?
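
Roughly, the top-level header would then just dispatch to those, along the
lines of the following sketch (the exact #if tests being whatever the
individual ports end up needing):

/* port/atomics.h, include-order sketch */

/* 1) architecture specific bits (barriers, special cases) */
#if defined(__i386__) || defined(__x86_64__)
#include "port/atomics/arch-x86.h"
#elif defined(__ia64__) || defined(__ia64)
#include "port/atomics/arch-ia64.h"
#endif

/* 2) compiler specific implementations of the primitives */
#if defined(__GNUC__)
#include "port/atomics/generic-gcc.h"
#elif defined(_MSC_VER)
#include "port/atomics/generic-msvc.h"
#endif

/* 3) spinlock/semaphore based emulation for whatever is still missing */
#include "port/atomics/fallback.h"

/* 4) operations that can be built generically from the primitives above */
#include "port/atomics/generic.h"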

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#83Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#78)
Re: better atomics - v0.5

On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-07-01 10:44:19 +0530, Amit Kapila wrote:

I think defining barrier support for some atomic ops and not for others
will lead to inconsistency with specs and we need to additionally
define rules for such atomic ops usage.

Meh. It's quite common that read/load do not have barrier semantics. The
majority of read/write users won't need barriers and it'll be expensive
to provide them.

So in such a case they shouldn't use the atomic API and should rather
access the variable directly. I think part of the purpose of using atomic
ops is that they provide synchronisation for reads/writes among processes;
earlier we didn't have any such facility, so wherever required we had no
option but to use barriers to achieve the same.

One more point here: in the patch, 64-bit atomic read/write are
implemented using the *_exchange_* API; won't this automatically ensure
that the 64-bit versions have barrier support?

+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
..
+ pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);
..
+}
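
(For reference, the way that works: a compare-and-exchange that does not
find its expected value still hands the current contents back through the
"expected" pointer, so a CAS against zero doubles as an atomic read --
sketched out:)

static inline uint64
sketch_read_u64_via_cas(volatile pg_atomic_uint64 *ptr)
{
	uint64		old = 0;

	/*
	 * If *ptr happens to be 0 the CAS "succeeds" and rewrites the same 0;
	 * otherwise it fails and copies the current contents into 'old'.  In
	 * both cases *ptr is left unchanged and 'old' is an atomic snapshot,
	 * at the cost of a locked read-modify-write, which on most platforms
	 * also implies a full barrier.
	 */
	pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);

	return old;
}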

b) It's only supported from vista onwards. Afaik we still support XP.

Well, we have quite good evidence for it working by knowing that
postgres has in the past worked on xp? If you have a better idea, send
in a patch for today's barrier.h and I'll adopt that.

Here the only point was to check whether it is better to use the version of
the InterLocked... API that is recommended for Itanium-based processors
and behaves the same on others. If you have any inclination/suggestion
about it, then I can try that.

In this case, I have a question for you.

Either no one has ever tested postgres on itanium windows (quite
possible), because afaik _Asm_sched_fence() doesn't work on
windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
are for HP's acc on HPUX.

Hmm.. Do you think that in such a case it would have gone into the define
below:

#elif defined(__INTEL_COMPILER)
/*
* icc defines __GNUC__, but doesn't support gcc's inline asm syntax
*/
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier() __mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier() _mm_mfence()
#endif

Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's
acc, I don't see why?

Sorry, I don't understand what you mean by the above?
The point I wanted to clarify here was whether it is possible that the
above defines lead to different definitions for Itanium on windows.

But they should go into atomics-generic-acc.h.

Okay, but in your current patch (0001-Basic-atomic-ops-implementation),
they seem to be in different files.

-#define pg_compiler_barrier() __memory_barrier()

Currently this will be considered as compiler barrier for
__INTEL_COMPILER, but after patch, I don't see this define. I think
after patch, it will be compiler_impl specific, but why such a change?

#if defined(__INTEL_COMPILER)
#define pg_compiler_barrier_impl() __memory_barrier()
#else
#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")

#endif

There earlier was a typo defining pg_compiler_barrier instead of
pg_compiler_barrier_impl for intel, maybe that's what you were referring to?

I have searched for this in the patch file
0001-Basic-atomic-ops-implementation and found that it has been removed
from barrier.h but not added anywhere else.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#83)
Re: better atomics - v0.5

On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Further review of patch:
1.
/*
* pg_atomic_test_and_set_flag - TAS()
*
* Acquire/read barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);

a. I think Acquire and read barrier semantics aren't equal.
With acquire semantics, "the results of the operation are available before
the results of any operation that appears after it in code", which means it
applies to both loads and stores. The definition of a read barrier only
covers loads.

So is it right to declare it like the above in the comments?

b. I think it adheres to Acquire semantics on platforms where those
semantics are supported, and otherwise defaults to full memory ordering.
Example:
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Is it right to declare that the generic version of the function adheres to
Acquire semantics?

2.
bool
pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
{
return TAS((slock_t *) &ptr->sema);
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

a. In above usage (ptr->sema), sema itself is not declared as volatile,
so would it be right to use it in this way for API
(InterlockedCompareExchange)
expecting volatile.

b. Also shouldn't this implementation be moved to atomics-generic-msvc.h

3.
/*
* pg_atomic_unlocked_test_flag - TAS()
*
* No barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);

a. There is no setting in this, so using TAS in function comments
seems to be wrong.
b. BTW, I am not able to see this function in the C11 specs.

4.
#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) &&
defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}

#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) &&
defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return (bool) pg_atomic_read_u32_impl(ptr);
}

Is there any reason to keep these two implementations different?

5.
atomics-generic-gcc.h
#ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return ptr->value == 0;
}
#endif

Don't we need to compare it with 1 instead of 0?

6.
atomics-fallback.h
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
/*
* Can't do this in the semaphore based implementation - always return
* true. Do this in the header so compilers can optimize the test away.
*/
return true;
}

Why can't we implement this in the semaphore based implementation?

7.
/*
* pg_atomic_clear_flag - release lock set by TAS()
*
* Release/write barrier semantics.
*/
STATIC_IF_INLINE_DECLARE void
pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);

a. Are Release and write barriers equivalent?

With release semantics, the results of the operation are available after the
results of any operation that appears before it in code.
A write barrier acts as a compiler barrier, and orders stores.

I think the two definitions are not the same.

b. IIUC the current code doesn't have release semantics for unlock/clear.
We are planning to introduce it in this patch; also, this doesn't follow the
specs, which say that clear should have memory_order_seq_cst ordering
semantics.

As per its current usage in the patch (S_UNLOCK), it seems reasonable to
have *release* semantics for this API; however, if we use it for generic
purposes then it might not be right to define it with Release semantics.

8.
#define PG_HAS_ATOMIC_CLEAR_FLAG
static inline void
pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
{
/* XXX: release semantics suffice? */
pg_memory_barrier_impl();
pg_atomic_write_u32_impl(ptr, 0);
}

Are we sure that having a memory barrier before clearing the flag will
achieve *Release* semantics? As per my understanding the definition
of Release semantics is as below, and the above doesn't seem to follow it:
"With release semantics, the results of the operation are available after
the results of any operation that appears before it in code."

9.
static inline uint32
pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
{
uint32 old;
while (true)
{
old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
break;
}
return old;
}

What is the reason to implement exchange as compare-and-exchange? At least
for Windows I know a corresponding function (InterlockedExchange) exists.
I could not see any other implementation of it.
I think this at least deserves a comment explaining why the implementation
was done this way.

10.
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);

The function name and input value seem to be inconsistent.
The function name contains *_u32*, however the input value is *int32*.

11.
Both pg_atomic_fetch_and_u32_impl() and pg_atomic_fetch_or_u32_impl() are
implemented using compare_exchange routines, don't we have any native API's
which we can use for implementation.

12.
I am not able to see API's pg_atomic_add_fetch_u32(),
pg_atomic_sub_fetch_u32()
in C11 atomic ops specs, do you need it for something specific?

13.
+ * The non-intrinsics version is only available in vista upwards, so use
the
+ * intrisc version.

spelling of intrinsic is wrong.

14.
* pg_memory_barrier - prevent the CPU from reordering memory access
*
* A memory barrier must act as a compiler barrier,

Is it ensured in all cases that pg_memory_barrier implies compiler barrier
as well?
For example, for the msvc case the specs don't seem to guarantee it.

I understand that this rule (or assumption) is carried forward from the
existing implementation; however, I am not sure it is true in all cases, so
I thought of mentioning it here even though we may not want to do anything
about it, considering it existing behaviour.

Links:
http://en.wikipedia.org/wiki/Memory_barrier
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684208(v=vs.85).aspx

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#85Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#84)
Re: better atomics - v0.5

On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

Further review of patch:
1.
/*
* pg_atomic_test_and_set_flag - TAS()
*
* Acquire/read barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);

a. I think Acquire and read barrier semantics aren't equal.
With acquire semantics, "the results of the operation are available before
the results of any operation that appears after it in code", which means it
applies to both loads and stores. The definition of a read barrier only
covers loads.

So is it right to declare it like the above in the comments?

Yeah. Barriers can be classified by the operations they keep from being
reordered (read, write, or both) and by the direction in which they
prevent reordering (acquire, release, or both). So in theory there
are 9 combinations:

read barrier (both directions)
read barrier (acquire)
read barrier (release)
write barrier (both directions)
write barrier (acquire)
write barrier (release)
full barrier (both directions)
full barrier (acquire)
full barrier (release)

To make things more confusing, a read barrier plus a write barrier
need not equal a full barrier. Read barriers prevent reads from being
reordered with other reads, and write barriers keep writes from being
reordered with other writes, but neither prevents a read from being
reordered relative to a write. A full barrier, however, must prevent
all reordering: read/read, read/write, write/read, and write/write.
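
The classic illustration, as a sketch using the existing barrier.h macros
(x and y are shared and start out as zero):

static volatile int x = 0;
static volatile int y = 0;

static int
backend_one(void)
{
	x = 1;
	pg_memory_barrier();	/* full barrier between the store and the load */
	return y;
}

static int
backend_two(void)
{
	y = 1;
	pg_memory_barrier();
	return x;
}

/*
 * With the full barriers in place it is impossible for both functions to
 * return 0.  Replacing them with pg_write_barrier() followed by
 * pg_read_barrier() does not help: neither one orders a store against a
 * later load, so the CPU may still let both loads see the old value 0.
 */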

Clarity here is essential.

barrier.h only proposed to implement the following:

read barrier (both directions), write barrier (both directions), full
barrier (both directions)

The reason I did that is because it seemed to be more or less what
Linux was doing, and it's generally suitable for lock-less algorithms.
The acquire and release semantics come into play specifically when
you're using barriers to implement locking primitives, which isn't
what I was trying to enable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#86Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#85)
Re: better atomics - v0.5

On Wed, Jul 9, 2014 at 8:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Further review of patch:
1.
/*
* pg_atomic_test_and_set_flag - TAS()
*
* Acquire/read barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);

a. I think Acquire and read barrier semantics aren't equal.
With acquire semantics, "the results of the operation are available before
the results of any operation that appears after it in code", which means it
applies to both loads and stores. The definition of a read barrier only
covers loads.

So is it right to declare it like the above in the comments?

Yeah. Barriers can be classified by the operations they keep from being
reordered (read, write, or both) and by the direction in which they
prevent reordering (acquire, release, or both). So in theory there
are 9 combinations:

read barrier (both directions)
read barrier (acquire)
read barrier (release)
write barrier (both directions)
write barrier (acquire)
write barrier (release)
full barrier (both directions)
full barrier (acquire)
full barrier (release)

With these definitions, I think it will fall into the *full barrier
(acquire)* category, as it needs to ensure that no reordering happens for
either loads or stores. As an example, we can refer to its implementation
for msvc, which follows full memory barrier semantics.
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

To make things more confusing, a read barrier plus a write barrier
need not equal a full barrier. Read barriers prevent reads from being
reordered with other reads, and write barriers keep writes from being
reordered with other writes, but neither prevents a read from being
reordered relative to a write. A full barrier, however, must prevent
all reordering: read/read, read/write, write/read, and write/write.

Clarity here is essential.

barrier.h only proposed to implement the following:

read barrier (both directions), write barrier (both directions), full
barrier (both directions)

As per my understanding of the general theory around barriers,
read and write are defined to avoid reordering due to compiler and
full memory barriers are defined to avoid reordering due to processors.
There are some processors that support instructions for memory
barriers with acquire, release and fence semantics.

I think the current definitions of the barriers are as per the needs of
their usage in PostgreSQL and not by reference to the standard (am I missing
something here); however, as you explained, the definitions are quite clear.

The reason I did that is because it seemed to be more or less what
Linux was doing, and it's generally suitable for lock-less algorithms.
The acquire and release semantics come into play specifically when
you're using barriers to implement locking primitives, which isn't
what I was trying to enable.

Now, by implementing atomic ops, I think we are moving more towards
lock-less algorithms, so I am not sure whether we want to just adhere to
the existing definitions you listed above (what the current barrier.h
implements) and implement accordingly, or whether we should extend the
existing definitions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#87Martijn van Oosterhout
kleptog@svana.org
In reply to: Amit Kapila (#86)
Re: better atomics - v0.5

On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:

As per my understanding of the general theory around barriers,
read and write are defined to avoid reordering due to compiler and
full memory barriers are defined to avoid reordering due to processors.
There are some processors that support instructions for memory
barriers with acquire, release and fence semantics.

Not exactly: both compilers and processors can rearrange loads and
stores. Because the architecture most developers work on (x86) has
such a strong memory model (it goes to a lot of effort to hide
reordering), people are under the impression that it's only the compiler
you need to worry about, but that's not true for any other
architecture.

Memory barriers/fences are on the whole discouraged if you can use
acquire/release semantics because the latter are faster and much easier
to optimise for.
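
A minimal sketch of the hand-off pattern acquire/release is meant for,
written with the gcc __atomic builtins (payload/ready are illustrative
names):

static int payload;
static int ready;

static void
producer(void)
{
	payload = 42;
	/* release: everything written above becomes visible before 'ready' */
	__atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
}

static int
consumer(void)
{
	/* acquire: nothing below is moved ahead of a load that saw ready == 1 */
	while (__atomic_load_n(&ready, __ATOMIC_ACQUIRE) == 0)
		;
	return payload;			/* guaranteed to see 42, no full fence used */
}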

I strongly recommend the "Atomic Weapons" talk on the C11 memory model
to help understand how they work. As a bonus it includes correct
implementations for the major architectures.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#88Amit Kapila
amit.kapila16@gmail.com
In reply to: Martijn van Oosterhout (#87)
Re: better atomics - v0.5

On Sat, Jul 12, 2014 at 1:26 AM, Martijn van Oosterhout <kleptog@svana.org>
wrote:

On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:

As per my understanding of the general theory around barriers,
read and write are defined to avoid reordering due to compiler and
full memory barriers are defined to avoid reordering due to processors.
There are some processors that support instructions for memory
barriers with acquire, release and fence semantics.

Not exactly: both compilers and processors can rearrange loads and
stores. Because the architecture most developers work on (x86) has
such a strong memory model (it goes to a lot of effort to hide
reordering), people are under the impression that it's only the compiler
you need to worry about, but that's not true for any other
architecture.

I am not saying that we need barriers only to avoid reordering by the
compiler; rather, my point was that we need to avoid reordering by both
the compiler and the processor, and the ways to achieve those require
different APIs.

Memory barriers/fences are on the whole discouraged if you can use
acquire/release semantics because the latter are faster and much easier
to optimise for.

Do all processors support instructions for memory barriers with acquire,
release semantics?

I strongly recommend the "Atomic Weapons" talk on the C11 memory model
to help understand how they work. As a bonus it includes correct
implementations for the major architectures.

Yes, it would be good if we could implement as per the C11 memory model,
and that's what I am referring to while reviewing this patch. However, even
if that is not possible, or is difficult to achieve in all cases, it is
worth having a discussion about the memory model used by the APIs currently
implemented in the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#89Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#84)
Re: better atomics - v0.5

Hi Amit,

On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:

On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Further review of patch:
1.
/*
* pg_atomic_test_and_set_flag - TAS()
*
* Acquire/read barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);

a. I think Acquire and read barrier semantics aren't equal.

Right. It was meant as Acquire (thus including read barrier) semantics. I
guess I'd better update README.barrier to have definitions of all barriers.

b. I think it adheres to Acquire semantics for platforms where
that semantics
are supported else it defaults to full memory ordering.
Example :
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Is it right to declare that generic version of function adheres to
Acquire semantics.

Implementing stronger semantics than required should always be
fine. There's simply no sane way to work with the variety of platforms
we support in any other way.

2.
bool
pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
{
return TAS((slock_t *) &ptr->sema);
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Where's that from? I can't recall adding a line of code like that?

a. In above usage (ptr->sema), sema itself is not declared as volatile,
so would it be right to use it in this way for API
(InterlockedCompareExchange)
expecting volatile.

Upgrading a pointer to volatile is defined, so I don't see the problem?

3.
/*
* pg_atomic_unlocked_test_flag - TAS()
*
* No barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);

a. There is no setting in this, so using TAS in function comments
seems to be wrong.

Good point.

b. BTW, I am not able see this function in C11 specs.

So? It's needed for a sensible implementation of spinlocks on top of
atomics.

4.
#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) &&
defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}

#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) &&
defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return (bool) pg_atomic_read_u32_impl(ptr);
}

Is there any reason to keep these two implementations different?

No, will change.

5.
atomics-generic-gcc.h
#ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return ptr->value == 0;
}
#endif

Don't we need to compare it with 1 instead of 0?

Why? It returns true if the lock is free, makes sense imo? Will add
comments to atomics.h

6.
atomics-fallback.h
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
/*
* Can't do this in the semaphore based implementation - always return
* true. Do this in the header so compilers can optimize the test away.
*/
return true;
}

Why can't we implement this in the semaphore based implementation?

Because semaphores don't provide the API for it?

7.
/*
* pg_atomic_clear_flag - release lock set by TAS()
*
* Release/write barrier semantics.
*/
STATIC_IF_INLINE_DECLARE void
pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);

a. Are Release and write barriers equivalent?

Same answer as above. Meant as "Release (thus including write barrier) semantics"

b. IIUC then current code doesn't have release semantics for unlock/clear.

If you're referring to s_lock.h with 'current code', yes it should have:
* On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
* S_UNLOCK() macros must further include hardware-level memory fence
* instructions to prevent similar re-ordering at the hardware level.
* TAS() and TAS_SPIN() must guarantee that loads and stores issued after
* the macro are not executed until the lock has been obtained. Conversely,
* S_UNLOCK() must guarantee that loads and stores issued before the macro
* have been executed before the lock is released.

We are planning to introduce it in this patch; also, this doesn't follow the
specs, which say that clear should have memory_order_seq_cst ordering
semantics.

Making it guarantee full memory barrier semantics is a major performance
loss on x86-64, so I don't want to do that.

Also there is atomic_flag_test_and_set_explicit()...

As per its current usage in patch (S_UNLOCK), it seems reasonable
to have *release* semantics for this API, however I think if we use it for
generic purpose then it might not be right to define it with Release
semantics.

I can't really see a user where it's not what you want. And if there is
- well, it can make the semantics stronger if it needs that.

8.
#define PG_HAS_ATOMIC_CLEAR_FLAG
static inline void
pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
{
/* XXX: release semantics suffice? */
pg_memory_barrier_impl();
pg_atomic_write_u32_impl(ptr, 0);
}

Are we sure that having memory barrier before clearing flag will
achieve *Release* semantics; as per my understanding the definition
of Release semantics is as below and the above doesn't seem to follow the
same.

Yes, a memory barrier guarantees a release semantics. It guarantees
more, but that's not a problem.

9.
static inline uint32
pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
{
uint32 old;
while (true)
{
old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
break;
}
return old;
}

What is the reason to implement exchange as compare-and-exchange? At least
for Windows I know a corresponding function (InterlockedExchange) exists.
I could not see any other implementation of it.
I think this at least deserves a comment explaining why the implementation
was done this way.

Because Robert and Tom insist that we shouldn't add more operations
employing hardware features. I think that's ok for now, we can always
add more capabilities later.

10.
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);

The function name and input value seem to be inconsistent.
The function name contains *_u32*, however the input value is *int32*.

A u32 is implemented by adding/subtracting a signed int. There are some
platforms where that's the only API provided by the intrinsics and it's good
enough for pretty much everything.

11.
Both pg_atomic_fetch_and_u32_impl() and pg_atomic_fetch_or_u32_impl() are
implemented using compare_exchange routines, don't we have any native API's
which we can use for implementation.

Same reason as for 9).

12.
I am not able to see API's pg_atomic_add_fetch_u32(),
pg_atomic_sub_fetch_u32()
in C11 atomic ops specs, do you need it for something specific?

Yes, it's already used in the lwlocks code and it's provided by several
other atomics libraries. I don't really see a reason not to provide it,
especially as some platforms could implement it more efficiently.

14.
* pg_memory_barrier - prevent the CPU from reordering memory access
*
* A memory barrier must act as a compiler barrier,

Is it ensured in all cases that pg_memory_barrier implies compiler barrier
as well?

Yes.

For example, for the msvc case the specs don't seem to guarantee it.

I think it actually indirectly guarantees it, but I'm on a plane and
can't check :)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#90Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#86)
Re: better atomics - v0.5

On 2014-07-10 08:46:55 +0530, Amit Kapila wrote:

As per my understanding of the general theory around barriers,
read and write are defined to avoid reordering due to compiler and
full memory barriers are defined to avoid reordering due to
processors.

No, that's not the case. There are several processors where write/read
barriers are an actual thing on the hardware level.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#91Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#89)
Re: better atomics - v0.5

On Mon, Jul 14, 2014 at 12:47 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:

On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Further review of patch:
1.
/*
* pg_atomic_test_and_set_flag - TAS()
*
* Acquire/read barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);

a. I think Acquire and read barrier semantics aren't equal.

Right. It was meant as Acquire (thus including read barrier) semantics.

Okay, I got confused by reading the comments; maybe saying this explicitly
in the comments would be better.

I guess I'd better update README.barrier to have definitions of all
barriers.

I think documenting the reasons behind using specific barriers for the
atomic operations defined by PG would be useful for people who want to
understand or use the atomic op APIs.

b. I think it adheres to Acquire semantics for platforms where
that semantics
are supported else it defaults to full memory ordering.
Example :
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Is it right to declare that generic version of function adheres to
Acquire semantics.

Implementing stronger semantics than required should always be
fine. There's simply no sane way to work with the variety of platforms
we support in any other way.

Agreed; maybe we can have a note somewhere, either in the code or the
readme, to mention it, if you think that would be helpful.

2.
bool
pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
{
return TAS((slock_t *) &ptr->sema);
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Where's that from?

It's there in s_lock.h. Refer to the code below:
#ifdef WIN32_ONLY_COMPILER
typedef LONG slock_t;

#define HAS_TEST_AND_SET
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

I can't recall adding a line of code like that?

It is not added by your patch but used in your patch.

I am not sure if that is exactly what you want the atomic API to define
for WIN32, but I think it is picking up this definition. I have one question
here: if the value of sema is something other than 1 or 0, then it won't set
the flag, and I am not sure what it will return (e.g. if the value is
negative), so is this implementation completely sane?

a. In above usage (ptr->sema), sema itself is not declared as volatile,
so would it be right to use it in this way for API
(InterlockedCompareExchange)
expecting volatile.

Upgrading a pointer to volatile is defined, so I don't see the problem?

That's okay. However, all the other structure definitions declare the value
as volatile; for example, in atomics-arch-amd64.h it is defined as follows:
#define PG_HAVE_ATOMIC_FLAG_SUPPORT
typedef struct pg_atomic_flag
{
volatile char value;
} pg_atomic_flag;

So is there any harm if we keep this definition in sync with others?

3.
/*
* pg_atomic_unlocked_test_flag - TAS()
*
* No barrier semantics.
*/
STATIC_IF_INLINE_DECLARE bool
pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);

a. There is no setting in this, so using TAS in function comments
seems to be wrong.

Good point.

b. BTW, I am not able to see this function in the C11 specs.

So? It's needed for a sensible implementation of spinlocks on top of
atomics.

Okay, got the point. Do you think it makes sense to have a common
agreement/consensus w.r.t. which APIs are necessary for a first cut?
For me as a reviewer, if an API is implemented as per the specs or as
intended, then we can have it; however, I am not sure there is a general
consensus, or at least a set of APIs some people have agreed is required
for the first cut. Have I missed some conclusion/discussion about it?

5.
atomics-generic-gcc.h
#ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return ptr->value == 0;
}
#endif

Don't we need to compare it with 1 instead of 0?

Why? It returns true if the lock is free, makes sense imo?

The other implementation, in atomics-generic.h, seems to implement it the
other way:
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}

Will add comments to atomics.h

I think that will be helpful.

6.
atomics-fallback.h
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
/*
* Can't do this in the semaphore based implementation - always return
* true. Do this in the header so compilers can optimize the test away.
*/
return true;
}

Why can't we implement this in the semaphore based implementation?

Because semaphores don't provide the API for it?

Okay, but it is not clear from the comments why this test should always
return true; basically, I am not able to see why, in the case of a missing
API, returning true is the correct behaviour for this API.

7.
/*
* pg_atomic_clear_flag - release lock set by TAS()
*
* Release/write barrier semantics.
*/
STATIC_IF_INLINE_DECLARE void
pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);

a. Are Release and write barriers equivalent?

Same answer as above. Meant as "Release (thus including write barrier)
semantics"

b. IIUC the current code doesn't have release semantics for unlock/clear.

If you're referring to s_lock.h with 'current code', yes it should have:

I was referring to the next point (point number 8), which you have
clarified below.

8.
#define PG_HAS_ATOMIC_CLEAR_FLAG
static inline void
pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
{
/* XXX: release semantics suffice? */
pg_memory_barrier_impl();
pg_atomic_write_u32_impl(ptr, 0);
}

Are we sure that having memory barrier before clearing flag will
achieve *Release* semantics; as per my understanding the definition
of Release semantics is as below and the above doesn't seem to follow the
same.

Yes, a memory barrier guarantees a release semantics. It guarantees
more, but that's not a problem.

You are right.

9.
static inline uint32
pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
{
uint32 old;
while (true)
{
old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
break;
}
return old;
}

What is the reason to implement exchange as compare-and-exchange? At least
for Windows I know a corresponding function (InterlockedExchange) exists.
I could not see any other implementation of it.
I think this at least deserves a comment explaining why the implementation
was done this way.

Because Robert and Tom insist that we shouldn't add more operations
employing hardware features. I think that's ok for now, we can always
add more capabilities later.

No issues; maybe a comment indicating some form of reason would be good.

10.
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);

The function name and input value seem to be inconsistent.
The function name contains *_u32*, however the input value is *int32*.

A u32 is implemented by adding/subtracting a signed int. There are some
platforms where that's the only API provided by the intrinsics and it's good
enough for pretty much everything.

As such no issues, but I am just wondering whether there is any reason to
prefer *_u32 instead of simply _32 or *32.

12.
I am not able to see API's pg_atomic_add_fetch_u32(),
pg_atomic_sub_fetch_u32()
in C11 atomic ops specs, do you need it for something specific?

Yes, it's already used in the lwlocks code and it's provided by several
other atomics libraries. I don't really see a reason not to provide it,
especially as some platforms could implement it more efficiently.

Okay, I found the below implementation in your patch. Won't it add the
value twice?

+#if !defined(PG_HAS_ATOMIC_ADD_FETCH_U32) &&
defined(PG_HAS_ATOMIC_FETCH_ADD_U32)
+#define PG_HAS_ATOMIC_ADD_FETCH_U32
+static inline uint32
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ return pg_atomic_fetch_add_u32_impl(ptr, add_) + add_;
+}
+#endif
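
(For reference, that is not double-counting: pg_atomic_fetch_add_u32_impl()
adds once and returns the value from *before* the addition, so the extra
"+ add_" merely reconstructs the post-add result. Worked example with
*ptr == 10 and add_ == 3: the shared value becomes 13, the call returns 10,
and 10 + 3 == 13 is what add_fetch is supposed to hand back.)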

14.
* pg_memory_barrier - prevent the CPU from reordering memory access
*
* A memory barrier must act as a compiler barrier,

Is it ensured in all cases that pg_memory_barrier implies compiler barrier
as well?

Yes.

For example, for the msvc case the specs don't seem to guarantee it.

I think it actually indirectly guarantees it, but I'm on a plane and
can't check :)

No problem, let me know when you are comfortable; this is just for
clarification.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#92Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#36)
Re: better atomics - v0.5

Hi Andres,

Are you planning to continue working on this? Summarizing the discussion
so far:

* Robert listed a bunch of little cleanup tasks
(/messages/by-id/CA+TgmoZShVVqjaKUL6P3kDvZVPibtgkzoti3M+Fvvjg5V+XJ0A@mail.gmail.com).
Amit posted yet more detailed comments.

* We talked about changing the file layout. I think everyone is happy
with your proposal here:
/messages/by-id/20140702131729.GP21169@awork2.anarazel.de,
with an overview description of what goes where
(/messages/by-id/CA+TgmoYuxfsm2dSy48=JgUTRh1mozrvmjLHgqrFkU7_WPV-xLA@mail.gmail.com)

* Talked about nested spinlocks. The consensus seems to be to (continue
to) forbid nested spinlocks, but allow atomic operations while holding a
spinlock. When atomics are emulated with spinlocks, it's OK to acquire
the emulation spinlock while holding another spinlock.

* Talked about whether emulating atomics with spinlocks is a good idea.
You posted performance results showing that at least with the patch to
use atomics to implement LWLocks, the emulation performs more or less
the same as the current spinlock-based implementation. I believe
everyone was more or less satisfied with that.

So ISTM we have consensus that the approach to spinlock emulation and
nesting stuff is OK. To finish the patch, you'll just need to address
the file layout and the laundry list of other little details that have
been raised.

- Heikki

#93Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#92)
Re: better atomics - v0.5

Hi,

On 2014-08-20 12:43:05 +0300, Heikki Linnakangas wrote:

Are you planning to continue working on this?

Yes, I am! I've recently been on holiday and now I'm busy with catching
up with everything that has happened since. I hope to have a next
version ready early next week.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#94Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#1)
5 attachment(s)
Re: better atomics - v0.6

Hi,

I've finally managed to incorporate (all the?) feedback I got for
0.5. Imo the current version looks pretty good.

Most notable changes:
* Lots of comment improvements
* code moved out of storage/ into port/
* separated the s_lock.h changes into its own commit
* merged i386/amd64 into one file
* fixed lots of the little details Amit noticed
* fixed lots of XXX/FIXMEs
* rebased to today's master
* tested various gcc/msvc versions
* extended the regression tests
* ...

The patches:
0001: The actual atomics API
0002: Implement s_lock.h support on top of the atomics API. Existing
implementations continue to be used unless
FORCE_ATOMICS_BASED_SPINLOCKS is defined
0003-0005: Not proposed for review here. Just included because code
actually using the atomics make testing them easier.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-a-basic-atomic-ops-API-abstracting-away-platform.patchtext/x-patch; charset=iso-8859-1Download
From e48df87841267f531e4fedb9be63f702f66c5a28 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 31 Jan 2014 10:53:20 +0100
Subject: [PATCH 1/5] Add a basic atomic ops API abstracting away
 platform/architecture details.

Several upcoming performance/scalability improvements require atomic
operations. This new API avoids the need to splatter compiler and
architecture dependent code over all the locations employing atomic
ops.

For several of the potential usages it'd be problematic to maintain
both an atomics-using implementation and one using spinlocks or
similar. In all likelihood one of the implementations would not get
tested regularly under concurrency. To avoid that scenario the new API
provides an automatic fallback of atomic operations to spinlocks. All
properties of atomic operations are maintained. This fallback -
obviously - isn't as fast as just using atomic ops, but it's not bad
either. For one of the future users the atomics-on-top-of-spinlocks
implementation was slightly faster than the old purely spinlock based
implementation.

The API, loosely modeled after the C11 atomics support, currently
provides 'atomic flags' and 32 bit unsigned integer types. If the
platform supports 64 bit unsigned integers those are also provided.

To implement atomics support for a platform/architecture/compiler for
a type of atomics 32bit compare and exchange needs to be
implemented. If available and more efficient native support for flags,
32 bit atomic addition, and corresponding 64 bit operations may also
be provided. Additional useful atomic operations are implemented
generically ontop of these.

The implementation for various versions of gcc on linux and msvc have
been tested. Additional existing stub implementations for
* Sun/Oracle sunpro compiler
* HPUX acc
* IBM xlc
are included but have never been tested. These will likely require
fixes based on buildfarm and user feedback.

As atomic operations also require barriers for some operations the
existing barrier support has been moved into the atomics code.

Author: Andres Freund
Reviewed-By: Amit Kapila and Robert Haas
---
 config/c-compiler.m4                             |  98 ++++
 configure                                        | 272 +++++++++--
 configure.in                                     |  32 +-
 src/backend/main/main.c                          |   1 +
 src/backend/port/Makefile                        |   2 +-
 src/backend/port/atomics.c                       | 127 ++++++
 src/backend/storage/lmgr/spin.c                  |  23 +-
 src/include/c.h                                  |   7 +
 src/include/pg_config.h.in                       |  21 +-
 src/include/pg_config.h.win32                    |   3 +
 src/include/pg_config_manual.h                   |   8 +
 src/include/port/atomics.h                       | 548 +++++++++++++++++++++++
 src/include/port/atomics/arch-arm.h              |  25 ++
 src/include/port/atomics/arch-hppa.h             |  17 +
 src/include/port/atomics/arch-ia64.h             |  26 ++
 src/include/port/atomics/arch-ppc.h              |  26 ++
 src/include/port/atomics/arch-x86.h              | 241 ++++++++++
 src/include/port/atomics/fallback.h              | 132 ++++++
 src/include/port/atomics/generic-acc.h           |  99 ++++
 src/include/port/atomics/generic-gcc.h           | 236 ++++++++++
 src/include/port/atomics/generic-msvc.h          | 103 +++++
 src/include/port/atomics/generic-sunpro.h        |  81 ++++
 src/include/port/atomics/generic-xlc.h           | 103 +++++
 src/include/port/atomics/generic.h               | 401 +++++++++++++++++
 src/include/storage/barrier.h                    | 155 +------
 src/include/storage/s_lock.h                     |   5 +-
 src/test/regress/expected/lock.out               |   8 +
 src/test/regress/input/create_function_1.source  |   5 +
 src/test/regress/output/create_function_1.source |   4 +
 src/test/regress/regress.c                       | 233 ++++++++++
 src/test/regress/sql/lock.sql                    |   5 +
 31 files changed, 2836 insertions(+), 211 deletions(-)
 create mode 100644 src/backend/port/atomics.c
 create mode 100644 src/include/port/atomics.h
 create mode 100644 src/include/port/atomics/arch-arm.h
 create mode 100644 src/include/port/atomics/arch-hppa.h
 create mode 100644 src/include/port/atomics/arch-ia64.h
 create mode 100644 src/include/port/atomics/arch-ppc.h
 create mode 100644 src/include/port/atomics/arch-x86.h
 create mode 100644 src/include/port/atomics/fallback.h
 create mode 100644 src/include/port/atomics/generic-acc.h
 create mode 100644 src/include/port/atomics/generic-gcc.h
 create mode 100644 src/include/port/atomics/generic-msvc.h
 create mode 100644 src/include/port/atomics/generic-sunpro.h
 create mode 100644 src/include/port/atomics/generic-xlc.h
 create mode 100644 src/include/port/atomics/generic.h

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 802f553..2cf74bb 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -300,3 +300,101 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_PROG_CC_LDFLAGS_OPT
+
+# PGAC_HAVE_GCC__SYNC_CHAR_TAS
+# -------------------------
+# Check if the C compiler understands __sync_lock_test_and_set(char),
+# and define HAVE_GCC__SYNC_CHAR_TAS
+#
+# NB: There are platforms where test_and_set is available but compare_and_swap
+# is not, so test this separately.
+# NB: Some platforms only do 32bit tas, others only do 8bit tas. Test both.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_CHAR_TAS],
+[AC_CACHE_CHECK(for builtin __sync char locking functions, pgac_cv_gcc_sync_char_tas,
+[AC_TRY_LINK([],
+  [char lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);],
+  [pgac_cv_gcc_sync_char_tas="yes"],
+  [pgac_cv_gcc_sync_char_tas="no"])])
+if test x"$pgac_cv_gcc_sync_char_tas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_CHAR_TAS, 1, [Define to 1 if you have __sync_lock_test_and_set(char *) and friends.])
+fi])# PGAC_HAVE_GCC__SYNC_CHAR_TAS
+
+# PGAC_HAVE_GCC__SYNC_INT32_TAS
+# -------------------------
+# Check if the C compiler understands __sync_lock_test_and_set(),
+# and define HAVE_GCC__SYNC_INT32_TAS
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_TAS],
+[AC_CACHE_CHECK(for builtin __sync int32 locking functions, pgac_cv_gcc_sync_int32_tas,
+[AC_TRY_LINK([],
+  [int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);],
+  [pgac_cv_gcc_sync_int32_tas="yes"],
+  [pgac_cv_gcc_sync_int32_tas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_TAS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_TAS
+
+# PGAC_HAVE_GCC__SYNC_INT32_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 32bit
+# types, and define HAVE_GCC__SYNC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __sync int32 atomic operations, pgac_cv_gcc_sync_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);],
+  [pgac_cv_gcc_sync_int32_cas="yes"],
+  [pgac_cv_gcc_sync_int32_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int *, int, int).])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_CAS
+
+# PGAC_HAVE_GCC__SYNC_INT64_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 64bit
+# types, and define HAVE_GCC__SYNC_INT64_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT64_CAS],
+[AC_CACHE_CHECK(for builtin __sync int64 atomic operations, pgac_cv_gcc_sync_int64_cas,
+[AC_TRY_LINK([],
+  [PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);],
+  [pgac_cv_gcc_sync_int64_cas="yes"],
+  [pgac_cv_gcc_sync_int64_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT64_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64).])
+fi])# PGAC_HAVE_GCC__SYNC_INT64_CAS
+
+# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+# -------------------------
+# Check if the C compiler understands __atomic_compare_exchange_n() for 32bit
+# types, and define HAVE_GCC__ATOMIC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __atomic int32 atomic operations, pgac_cv_gcc_atomic_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);],
+  [pgac_cv_gcc_atomic_int32_cas="yes"],
+  [pgac_cv_gcc_atomic_int32_cas="no"])])
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT32_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+
+# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
+# -------------------------
+# Check if the C compiler understands __atomic_compare_exchange_n() for 64bit
+# types, and define HAVE_GCC__ATOMIC_INT64_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT64_CAS],
+[AC_CACHE_CHECK(for builtin __atomic int64 atomic operations, pgac_cv_gcc_atomic_int64_cas,
+[AC_TRY_LINK([],
+  [PG_INT64_TYPE val = 0;
+   PG_INT64_TYPE expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);],
+  [pgac_cv_gcc_atomic_int64_cas="yes"],
+  [pgac_cv_gcc_atomic_int64_cas="no"])])
+if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT64_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int64 *, int *, int64).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
diff --git a/configure b/configure
index dac8e49..79bf764 100755
--- a/configure
+++ b/configure
@@ -802,6 +802,7 @@ enable_nls
 with_pgport
 enable_rpath
 enable_spinlocks
+enable_atomics
 enable_debug
 enable_profiling
 enable_coverage
@@ -1470,6 +1471,7 @@ Optional Features:
   --disable-rpath         do not embed shared library search path in
                           executables
   --disable-spinlocks     do not use spinlocks
+  --disable-atomics       do not use atomic operations
   --enable-debug          build with debugging symbols (-g)
   --enable-profiling      build with profiling enabled
   --enable-coverage       build with coverage testing instrumentation
@@ -3146,6 +3148,33 @@ fi
 
 
 #
+# Atomic operations
+#
+
+
+# Check whether --enable-atomics was given.
+if test "${enable_atomics+set}" = set; then :
+  enableval=$enable_atomics;
+  case $enableval in
+    yes)
+      :
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --enable-atomics option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  enable_atomics=yes
+
+fi
+
+
+
+#
 # --enable-debug adds -g to compiler flags
 #
 
@@ -8349,6 +8378,17 @@ $as_echo "$as_me: WARNING:
 *** Not using spinlocks will cause poor performance." >&2;}
 fi
 
+if test "$enable_atomics" = yes; then
+
+$as_echo "#define HAVE_ATOMICS 1" >>confdefs.h
+
+else
+  { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
+*** Not using atomic operations will cause poor performance." >&5
+$as_echo "$as_me: WARNING:
+*** Not using atomic operations will cause poor performance." >&2;}
+fi
+
 if test "$with_gssapi" = yes ; then
   if test "$PORTNAME" != "win32"; then
     { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing gss_init_sec_context" >&5
@@ -12154,40 +12194,6 @@ fi
 done
 
 
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin locking functions" >&5
-$as_echo_n "checking for builtin locking functions... " >&6; }
-if ${pgac_cv_gcc_int_atomics+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_gcc_int_atomics="yes"
-else
-  pgac_cv_gcc_int_atomics="no"
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_int_atomics" >&5
-$as_echo "$pgac_cv_gcc_int_atomics" >&6; }
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-
-$as_echo "#define HAVE_GCC_INT_ATOMICS 1" >>confdefs.h
-
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -13711,6 +13717,204 @@ _ACEOF
 fi
 
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync char locking functions" >&5
+$as_echo_n "checking for builtin __sync char locking functions... " >&6; }
+if ${pgac_cv_gcc_sync_char_tas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+char lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_char_tas="yes"
+else
+  pgac_cv_gcc_sync_char_tas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_char_tas" >&5
+$as_echo "$pgac_cv_gcc_sync_char_tas" >&6; }
+if test x"$pgac_cv_gcc_sync_char_tas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_CHAR_TAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync int32 locking functions" >&5
+$as_echo_n "checking for builtin __sync int32 locking functions... " >&6; }
+if ${pgac_cv_gcc_sync_int32_tas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int32_tas="yes"
+else
+  pgac_cv_gcc_sync_int32_tas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int32_tas" >&5
+$as_echo "$pgac_cv_gcc_sync_int32_tas" >&6; }
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT32_TAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync int32 atomic operations" >&5
+$as_echo_n "checking for builtin __sync int32 atomic operations... " >&6; }
+if ${pgac_cv_gcc_sync_int32_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int32_cas="yes"
+else
+  pgac_cv_gcc_sync_int32_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int32_cas" >&5
+$as_echo "$pgac_cv_gcc_sync_int32_cas" >&6; }
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT32_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync int64 atomic operations" >&5
+$as_echo_n "checking for builtin __sync int64 atomic operations... " >&6; }
+if ${pgac_cv_gcc_sync_int64_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int64_cas="yes"
+else
+  pgac_cv_gcc_sync_int64_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int64_cas" >&5
+$as_echo "$pgac_cv_gcc_sync_int64_cas" >&6; }
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT64_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __atomic int32 atomic operations" >&5
+$as_echo_n "checking for builtin __atomic int32 atomic operations... " >&6; }
+if ${pgac_cv_gcc_atomic_int32_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_atomic_int32_cas="yes"
+else
+  pgac_cv_gcc_atomic_int32_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_atomic_int32_cas" >&5
+$as_echo "$pgac_cv_gcc_atomic_int32_cas" >&6; }
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__ATOMIC_INT32_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __atomic int64 atomic operations" >&5
+$as_echo_n "checking for builtin __atomic int64 atomic operations... " >&6; }
+if ${pgac_cv_gcc_atomic_int64_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+PG_INT64_TYPE val = 0;
+   PG_INT64_TYPE expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_atomic_int64_cas="yes"
+else
+  pgac_cv_gcc_atomic_int64_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_atomic_int64_cas" >&5
+$as_echo "$pgac_cv_gcc_atomic_int64_cas" >&6; }
+if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__ATOMIC_INT64_CAS 1" >>confdefs.h
+
+fi
 
 if test "$PORTNAME" != "win32"
 then
diff --git a/configure.in b/configure.in
index 1392277..285d908 100644
--- a/configure.in
+++ b/configure.in
@@ -179,6 +179,12 @@ PGAC_ARG_BOOL(enable, spinlocks, yes,
               [do not use spinlocks])
 
 #
+# Atomic operations
+#
+PGAC_ARG_BOOL(enable, atomics, yes,
+              [do not use atomic operations])
+
+#
 # --enable-debug adds -g to compiler flags
 #
 PGAC_ARG_BOOL(enable, debug, no,
@@ -936,6 +942,13 @@ else
 *** Not using spinlocks will cause poor performance.])
 fi
 
+if test "$enable_atomics" = yes; then
+  AC_DEFINE(HAVE_ATOMICS, 1, [Define to 1 if you have atomics.])
+else
+  AC_MSG_WARN([
+*** Not using atomic operations will cause poor performance.])
+fi
+
 if test "$with_gssapi" = yes ; then
   if test "$PORTNAME" != "win32"; then
     AC_SEARCH_LIBS(gss_init_sec_context, [gssapi_krb5 gss 'gssapi -lkrb5 -lcrypto'], [],
@@ -1467,17 +1480,6 @@ fi
 AC_CHECK_FUNCS([strtoll strtoq], [break])
 AC_CHECK_FUNCS([strtoull strtouq], [break])
 
-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
-  [int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);],
-  [pgac_cv_gcc_int_atomics="yes"],
-  [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -1746,6 +1748,14 @@ AC_CHECK_TYPES([int8, uint8, int64, uint64], [], [],
 # C, but is missing on some old platforms.
 AC_CHECK_TYPES(sig_atomic_t, [], [], [#include <signal.h>])
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+PGAC_HAVE_GCC__SYNC_CHAR_TAS
+PGAC_HAVE_GCC__SYNC_INT32_TAS
+PGAC_HAVE_GCC__SYNC_INT32_CAS
+PGAC_HAVE_GCC__SYNC_INT64_CAS
+PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+PGAC_HAVE_GCC__ATOMIC_INT64_CAS
 
 if test "$PORTNAME" != "win32"
 then
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 413c982..c51b391 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -30,6 +30,7 @@
 #include "common/username.h"
 #include "postmaster/postmaster.h"
 #include "storage/barrier.h"
+#include "storage/s_lock.h"
 #include "storage/spin.h"
 #include "tcop/tcopprot.h"
 #include "utils/help_config.h"
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index 15ea873..c6b1d20 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -21,7 +21,7 @@ subdir = src/backend/port
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
+OBJS = atomics.o dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
 
 ifeq ($(PORTNAME), darwin)
 SUBDIRS += darwin
diff --git a/src/backend/port/atomics.c b/src/backend/port/atomics.c
new file mode 100644
index 0000000..32255b0
--- /dev/null
+++ b/src/backend/port/atomics.c
@@ -0,0 +1,127 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.c
+ *	   Non-Inline parts of the atomics implementation
+ *
+ * Portions Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/port/atomics.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/*
+ * We want the functions below to be inline; but if the compiler doesn't
+ * support that, fall back on providing them as regular functions.	See
+ * STATIC_IF_INLINE in c.h.
+ */
+#define ATOMICS_INCLUDE_DEFINITIONS
+
+#include "port/atomics.h"
+#include "storage/spin.h"
+
+#ifdef PG_HAVE_MEMORY_BARRIER_EMULATION
+void
+pg_spinlock_barrier(void)
+{
+	S_LOCK(&dummy_spinlock);
+	S_UNLOCK(&dummy_spinlock);
+}
+#endif
+
+
+#ifdef PG_HAVE_ATOMIC_FLAG_SIMULATION
+
+void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	StaticAssertStmt(sizeof(ptr->sema) >= sizeof(slock_t),
+					 "size mismatch of atomic_flag vs slock_t");
+
+#ifndef HAVE_SPINLOCKS
+	/*
+	 * NB: If we're using semaphore based TAS emulation, be careful to use a
+	 * separate set of semaphores. Otherwise we'd get in trouble if an atomic
+	 * variable were manipulated while a spinlock is held.
+	 */
+	s_init_lock_sema((slock_t *) &ptr->sema, true);
+#else
+	SpinLockInit((slock_t *) &ptr->sema);
+#endif
+}
+
+bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return TAS((slock_t *) &ptr->sema);
+}
+
+void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	S_UNLOCK((slock_t *) &ptr->sema);
+}
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+#ifdef PG_HAVE_ATOMIC_U32_SIMULATION
+void
+pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
+{
+	StaticAssertStmt(sizeof(ptr->sema) >= sizeof(slock_t),
+					 "size mismatch of atomic_flag vs slock_t");
+
+	/*
+	 * If we're using semaphore based atomic flags, be careful about nested
+	 * usage of atomics while a spinlock is held.
+	 */
+#ifndef HAVE_SPINLOCKS
+	s_init_lock_sema((slock_t *) &ptr->sema, true);
+#else
+	SpinLockInit((slock_t *) &ptr->sema);
+#endif
+	ptr->value = val_;
+}
+
+bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool ret;
+	/*
+	 * Do atomic op under a spinlock. It might look like we could just skip
+	 * the cmpxchg if the lock isn't available, but that'd just emulate a
+	 * 'weak' compare and swap. I.e. one that allows spurious failures. Since
+	 * several algorithms rely on a strong variant and that is efficiently
+	 * implementable on most major architectures let's emulate it here as
+	 * well.
+	 */
+	SpinLockAcquire((slock_t *) &ptr->sema);
+
+	/* perform compare/exchange logic */
+	ret = ptr->value == *expected;
+	*expected = ptr->value;
+	if (ret)
+		ptr->value = newval;
+
+	/* and release lock */
+	SpinLockRelease((slock_t *) &ptr->sema);
+
+	return ret;
+}
+
+uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 oldval;
+	SpinLockAcquire((slock_t *) &ptr->sema);
+	oldval = ptr->value;
+	ptr->value += add_;
+	SpinLockRelease((slock_t *) &ptr->sema);
+	return oldval;
+}
+
+#endif /* PG_HAVE_ATOMIC_U32_SIMULATION */
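
As an aside, not part of the patch: the reason the emulation above copies the
current value back into *expected on failure is that callers are expected to
loop. A minimal caller-side sketch, assuming only the pg_atomic_uint32 API from
port/atomics.h (the function name is made up for illustration):

/* illustrative only: advance a shared counter to at least 'target' */
static void
advance_to_at_least(volatile pg_atomic_uint32 *counter, uint32 target)
{
	uint32		expected = pg_atomic_read_u32(counter);

	/*
	 * On failure the CAS stores the current value back into 'expected', so
	 * the loop needs no separate re-read before retrying.
	 */
	while (expected < target &&
		   !pg_atomic_compare_exchange_u32(counter, &expected, target))
		;
}
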
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index e5a9d35..a382e77 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -67,7 +67,7 @@ SpinlockSemas(void)
 int
 SpinlockSemas(void)
 {
-	return NUM_SPINLOCK_SEMAPHORES;
+	return NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES;
 }
 
 /*
@@ -77,8 +77,9 @@ extern void
 SpinlockSemaInit(PGSemaphore spinsemas)
 {
 	int			i;
+	int			nsemas = SpinlockSemas();
 
-	for (i = 0; i < NUM_SPINLOCK_SEMAPHORES; ++i)
+	for (i = 0; i < nsemas; ++i)
 		PGSemaphoreCreate(&spinsemas[i]);
 	SpinlockSemaArray = spinsemas;
 }
@@ -88,32 +89,32 @@ SpinlockSemaInit(PGSemaphore spinsemas)
  */
 
 void
-s_init_lock_sema(volatile slock_t *lock)
+s_init_lock_sema(volatile slock_t *lock, bool nested)
 {
-	static int	counter = 0;
+   static int  counter = 0;
 
-	*lock = (++counter) % NUM_SPINLOCK_SEMAPHORES;
+   *lock = (++counter) % NUM_SPINLOCK_SEMAPHORES;
 }
 
 void
 s_unlock_sema(volatile slock_t *lock)
 {
-	PGSemaphoreUnlock(&SpinlockSemaArray[*lock]);
+   PGSemaphoreUnlock(&SpinlockSemaArray[*lock]);
 }
 
 bool
 s_lock_free_sema(volatile slock_t *lock)
 {
-	/* We don't currently use S_LOCK_FREE anyway */
-	elog(ERROR, "spin.c does not support S_LOCK_FREE()");
-	return false;
+   /* We don't currently use S_LOCK_FREE anyway */
+   elog(ERROR, "spin.c does not support S_LOCK_FREE()");
+   return false;
 }
 
 int
 tas_sema(volatile slock_t *lock)
 {
-	/* Note that TAS macros return 0 if *success* */
-	return !PGSemaphoreTryLock(&SpinlockSemaArray[*lock]);
+   /* Note that TAS macros return 0 if *success* */
+   return !PGSemaphoreTryLock(&SpinlockSemaArray[*lock]);
 }
 
 #endif   /* !HAVE_SPINLOCKS */
diff --git a/src/include/c.h b/src/include/c.h
index 2ceaaf6..ffbed89 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -862,6 +862,13 @@ typedef NameData *Name;
 #define STATIC_IF_INLINE
 #endif   /* PG_USE_INLINE */
 
+#ifdef PG_USE_INLINE
+#define STATIC_IF_INLINE_DECLARE static inline
+#else
+#define STATIC_IF_INLINE_DECLARE extern
+#endif   /* PG_USE_INLINE */
+
+
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
  * ----------------------------------------------------------------
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 1ce37c8..c261af7 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -87,6 +87,9 @@
 /* Define to 1 if you have the `append_history' function. */
 #undef HAVE_APPEND_HISTORY
 
+/* Define to 1 if you have atomics. */
+#undef HAVE_ATOMICS
+
 /* Define to 1 if you have the `cbrt' function. */
 #undef HAVE_CBRT
 
@@ -173,8 +176,24 @@
 /* Define to 1 if your compiler understands __FUNCTION__. */
 #undef HAVE_FUNCNAME__FUNCTION
 
+/* Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int). */
+#undef HAVE_GCC__ATOMIC_INT32_CAS
+
+/* Define to 1 if you have __atomic_compare_exchange_n(int64 *, int64 *,
+   int64). */
+#undef HAVE_GCC__ATOMIC_INT64_CAS
+
+/* Define to 1 if you have __sync_lock_test_and_set(char *) and friends. */
+#undef HAVE_GCC__SYNC_CHAR_TAS
+
+/* Define to 1 if you have __sync_compare_and_swap(int *, int, int). */
+#undef HAVE_GCC__SYNC_INT32_CAS
+
 /* Define to 1 if you have __sync_lock_test_and_set(int *) and friends. */
-#undef HAVE_GCC_INT_ATOMICS
+#undef HAVE_GCC__SYNC_INT32_TAS
+
+/* Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64). */
+#undef HAVE_GCC__SYNC_INT64_CAS
 
 /* Define to 1 if you have the `getaddrinfo' function. */
 #undef HAVE_GETADDRINFO
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index bce67a2..05941e6 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -334,6 +334,9 @@
 /* Define to 1 if you have spinlocks. */
 #define HAVE_SPINLOCKS 1
 
+/* Define to 1 if you have atomics. */
+#define HAVE_ATOMICS 1
+
 /* Define to 1 if you have the `srandom' function. */
 /* #undef HAVE_SRANDOM */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index d78f38e..195fb26 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -65,6 +65,14 @@
 #define NUM_SPINLOCK_SEMAPHORES		1024
 
 /*
+ * When we have neither spinlock nor atomic-operation support, atomic
+ * operations end up being implemented on top of spinlocks, which in turn are
+ * emulated using semaphores. To be safe when atomics are manipulated while a
+ * spinlock is held, a separate set of semaphores has to be used.
+ */
+#define NUM_ATOMICS_SEMAPHORES		64
+
+/*
  * Define this if you want to allow the lo_import and lo_export SQL
  * functions to be executed by ordinary users.  By default these
  * functions are only available to the Postgres superuser.  CAUTION:
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
new file mode 100644
index 0000000..8e075af
--- /dev/null
+++ b/src/include/port/atomics.h
@@ -0,0 +1,548 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.h
+ *	  Atomic operations.
+ *
+ * Hardware and compiler dependent functions for manipulating memory
+ * atomically and dealing with cache coherency. Used to implement locking
+ * facilities and lockless algorithms/data structures.
+ *
+ * To bring up postgres on a platform/compiler at the very least
+ * implementations for the following operations should be provided:
+ * * pg_compiler_barrier(), pg_write_barrier(), pg_read_barrier()
+ * * pg_atomic_compare_exchange_u32(), pg_atomic_fetch_add_u32()
+ * * pg_atomic_test_set_flag(), pg_atomic_init_flag(), pg_atomic_clear_flag()
+ *
+ * There exist generic, hardware independent, implementations for several
+ * compilers which might be sufficient, although possibly not optimal, for a
+ * new platform. If no such generic implementation is available spinlocks (or
+ * even OS provided semaphores) will be used to implement the API.
+ *
+ * Implement the _u64 variants if and only if your platform can use them
+ * efficiently (and obviously correctly).
+ *
+ * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
+ * whenever possible. Writing correct code using these facilities is hard.
+ *
+ * For an introduction to using memory barriers within the PostgreSQL backend,
+ * see src/backend/storage/lmgr/README.barrier
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/atomics.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ATOMICS_H
+#define ATOMICS_H
+
+#define INSIDE_ATOMICS_H
+
+#include <limits.h>
+
+/*
+ * First a set of architecture specific files is included.
+ *
+ * These files can provide the full set of atomics or can do pretty much
+ * nothing if all the compilers commonly used on these platforms provide
+ * usable generics. It will often make sense to define memory barrier
+ * semantics in those, since e.g. the compiler intrinsics for x86 memory
+ * barriers can't know that we don't need read/write barriers to be anything
+ * more than a compiler barrier.
+ *
+ * Don't add inline assembly for the actual atomic operations if all the
+ * compilers commonly used on your platform provide intrinsics. Intrinsics are
+ * much easier to understand and potentially support more architectures.
+ */
+#if defined(__arm__) || defined(__arm) || \
+	defined(__aarch64__) || defined(__aarch64)
+#	include "port/atomics/arch-arm.h"
+#elif defined(__i386__) || defined(__i386) || defined(__x86_64__)
+#	include "port/atomics/arch-x86.h"
+#elif defined(__ia64__) || defined(__ia64)
+#	include "port/atomics/arch-ia64.h"
+#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+#	include "port/atomics/arch-ppc.h"
+#elif defined(__hppa) || defined(__hppa__)
+#	include "port/atomics/arch-hppa.h"
+#endif
+
+/*
+ * Compiler specific, but architecture independent implementations.
+ *
+ * Provide architecture independent implementations of the atomic
+ * facilities. At the very least compiler barriers should be provided, but a
+ * full implementation of
+ * * pg_compiler_barrier(), pg_write_barrier(), pg_read_barrier()
+ * * pg_atomic_compare_exchange_u32(), pg_atomic_fetch_add_u32()
+ * using compiler intrinsics is a good idea.
+ */
+/* gcc or compatible, including clang and icc */
+#if defined(__GNUC__) || defined(__INTEL_COMPILER)
+#	include "port/atomics/generic-gcc.h"
+#elif defined(WIN32_ONLY_COMPILER)
+#	include "port/atomics/generic-msvc.h"
+#elif defined(__hpux) && defined(__ia64) && !defined(__GNUC__)
+#	include "port/atomics/generic-acc.h"
+#elif defined(__SUNPRO_C) && !defined(__GNUC__)
+#	include "port/atomics/generic-sunpro.h"
+#elif (defined(__IBMC__) || defined(__IBMCPP__)) && !defined(__GNUC__)
+#	include "port/atomics/generic-xlc.h"
+#else
+/*
+ * Unsupported compiler, we'll likely use slower fallbacks... At least
+ * compiler barriers should really be provided.
+ */
+#endif
+
+/*
+ * Provide a full fallback of the pg_*_barrier(), pg_atomic_*_flag and
+ * pg_atomic_*_u32 APIs for platforms without sufficient spinlock and/or
+ * atomics support. In the case of spinlock backed atomics the emulation is
+ * expected to be efficient, although less so than native atomics support.
+ */
+#include "port/atomics/fallback.h"
+
+/*
+ * Provide additional operations using supported infrastructure. These are
+ * expected to be efficient if the underlying atomic operations are efficient.
+ */
+#include "port/atomics/generic.h"
+
+/*
+ * Provide declarations for all functions here - on most platforms static
+ * inlines are used and these aren't necessary, but when static inline is
+ * unsupported these will be external functions.
+ */
+STATIC_IF_INLINE_DECLARE void pg_atomic_init_flag(volatile pg_atomic_flag *ptr);
+STATIC_IF_INLINE_DECLARE bool pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
+STATIC_IF_INLINE_DECLARE bool pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
+STATIC_IF_INLINE_DECLARE void pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
+
+STATIC_IF_INLINE_DECLARE void pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr);
+STATIC_IF_INLINE_DECLARE void pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval);
+STATIC_IF_INLINE_DECLARE bool pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+															 uint32 *expected, uint32 newval);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr,
+															  int32 add_, uint32 until);
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+STATIC_IF_INLINE_DECLARE void pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr);
+STATIC_IF_INLINE_DECLARE void pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval);
+STATIC_IF_INLINE_DECLARE bool pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+															 uint64 *expected, uint64 newval);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, int64 add_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, int64 sub_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_);
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+
+/*
+ * pg_compiler_barrier - prevent the compiler from moving code across
+ *
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ */
+#define pg_compiler_barrier()	pg_compiler_barrier_impl()
+
+/*
+ * pg_memory_barrier - prevent the CPU from reordering memory access
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ */
+#define pg_memory_barrier()	pg_memory_barrier_impl()
+
+/*
+ * pg_(read|write)_barrier - prevent the CPU from reordering memory access
+ *
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier.	Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores.  Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier.  In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ */
+#define pg_read_barrier()	pg_read_barrier_impl()
+#define pg_write_barrier()	pg_write_barrier_impl()
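
As an illustration, not part of the patch: the typical pattern these read/write
barriers are meant for is a flag-plus-payload handoff through shared memory.
The variable and function names below are made up:

static int	payload;			/* assumed to live in shared memory */
static volatile int ready = 0;	/* ditto */

static void
producer(void)
{
	payload = 42;
	pg_write_barrier();			/* order the payload store before the flag store */
	ready = 1;
}

static int
consumer(void)
{
	if (ready)
	{
		pg_read_barrier();		/* order the flag load before the payload load */
		return payload;			/* guaranteed to read 42 */
	}
	return -1;					/* not ready yet */
}
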
+
+/*
+ * Spinloop delay - a hint to the processor that we are busy-waiting
+ */
+#define pg_spin_delay()	pg_spin_delay_impl()
+
+/*
+ * The following functions are wrapper functions around the platform specific
+ * implementation of the atomic operations performing common checks.
+ */
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \
+	Assert(TYPEALIGN((uintptr_t)(ptr), bndr) == (uintptr_t)(ptr))
+
+/*
+ * pg_atomic_init_flag - initialize atomic flag.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	pg_atomic_init_flag_impl(ptr);
+}
+
+/*
+ * pg_atomic_test_set_flag - TAS()
+ *
+ * Returns true if the flag has successfully been set, false otherwise.
+ *
+ * Acquire (including read barrier) semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	return pg_atomic_test_set_flag_impl(ptr);
+}
+
+/*
+ * pg_atomic_unlocked_test_flag - Check if the lock is free
+ *
+ * Returns true if the flag currently is not set, false otherwise.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE bool
+pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	return pg_atomic_unlocked_test_flag_impl(ptr);
+}
+
+/*
+ * pg_atomic_clear_flag - release lock set by TAS()
+ *
+ * Release (including write barrier) semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_clear_flag(volatile pg_atomic_flag *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, sizeof(*ptr));
+
+	pg_atomic_clear_flag_impl(ptr);
+}
+
+
+/*
+ * pg_atomic_init_u32 - initialize atomic variable
+ *
+ * Has to be done before any concurrent usage.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	pg_atomic_init_u32_impl(ptr, val);
+}
+
+/*
+ * pg_atomic_read_u32 - unlocked read from atomic variable.
+ *
+ * The read is guaranteed to return a value as it has been written by this or
+ * another process at some point in the past. There's however no cache
+ * coherency interaction guaranteeing the value hasn't since been written to
+ * again.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_read_u32_impl(ptr);
+}
+
+/*
+ * pg_atomic_write_u32 - unlocked write to atomic variable.
+ *
+ * The write is guaranteed to succeed as a whole, i.e. it's not possible to
+ * observe a partial write for any reader.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	pg_atomic_write_u32_impl(ptr, val);
+}
+
+/*
+ * pg_atomic_exchange_u32 - exchange newval with current value
+ *
+ * Returns the old value of 'ptr' before the swap.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+
+	return pg_atomic_exchange_u32_impl(ptr, newval);
+}
+
+/*
+ * pg_atomic_compare_exchange_u32 - CAS operation
+ *
+ * Atomically compare the current value of ptr with *expected and store newval
+ * iff ptr and *expected have the same value. The current value of *ptr will
+ * always be stored in *expected.
+ *
+ * Return true if values have been exchanged, false otherwise.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+							   uint32 *expected, uint32 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	CHECK_POINTER_ALIGNMENT(expected, 4);
+
+	return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
+}
+
+/*
+ * pg_atomic_fetch_add_u32 - atomically add to variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_u32_impl(ptr, add_);
+}
+
+/*
+ * pg_atomic_fetch_sub_u32 - atomically subtract from variable
+ *
+ * Returns the value of ptr before the arithmetic operation. Note that
+ * sub_ may not be INT_MIN due to platform limitations.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	Assert(sub_ != INT_MIN);
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_);
+}
+
+/*
+ * pg_atomic_fetch_and_u32 - atomically bit-and and_ with variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_and_u32_impl(ptr, and_);
+}
+
+/*
+ * pg_atomic_fetch_or_u32 - atomically bit-or or_ with variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_or_u32_impl(ptr, or_);
+}
+
+/*
+ * pg_atomic_add_fetch_u32 - atomically add to variable
+ *
+ * Returns the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_add_fetch_u32_impl(ptr, add_);
+}
+
+/*
+ * pg_atomic_sub_fetch_u32 - atomically subtract from variable
+ *
+ * Returns the value of ptr after the arithmetic operation. Note that sub_
+ * may not be INT_MIN due to platform limitations.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	Assert(sub_ != INT_MIN);
+	return pg_atomic_sub_fetch_u32_impl(ptr, sub_);
+}
+
+/*
+ * pg_atomic_fetch_add_until_u32 - saturated addition to variable
+ *
+ * Returns the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
+							  uint32 until)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
+}
+
+/* ----
+ * The 64 bit operations have the same semantics as their 32bit counterparts
+ * if they are available. Check the corresponding 32bit function for
+ * documentation.
+ * ----
+ */
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+STATIC_IF_INLINE_DECLARE void
+pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+
+	pg_atomic_init_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_read_u64_impl(ptr);
+}
+
+STATIC_IF_INLINE void
+pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	pg_atomic_write_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+
+	return pg_atomic_exchange_u64_impl(ptr, newval);
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+							   uint64 *expected, uint64 newval)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	CHECK_POINTER_ALIGNMENT(expected, 8);
+	return pg_atomic_compare_exchange_u64_impl(ptr, expected, newval);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_add_u64_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	Assert(sub_ != -INT64CONST(0x7FFFFFFFFFFFFFFF) - 1);
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_and_u64_impl(ptr, and_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_fetch_or_u64_impl(ptr, or_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	return pg_atomic_add_fetch_u64_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 8);
+	Assert(sub_ != -INT64CONST(0x7FFFFFFFFFFFFFFF) - 1);
+	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
+}
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
+
+#undef INSIDE_ATOMICS_H
+
+#endif /* ATOMICS_H */
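
To make the intended use of the API above concrete, here is a minimal,
hypothetical sketch (not part of the patch) combining a flag and a counter. It
assumes "port/atomics.h" is included and the struct lives in shared memory; the
names are made up for illustration:

typedef struct SharedStats
{
	pg_atomic_flag	busy;
	pg_atomic_uint32 nrequests;
} SharedStats;

static void
shared_stats_init(SharedStats *stats)
{
	/* must run once, before any concurrent access */
	pg_atomic_init_flag(&stats->busy);
	pg_atomic_init_u32(&stats->nrequests, 0);
}

static bool
shared_stats_try_bump(SharedStats *stats)
{
	if (!pg_atomic_test_set_flag(&stats->busy))
		return false;			/* somebody else holds the flag */

	pg_atomic_fetch_add_u32(&stats->nrequests, 1);
	pg_atomic_clear_flag(&stats->busy);
	return true;
}
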
diff --git a/src/include/port/atomics/arch-arm.h b/src/include/port/atomics/arch-arm.h
new file mode 100644
index 0000000..1fe9de0
--- /dev/null
+++ b/src/include/port/atomics/arch-arm.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-arm.h
+ *	  Atomic operations considerations specific to ARM
+ *
+ * Portions Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-arm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/*
+ * 64 bit atomics on arm are implemented using kernel fallbacks and might be
+ * slow, so disable them entirely for now.
+ * XXX: We might want to reconsider that at some point for AARCH64
+ */
+#define PG_DISABLE_64_BIT_ATOMICS
diff --git a/src/include/port/atomics/arch-hppa.h b/src/include/port/atomics/arch-hppa.h
new file mode 100644
index 0000000..0c6d31a
--- /dev/null
+++ b/src/include/port/atomics/arch-hppa.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-hppa.h
+ *	  Atomic operations considerations specific to HPPA
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-hppa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* HPPA doesn't do either read or write reordering */
+#define pg_memory_barrier_impl()		pg_compiler_barrier_impl()
diff --git a/src/include/port/atomics/arch-ia64.h b/src/include/port/atomics/arch-ia64.h
new file mode 100644
index 0000000..92a088e
--- /dev/null
+++ b/src/include/port/atomics/arch-ia64.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-ia64.h
+ *	  Atomic operations considerations specific to intel itanium
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-ia64.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#if defined(__INTEL_COMPILER)
+#	define pg_memory_barrier_impl()		__mf()
+#elif defined(__GNUC__)
+#	define pg_memory_barrier_impl()		__asm__ __volatile__ ("mf" : : : "memory")
+#elif defined(__hpux)
+#	define pg_memory_barrier_impl()		_Asm_mf()
+#endif
diff --git a/src/include/port/atomics/arch-ppc.h b/src/include/port/atomics/arch-ppc.h
new file mode 100644
index 0000000..250b6c4
--- /dev/null
+++ b/src/include/port/atomics/arch-ppc.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-ppc.h
+ *	  Atomic operations considerations specific to PowerPC
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-ppc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(__GNUC__)
+
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
+#define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#endif
diff --git a/src/include/port/atomics/arch-x86.h b/src/include/port/atomics/arch-x86.h
new file mode 100644
index 0000000..15179ac
--- /dev/null
+++ b/src/include/port/atomics/arch-x86.h
@@ -0,0 +1,241 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-x86.h
+ *	  Atomic operations considerations specific to intel x86
+ *
+ * Note that we actually require a 486 upwards because the 386 doesn't have
+ * support for xadd and cmpxchg. Given that the 386 isn't supported anywhere
+ * anymore, that's not much of a restriction, luckily.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-x86.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Both 32 and 64 bit x86 do not allow loads to be reordered with other loads,
+ * or stores to be reordered with other stores, but a load can be performed
+ * before a subsequent store.
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered.  In those cases we'd need
+ * the read and write barriers to be lfence and sfence.  But since we don't
+ * do those things, a compiler barrier should be enough.
+ *
+ * "lock; addl" has worked for longer than "mfence". It's also rumored to be
+ * faster in many scenarios.
+ */
+
+#if defined(__INTEL_COMPILER)
+#define pg_memory_barrier_impl()		_mm_mfence()
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__i386))
+#define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory", "cc")
+#elif defined(__GNUC__) && defined(__x86_64__)
+#define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory", "cc")
+#endif
+
+#define pg_read_barrier_impl()		pg_compiler_barrier_impl()
+#define pg_write_barrier_impl()		pg_compiler_barrier_impl()
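
For illustration only, not part of the patch: the store->load reordering the
comment above refers to is the classic Dekker-style pattern. With the full
barrier at most one of the two backends can return true; with only a compiler
barrier both could observe the other's flag as 0. Names are made up:

static volatile int flag_a = 0;	/* assumed to live in shared memory */
static volatile int flag_b = 0;

static bool
backend_a_wants_in(void)
{
	flag_a = 1;
	pg_memory_barrier();		/* a compiler barrier is not enough here */
	return flag_b == 0;
}

static bool
backend_b_wants_in(void)
{
	flag_b = 1;
	pg_memory_barrier();
	return flag_a == 0;
}
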
+
+/*
+ * Provide implementation for atomics using inline assembly on x86 gcc. It's
+ * nice to support older gcc's and the compare/exchange implementation here is
+ * actually more efficient than the __sync variant.
+ */
+#if defined(HAVE_ATOMICS)
+
+#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
+
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef struct pg_atomic_flag
+{
+	volatile char value;
+} pg_atomic_flag;
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+/*
+ * It's too complicated to write inline asm for 64bit types on 32bit and the
+ * 486 can't do it.
+ */
+#ifdef __x86_64__
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+#endif
+
+#endif /* defined(__GNUC__) && !defined(__INTEL_COMPILER) */
+#endif /* defined(HAVE_ATOMICS) */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if !defined(PG_HAS_SPIN_DELAY)
+/*
+ * This sequence is equivalent to the PAUSE instruction ("rep" is
+ * ignored by old IA32 processors if the following instruction is
+ * not a string operation); the IA-32 Architecture Software
+ * Developer's Manual, Vol. 3, Section 7.7.2 describes why using
+ * PAUSE in the inner loop of a spin lock is necessary for good
+ * performance:
+ *
+ *     The PAUSE instruction improves the performance of IA-32
+ *     processors supporting Hyper-Threading Technology when
+ *     executing spin-wait loops and other routines where one
+ *     thread is accessing a shared lock or semaphore in a tight
+ *     polling loop. When executing a spin-wait loop, the
+ *     processor can suffer a severe performance penalty when
+ *     exiting the loop because it detects a possible memory order
+ *     violation and flushes the core processor's pipeline. The
+ *     PAUSE instruction provides a hint to the processor that the
+ *     code sequence is a spin-wait loop. The processor uses this
+ *     hint to avoid the memory order violation and prevent the
+ *     pipeline flush. In addition, the PAUSE instruction
+ *     de-pipelines the spin-wait loop to prevent it from
+ *     consuming execution resources excessively.
+ */
+#if defined(__INTEL_COMPILER)
+#define PG_HAS_SPIN_DELAY
+static inline void
+pg_spin_delay_impl(void)
+{
+	_mm_pause();
+}
+#elif defined(__GNUC__)
+#define PG_HAS_SPIN_DELAY
+static __inline__ void
+pg_spin_delay_impl(void)
+{
+	__asm__ __volatile__(
+		" rep; nop			\n");
+}
+#elif defined(WIN32_ONLY_COMPILER) && defined(__x86_64__)
+#define PG_HAS_SPIN_DELAY
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	_mm_pause();
+}
+#elif defined(WIN32_ONLY_COMPILER)
+#define PG_HAS_SPIN_DELAY
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	/* See comment for gcc code. Same code, MASM syntax */
+	__asm rep nop;
+}
+#endif
+#endif /* !defined(PG_HAS_SPIN_DELAY) */
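
As a usage sketch, not part of the patch: pg_spin_delay() is meant for loops
like the following; a real caller would also bound the number of spins. The
function name is made up:

static void
spin_until_acquired(volatile pg_atomic_flag *lock)
{
	while (!pg_atomic_test_set_flag(lock))
		pg_spin_delay();		/* emits PAUSE / rep;nop between attempts */
}
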
+
+
+#if defined(HAVE_ATOMICS)
+
+/* inline assembly implementation for gcc */
+#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	register char _res = 1;
+
+	__asm__ __volatile__(
+		"	lock			\n"
+		"	xchgb	%0,%1	\n"
+:		"+q"(_res), "+m"(ptr->value)
+:
+:		"memory");
+	return _res == 0;
+}
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	char	ret;
+
+	/*
+	 * Perform cmpxchg and use the zero flag which it implicitly sets when
+	 * equal to measure the success.
+	 */
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	cmpxchgl	%4,%5	\n"
+		"   setz		%2		\n"
+:		"=a" (*expected), "=m"(ptr->value), "=r" (ret)
+:		"a" (*expected), "r" (newval), "m"(ptr->value)
+:		"memory", "cc");
+	return (bool) ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 res;
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	xaddl	%0,%1		\n"
+:		"=q"(res), "=m"(ptr->value)
+:		"0" (add_), "m"(ptr->value)
+:		"memory", "cc");
+	return res;
+}
+
+#ifdef __x86_64__
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	char	ret;
+
+	/*
+	 * Perform cmpxchg and use the zero flag which it implicitly sets when
+	 * equal to measure the success.
+	 */
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	cmpxchgq	%4,%5	\n"
+		"   setz		%2		\n"
+:		"=a" (*expected), "=m"(ptr->value), "=r" (ret)
+:		"a" (*expected), "r" (newval), "m"(ptr->value)
+:		"memory", "cc");
+	return (bool) ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	uint64 res;
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	xaddq	%0,%1		\n"
+:		"=q"(res), "=m"(ptr->value)
+:		"0" (add_), "m"(ptr->value)
+:		"memory", "cc");
+	return res;
+}
+
+#endif /* __x86_64__ */
+
+#endif /* defined(__GNUC__) && !defined(__INTEL_COMPILER) */
+
+#endif /* HAVE_ATOMICS */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/fallback.h b/src/include/port/atomics/fallback.h
new file mode 100644
index 0000000..463bd5f
--- /dev/null
+++ b/src/include/port/atomics/fallback.h
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * fallback.h
+ *    Fallback for platforms without spinlock and/or atomics support. Slower
+ *    than native atomics support, but not unusably slow.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/atomics/fallback.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#	error "should be included via atomics.h"
+#endif
+
+#ifndef pg_memory_barrier_impl
+/*
+ * If we have no memory barrier implementation for this architecture, we
+ * fall back to acquiring and releasing a spinlock.  This might, in turn,
+ * fall back to the semaphore-based spinlock implementation, which will be
+ * amazingly slow.
+ *
+ * It's not self-evident that every possible legal implementation of a
+ * spinlock acquire-and-release would be equivalent to a full memory barrier.
+ * For example, I'm not sure that Itanium's acq and rel add up to a full
+ * fence.  But all of our actual implementations seem OK in this regard.
+ */
+#define PG_HAVE_MEMORY_BARRIER_EMULATION
+
+extern void pg_spinlock_barrier(void);
+#define pg_memory_barrier_impl pg_spinlock_barrier
+#endif
+
+/*
+ * If we don't have an atomics implementation for this platform, fall back to
+ * providing the atomics API using a spinlock to protect the internal state.
+ * Possibly the spinlock implementation uses semaphores internally...
+ *
+ * We have to be a bit careful here, as it's not guaranteed that atomic
+ * variables are mapped to the same address in every process (e.g. dynamic
+ * shared memory segments). We can't just hash the address and use that to map
+ * to a spinlock. Instead assign a spinlock on initialization of the atomic
+ * variable.
+ */
+#ifndef PG_HAVE_ATOMIC_FLAG_SUPPORT
+
+#define PG_HAVE_ATOMIC_FLAG_SIMULATION
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+
+typedef struct pg_atomic_flag
+{
+	/*
+	 * To avoid circular includes we can't use s_lock as a type here. Instead
+	 * just reserve enough space for all spinlock types. Some platforms would
+	 * be content with just one byte instead of 4, but that's not too much
+	 * waste.
+	 */
+#if defined(__hppa) || defined(__hppa__)	/* HP PA-RISC, GCC and HP compilers */
+	int			sema[4];
+#else
+	int			sema;
+#endif
+} pg_atomic_flag;
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SUPPORT */
+
+#if !defined(PG_HAVE_ATOMIC_U32_SUPPORT)
+
+#define PG_HAVE_ATOMIC_U32_SIMULATION
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	/* Check pg_atomic_flag's definition above for an explanation */
+#if defined(__hppa) || defined(__hppa__)	/* HP PA-RISC, GCC and HP compilers */
+	int			sema[4];
+#else
+	int			sema;
+#endif
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#endif /* PG_HAVE_ATOMIC_U32_SUPPORT */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#ifdef PG_HAVE_ATOMIC_FLAG_SIMULATION
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+extern void pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+extern bool pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+extern void pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/*
+	 * Can't do this efficiently in the semaphore based implementation - we'd
+	 * have to try to acquire the semaphore - so always return true. That's
+	 * correct, because this is only an unlocked test anyway. Do this in the
+	 * header so compilers can optimize the test away.
+	 */
+	return true;
+}
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+#ifdef PG_HAVE_ATOMIC_U32_SIMULATION
+
+#define PG_HAS_ATOMIC_INIT_U32
+extern void pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_);
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+extern bool pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+												uint32 *expected, uint32 newval);
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+extern uint32 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_);
+
+#endif /* PG_HAVE_ATOMIC_U32_SIMULATION */
+
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic-acc.h b/src/include/port/atomics/generic-acc.h
new file mode 100644
index 0000000..9525cde
--- /dev/null
+++ b/src/include/port/atomics/generic-acc.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-acc.h
+ *	  Atomic operations support when using HP's aCC compiler on HP-UX
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * inline assembly for Itanium-based HP-UX:
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf
+ * * Implementing Spinlocks on the Intel ® Itanium ® Architecture and PA-RISC
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf
+ *
+ * Itanium only supports a small set of numbers (-16, -8, -4, -1, 1, 4, 8, 16)
+ * for atomic add/sub, so we just implement everything but compare_exchange
+ * via the compare_exchange fallbacks in atomics/generic.h.
+ *
+ * src/include/port/atomics/generic-acc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <machine/sys/inline.h>
+
+/* IA64 always has 32/64 bit atomics */
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#define pg_compiler_barrier_impl()	_Asm_sched_fence()
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define MINOR_FENCE (_Asm_fence) (_UP_CALL_FENCE | _UP_SYS_FENCE | \
+								 _DOWN_CALL_FENCE | _DOWN_SYS_FENCE )
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	/*
+	 * We want a barrier, not just release/acquire semantics.
+	 */
+	_Asm_mf();
+	/*
+	 * Notes:
+	 * _DOWN_MEM_FENCE | _UP_MEM_FENCE prevents reordering by the compiler
+	 */
+	current =  _Asm_cmpxchg(_SZ_W, /* word */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	_Asm_mf();
+	current =  _Asm_cmpxchg(_SZ_D, /* doubleword */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#undef MINOR_FENCE
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic-gcc.h b/src/include/port/atomics/generic-gcc.h
new file mode 100644
index 0000000..9b0c636
--- /dev/null
+++ b/src/include/port/atomics/generic-gcc.h
@@ -0,0 +1,236 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-gcc.h
+ *	  Atomic operations, implemented using gcc (or compatible) intrinsics.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Legacy __sync Built-in Functions for Atomic Memory Access
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fsync-Builtins.html
+ * * Built-in functions for memory model aware atomic operations
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fatomic-Builtins.html
+ *
+ * src/include/port/atomics/generic-gcc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/*
+ * icc provides all the same intrinsics but doesn't understand gcc's inline asm
+ */
+#if defined(__INTEL_COMPILER)
+/* NB: Yes, __memory_barrier() is actually just a compiler barrier */
+#define pg_compiler_barrier_impl()	__memory_barrier()
+#else
+#define pg_compiler_barrier_impl()	__asm__ __volatile__("" ::: "memory")
+#endif
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory barrier
+ * out of this compiler built-in.  But we prefer to rely on platform specific
+ * definitions where possible, and use this only as a fallback.
+ */
+#if !defined(pg_memory_barrier_impl)
+#	if defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#		define pg_memory_barrier_impl()		__atomic_thread_fence(__ATOMIC_SEQ_CST)
+#	elif (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
+#		define pg_memory_barrier_impl()		__sync_synchronize()
+#	endif
+#endif /* !defined(pg_memory_barrier_impl) */
+
+#if !defined(pg_read_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+/* acquire semantics include read barrier semantics */
+#		define pg_read_barrier_impl()		__atomic_thread_fence(__ATOMIC_ACQUIRE)
+#endif
+
+#if !defined(pg_write_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+/* release semantics include write barrier semantics */
+#		define pg_write_barrier_impl()		__atomic_thread_fence(__ATOMIC_RELEASE)
+#endif
+
+#ifdef HAVE_ATOMICS
+
+/* generic gcc based atomic flag implementation */
+#if !defined(PG_HAVE_ATOMIC_FLAG_SUPPORT) \
+	&& (defined(HAVE_GCC__SYNC_INT32_TAS) || defined(HAVE_GCC__SYNC_CHAR_TAS))
+
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef struct pg_atomic_flag
+{
+	/* some platforms only have an 8 bit wide TAS */
+#ifdef HAVE_GCC__SYNC_CHAR_TAS
+	volatile char value;
+#else
+	/* but an int works on more platforms */
+	volatile int value;
+#endif
+} pg_atomic_flag;
+
+#endif /* !ATOMIC_FLAG_SUPPORT && SYNC_INT32_TAS */
+
+/* generic gcc based atomic uint32 implementation */
+#if !defined(PG_HAVE_ATOMIC_U32_SUPPORT) \
+	&& (defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS))
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#endif /* defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS) */
+
+/* generic gcc based atomic uint64 implementation */
+#if !defined(PG_HAVE_ATOMIC_U64_SUPPORT) \
+	&& !defined(PG_DISABLE_64_BIT_ATOMICS) \
+	&& (defined(HAVE_GCC__ATOMIC_INT64_CAS) || defined(HAVE_GCC__SYNC_INT64_CAS))
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#endif /* defined(HAVE_GCC__ATOMIC_INT64_CAS) || defined(HAVE_GCC__SYNC_INT64_CAS) */
+
+/*
+ * Implementation follows. Inlined or directly included from atomics.c
+ */
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && \
+	(defined(HAVE_GCC__SYNC_CHAR_TAS) || defined(HAVE_GCC__SYNC_INT32_TAS))
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* NB: only an acquire barrier, not a full one */
+	/* some platforms only support setting a 1 here */
+	return __sync_lock_test_and_set(&ptr->value, 1) == 0;
+}
+#endif /* !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(HAVE_GCC__SYNC_*_TAS) */
+
+#ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return ptr->value == 0;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_CLEAR_FLAG
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/*
+	 * XXX: It would be nicer to use __sync_lock_release here, but gcc insists
+	 * on making that an atomic op, which is far too expensive and a stronger
+	 * guarantee than we actually need.
+	 */
+	pg_write_barrier_impl();
+	ptr->value = 0;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_INIT_FLAG
+#define PG_HAS_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_clear_flag_impl(ptr);
+}
+#endif
+
+/* prefer __atomic, it has a better API */
+#if !defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	/* FIXME: we can probably use a lower consistency model */
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return __sync_fetch_and_add(&ptr->value, add_);
+}
+#endif
+
+
+#if !defined(PG_DISABLE_64_BIT_ATOMICS)
+
+#if !defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return __sync_fetch_and_add(&ptr->value, add_);
+}
+#endif
+
+#endif /* !defined(PG_DISABLE_64_BIT_ATOMICS) */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
+
+#endif /* defined(HAVE_ATOMICS) */
+
diff --git a/src/include/port/atomics/generic-msvc.h b/src/include/port/atomics/generic-msvc.h
new file mode 100644
index 0000000..174b835
--- /dev/null
+++ b/src/include/port/atomics/generic-msvc.h
@@ -0,0 +1,103 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-msvc.h
+ *	  Atomic operations support when using MSVC
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Interlocked Variable Access
+ *   http://msdn.microsoft.com/en-us/library/ms684122%28VS.85%29.aspx
+ *
+ * src/include/port/atomics/generic-msvc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <intrin.h>
+#include <windows.h>
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/* Should work on both MSVC and Borland. */
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier_impl()	_ReadWriteBarrier()
+
+#ifndef pg_memory_barrier_impl
+#define pg_memory_barrier_impl()	MemoryBarrier()
+#endif
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = InterlockedCompareExchange(&ptr->value, newval, *expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return InterlockedExchangeAdd(&ptr->value, add_);
+}
+
+/*
+ * The non-intrinsic versions are only available from Vista upwards, so use the
+ * intrinsic version. Only supported on >486, but we require XP as a minimum
+ * baseline, which doesn't support the 486, so we don't need to add checks for
+ * that case.
+ */
+#pragma intrinsic(_InterlockedCompareExchange64)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = _InterlockedCompareExchange64(&ptr->value, newval, *expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+/* Only implemented on Itanium and 64bit builds */
+#ifdef _WIN64
+#pragma intrinsic(_InterlockedExchangeAdd64)
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return _InterlockedExchangeAdd64(&ptr->value, add_);
+}
+#endif /* _WIN64 */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic-sunpro.h b/src/include/port/atomics/generic-sunpro.h
new file mode 100644
index 0000000..10ac70d
--- /dev/null
+++ b/src/include/port/atomics/generic-sunpro.h
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-sunpro.h
+ *	  Atomic operations for Solaris' CC
+ *
+ * Portions Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * manpage for atomic_cas(3C)
+ *   http://www.unix.com/man-page/opensolaris/3c/atomic_cas/
+ *   http://docs.oracle.com/cd/E23824_01/html/821-1465/atomic-cas-3c.html
+ *
+ * src/include/port/atomics/generic-sunpro.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	current = atomic_cas_32(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return atomic_add_32(&ptr->value, add_);
+}
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	current = atomic_cas_64(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return atomic_add_64(&ptr->value, add_);
+}
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
new file mode 100644
index 0000000..0b6e116
--- /dev/null
+++ b/src/include/port/atomics/generic-xlc.h
@@ -0,0 +1,103 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-xlc.h
+ *	  Atomic operations for IBM's CC
+ *
+ * Portions Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Synchronization and atomic built-in functions
+ *   http://publib.boulder.ibm.com/infocenter/lnxpcomp/v8v101/topic/com.ibm.xlcpp8l.doc/compiler/ref/bif_sync.htm
+ *
+ * src/include/port/atomics/generic-xlc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+
+/* 64bit atomics are only supported in 64bit mode */
+#ifdef __64BIT__
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#endif /* __64BIT__ */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+
+	/*
+	 * xlc's documentation tells us:
+	 * "If __compare_and_swap is used as a locking primitive, insert a call to
+	 * the __isync built-in function at the start of any critical sections."
+	 */
+	__isync();
+
+	/*
+	 * XXX: __compare_and_swap is defined to take signed parameters, but that
+	 * shouldn't matter since we don't perform any arithmetic operations.
+	 */
+	ret = __compare_and_swap((volatile int*)&ptr->value,
+							 (int *)expected, (int)newval);
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return __fetch_and_add(&ptr->value, add_);
+}
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+#define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+
+	__isync();
+
+	ret = __compare_and_swaplp((volatile long*)&ptr->value,
+							   (long *)expected, (long)newval);
+	return ret;
+}
+
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return __fetch_and_addlp(&ptr->value, add_);
+}
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
new file mode 100644
index 0000000..e437c72
--- /dev/null
+++ b/src/include/port/atomics/generic.h
@@ -0,0 +1,401 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic.h
+ *	  Implement higher level operations based on some lower level atomic
+ *	  operations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/atomics/generic.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#	error "should be included via atomics.h"
+#endif
+
+/*
+ * If read or write barriers are undefined, we upgrade them to full memory
+ * barriers.
+ */
+#if !defined(pg_read_barrier_impl)
+#	define pg_read_barrier_impl pg_memory_barrier_impl
+#endif
+#if !defined(pg_write_barrier_impl)
+#	define pg_write_barrier_impl pg_memory_barrier_impl
+#endif
+
+#ifndef PG_HAS_SPIN_DELAY
+#define PG_HAS_SPIN_DELAY
+#define pg_spin_delay_impl()	((void)0)
+#endif
+
+
+/* provide fallback */
+#if !defined(PG_HAVE_ATOMIC_FLAG_SUPPORT) && defined(PG_HAVE_ATOMIC_U32_SUPPORT)
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef pg_atomic_uint32 pg_atomic_flag;
+#endif
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#ifndef PG_HAS_ATOMIC_READ_U32
+#define PG_HAS_ATOMIC_READ_U32
+static inline uint32
+pg_atomic_read_u32_impl(volatile pg_atomic_uint32 *ptr)
+{
+	return *(&ptr->value);
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_WRITE_U32
+#define PG_HAS_ATOMIC_WRITE_U32
+static inline void
+pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	ptr->value = val;
+}
+#endif
+
+/*
+ * provide fallback for test_and_set using atomic_exchange if available
+ */
+#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_exchange_u32_impl(ptr, 1) == 0;
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_read_u32_impl(ptr) == 0;
+}
+
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier_impl();
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+/*
+ * provide fallback for test_and_set using atomic_compare_exchange if
+ * available.
+ */
+#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+
+#define PG_HAS_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAS_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	uint32 value = 0;
+	return pg_atomic_compare_exchange_u32_impl(ptr, &value, 1);
+}
+
+#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_read_u32_impl(ptr) == 0;
+}
+
+#define PG_HAS_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier_impl();
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG)
+#	error "No pg_atomic_test_and_set provided"
+#endif /* !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) */
+
+
+#ifndef PG_HAS_ATOMIC_INIT_U32
+#define PG_HAS_ATOMIC_INIT_U32
+static inline void
+pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
+{
+	pg_atomic_write_u32_impl(ptr, val_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_EXCHANGE_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_EXCHANGE_U32
+static inline uint32
+pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_SUB_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_SUB_U32
+static inline uint32
+pg_atomic_fetch_sub_u32_impl(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	return pg_atomic_fetch_add_u32_impl(ptr, -sub_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_AND_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_AND_U32
+static inline uint32
+pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_OR_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_OR_U32
+static inline uint32
+pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_ADD_FETCH_U32) && defined(PG_HAS_ATOMIC_FETCH_ADD_U32)
+#define PG_HAS_ATOMIC_ADD_FETCH_U32
+static inline uint32
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return pg_atomic_fetch_add_u32_impl(ptr, add_) + add_;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_SUB_FETCH_U32) && defined(PG_HAS_ATOMIC_FETCH_SUB_U32)
+#define PG_HAS_ATOMIC_SUB_FETCH_U32
+static inline uint32
+pg_atomic_sub_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_UNTIL_U32) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAS_ATOMIC_FETCH_ADD_UNTIL_U32
+static inline uint32
+pg_atomic_fetch_add_until_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_, uint32 until)
+{
+	uint32 old;
+	while (true)
+	{
+		uint32 new_val;
+		old = pg_atomic_read_u32_impl(ptr);
+		if (old >= until)
+			break;
+
+		new_val = old + add_;
+		if (new_val > until)
+			new_val = until;
+
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, new_val))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+#if !defined(PG_HAS_ATOMIC_EXCHANGE_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_EXCHANGE_U64
+static inline uint64
+pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = ptr->value;
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_WRITE_U64
+#define PG_HAS_ATOMIC_WRITE_U64
+static inline void
+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	/*
+	 * 64 bit writes aren't safe on all platforms. The generic implementation
+	 * therefore implements them as an atomic exchange. That might store a
+	 * 0, but only if the previous value also was a 0.
+	 */
+	pg_atomic_exchange_u64_impl(ptr, val);
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_READ_U64
+#define PG_HAS_ATOMIC_READ_U64
+static inline uint64
+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
+	uint64 old = 0;
+
+	/*
+	 * 64 bit reads aren't safe on all platforms. The generic implementation
+	 * therefore implements them as a compare/exchange with 0. That'll either
+	 * fail or succeed, but always returns the old value.
+	 */
+	pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);
+
+	return old;
+}
+#endif
+
+#ifndef PG_HAS_ATOMIC_INIT_U64
+#define PG_HAS_ATOMIC_INIT_U64
+static inline void
+pg_atomic_init_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val_)
+{
+	pg_atomic_write_u64_impl(ptr, val_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_ADD_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_SUB_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_FETCH_SUB_U64
+static inline uint64
+pg_atomic_fetch_sub_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	return pg_atomic_fetch_add_u64_impl(ptr, -sub_);
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_AND_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_FETCH_AND_U64
+static inline uint64
+pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_FETCH_OR_U64) && defined(PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAS_ATOMIC_FETCH_OR_U64
+static inline uint64
+pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_ADD_FETCH_U64) && defined(PG_HAS_ATOMIC_FETCH_ADD_U64)
+#define PG_HAS_ATOMIC_ADD_FETCH_U64
+static inline uint64
+pg_atomic_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return pg_atomic_fetch_add_u64_impl(ptr, add_) + add_;
+}
+#endif
+
+#if !defined(PG_HAS_ATOMIC_SUB_FETCH_U64) && defined(PG_HAS_ATOMIC_FETCH_SUB_U64)
+#define PG_HAS_ATOMIC_SUB_FETCH_U64
+static inline uint64
+pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index abb82c2..b36705b 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -13,158 +13,11 @@
 #ifndef BARRIER_H
 #define BARRIER_H
 
-#include "storage/s_lock.h"
-
-extern slock_t dummy_spinlock;
-
-/*
- * A compiler barrier need not (and preferably should not) emit any actual
- * machine code, but must act as an optimization fence: the compiler must not
- * reorder loads or stores to main memory around the barrier.  However, the
- * CPU may still reorder loads or stores at runtime, if the architecture's
- * memory model permits this.
- *
- * A memory barrier must act as a compiler barrier, and in addition must
- * guarantee that all loads and stores issued prior to the barrier are
- * completed before any loads or stores issued after the barrier.  Unless
- * loads and stores are totally ordered (which is not the case on most
- * architectures) this requires issuing some sort of memory fencing
- * instruction.
- *
- * A read barrier must act as a compiler barrier, and in addition must
- * guarantee that any loads issued prior to the barrier are completed before
- * any loads issued after the barrier.  Similarly, a write barrier acts
- * as a compiler barrier, and also orders stores.  Read and write barriers
- * are thus weaker than a full memory barrier, but stronger than a compiler
- * barrier.  In practice, on machines with strong memory ordering, read and
- * write barriers may require nothing more than a compiler barrier.
- *
- * For an introduction to using memory barriers within the PostgreSQL backend,
- * see src/backend/storage/lmgr/README.barrier
- */
-
-#if defined(DISABLE_BARRIERS)
-
-/*
- * Fall through to the spinlock-based implementation.
- */
-#elif defined(__INTEL_COMPILER)
-
-/*
- * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
- */
-#if defined(__ia64__) || defined(__ia64)
-#define pg_memory_barrier()		__mf()
-#elif defined(__i386__) || defined(__x86_64__)
-#define pg_memory_barrier()		_mm_mfence()
-#endif
-
-#define pg_compiler_barrier()	__memory_barrier()
-#elif defined(__GNUC__)
-
-/* This works on any architecture, since it's only talking to GCC itself. */
-#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
-
-#if defined(__i386__)
-
-/*
- * i386 does not allow loads to be reordered with other loads, or stores to be
- * reordered with other stores, but a load can be performed before a subsequent
- * store.
- *
- * "lock; addl" has worked for longer than "mfence".
- */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory", "cc")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__x86_64__)		/* 64 bit x86 */
-
 /*
- * x86_64 has similar ordering characteristics to i386.
- *
- * Technically, some x86-ish chips support uncached memory access and/or
- * special instructions that are weakly ordered.  In those cases we'd need
- * the read and write barriers to be lfence and sfence.  But since we don't
- * do those things, a compiler barrier should be enough.
- */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory", "cc")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__ia64__) || defined(__ia64)
-
-/*
- * Itanium is weakly ordered, so read and write barriers require a full
- * fence.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")
-#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
-
-/*
- * lwsync orders loads with respect to each other, and similarly with stores.
- * But a load can be performed before a subsequent store, so sync must be used
- * for a full memory barrier.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
-#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-
-#elif defined(__hppa) || defined(__hppa__)		/* HP PA-RISC */
-
-/* HPPA doesn't do either read or write reordering */
-#define pg_memory_barrier()		pg_compiler_barrier()
-#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
-
-/*
- * If we're on GCC 4.1.0 or higher, we should be able to get a memory
- * barrier out of this compiler built-in.  But we prefer to rely on our
- * own definitions where possible, and use this only as a fallback.
- */
-#define pg_memory_barrier()		__sync_synchronize()
-#endif
-#elif defined(__ia64__) || defined(__ia64)
-
-#define pg_compiler_barrier()	_Asm_sched_fence()
-#define pg_memory_barrier()		_Asm_mf()
-#elif defined(WIN32_ONLY_COMPILER)
-
-/* Should work on both MSVC and Borland. */
-#include <intrin.h>
-#pragma intrinsic(_ReadWriteBarrier)
-#define pg_compiler_barrier()	_ReadWriteBarrier()
-#define pg_memory_barrier()		MemoryBarrier()
-#endif
-
-/*
- * If we have no memory barrier implementation for this architecture, we
- * fall back to acquiring and releasing a spinlock.  This might, in turn,
- * fall back to the semaphore-based spinlock implementation, which will be
- * amazingly slow.
- *
- * It's not self-evident that every possible legal implementation of a
- * spinlock acquire-and-release would be equivalent to a full memory barrier.
- * For example, I'm not sure that Itanium's acq and rel add up to a full
- * fence.  But all of our actual implementations seem OK in this regard.
- */
-#if !defined(pg_memory_barrier)
-#define pg_memory_barrier() \
-	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
-#endif
-
-/*
- * If read or write barriers are undefined, we upgrade them to full memory
- * barriers.
- *
- * If a compiler barrier is unavailable, you probably don't want a full
- * memory barrier instead, so if you have a use case for a compiler barrier,
- * you'd better use #ifdef.
+ * This used to be a separate file, full of compiler/architecture
+ * dependent defines, but it's now included in the atomics.h
+ * infrastructure and is just kept for backward compatibility.
  */
-#if !defined(pg_read_barrier)
-#define pg_read_barrier()			pg_memory_barrier()
-#endif
-#if !defined(pg_write_barrier)
-#define pg_write_barrier()			pg_memory_barrier()
-#endif
+#include "port/atomics.h"
 
 #endif   /* BARRIER_H */
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 5f0ad7a..e7ae330 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -889,12 +889,12 @@ typedef int slock_t;
 
 extern bool s_lock_free_sema(volatile slock_t *lock);
 extern void s_unlock_sema(volatile slock_t *lock);
-extern void s_init_lock_sema(volatile slock_t *lock);
+extern void s_init_lock_sema(volatile slock_t *lock, bool nested);
 extern int	tas_sema(volatile slock_t *lock);
 
 #define S_LOCK_FREE(lock)	s_lock_free_sema(lock)
 #define S_UNLOCK(lock)	 s_unlock_sema(lock)
-#define S_INIT_LOCK(lock)	s_init_lock_sema(lock)
+#define S_INIT_LOCK(lock)	s_init_lock_sema(lock, false)
 #define TAS(lock)	tas_sema(lock)
 
 
@@ -955,6 +955,7 @@ extern int	tas(volatile slock_t *lock);		/* in port/.../tas.s, or
 #define TAS_SPIN(lock)	TAS(lock)
 #endif	 /* TAS_SPIN */
 
+extern slock_t dummy_spinlock;
 
 /*
  * Platform-independent out-of-line support routines
diff --git a/src/test/regress/expected/lock.out b/src/test/regress/expected/lock.out
index 0d7c3ba..fd27344 100644
--- a/src/test/regress/expected/lock.out
+++ b/src/test/regress/expected/lock.out
@@ -60,3 +60,11 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
+ test_atomic_ops 
+-----------------
+ t
+(1 row)
+
diff --git a/src/test/regress/input/create_function_1.source b/src/test/regress/input/create_function_1.source
index aef1518..1fded84 100644
--- a/src/test/regress/input/create_function_1.source
+++ b/src/test/regress/input/create_function_1.source
@@ -57,6 +57,11 @@ CREATE FUNCTION make_tuple_indirect (record)
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
 
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
+
 -- Things that shouldn't work:
 
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
diff --git a/src/test/regress/output/create_function_1.source b/src/test/regress/output/create_function_1.source
index 9761d12..9926c90 100644
--- a/src/test/regress/output/create_function_1.source
+++ b/src/test/regress/output/create_function_1.source
@@ -51,6 +51,10 @@ CREATE FUNCTION make_tuple_indirect (record)
         RETURNS record
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
 -- Things that shouldn't work:
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
     AS 'SELECT ''not an integer'';';
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 09e027c..2bd611d 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -18,6 +18,7 @@
 #include "executor/executor.h"
 #include "executor/spi.h"
 #include "miscadmin.h"
+#include "port/atomics.h"
 #include "utils/builtins.h"
 #include "utils/geo_decls.h"
 #include "utils/rel.h"
@@ -865,3 +866,235 @@ wait_pid(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+static void
+test_atomic_flag(void)
+{
+	pg_atomic_flag flag;
+
+	pg_atomic_init_flag(&flag);
+
+	if (!pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unexpectedly set");
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+
+	if (pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unexpectedly unset");
+
+	if (pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: set spuriously #2");
+
+	pg_atomic_clear_flag(&flag);
+
+	if (!pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unexpectedly set #2");
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+
+	pg_atomic_clear_flag(&flag);
+}
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+static void
+test_atomic_uint32(void)
+{
+	pg_atomic_uint32 var;
+	uint32 expected;
+	int i;
+
+	pg_atomic_init_u32(&var, 0);
+
+	if (pg_atomic_read_u32(&var) != 0)
+		elog(ERROR, "atomic_read_u32() #1 wrong");
+
+	pg_atomic_write_u32(&var, 3);
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "atomic_read_u32() #2 wrong");
+
+	if (pg_atomic_fetch_add_u32(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u32() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u32(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u32() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u32(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u32() #1 wrong");
+
+	if (pg_atomic_add_fetch_u32(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u32() #0 wrong");
+
+	/* test around numerical limits */
+	if (pg_atomic_fetch_add_u32(&var, INT_MAX) != 0)
+		elog(ERROR, "pg_atomic_fetch_add_u32() #2 wrong");
+
+	if (pg_atomic_fetch_add_u32(&var, INT_MAX) != INT_MAX)
+		elog(ERROR, "pg_atomic_add_fetch_u32() #3 wrong");
+
+	pg_atomic_fetch_add_u32(&var, 1); /* top up to UINT_MAX */
+
+	if (pg_atomic_read_u32(&var) != UINT_MAX)
+		elog(ERROR, "atomic_read_u32() #2 wrong");
+
+	if (pg_atomic_fetch_sub_u32(&var, INT_MAX) != UINT_MAX)
+		elog(ERROR, "pg_atomic_fetch_sub_u32() #2 wrong");
+
+	if (pg_atomic_read_u32(&var) != (uint32)INT_MAX + 1)
+		elog(ERROR, "atomic_read_u32() #3 wrong: %u", pg_atomic_read_u32(&var));
+
+	expected = pg_atomic_sub_fetch_u32(&var, INT_MAX);
+	if (expected != 1)
+		elog(ERROR, "pg_atomic_sub_fetch_u32() #3 wrong: %u", expected);
+
+	pg_atomic_sub_fetch_u32(&var, 1);
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u32(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u32() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 1000; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u32(&var, &expected, 1))
+			break;
+	}
+	if (i == 1000)
+		elog(ERROR, "atomic_compare_exchange_u32() never succeeded");
+	if (pg_atomic_read_u32(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u32() didn't set value properly");
+
+	pg_atomic_write_u32(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u32(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u32() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u32(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u32() #2 wrong");
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u32()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u32(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #1 wrong");
+
+	if (pg_atomic_fetch_and_u32(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #2 wrong: is %u",
+			 pg_atomic_read_u32(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u32(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #3 wrong");
+}
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+static void
+test_atomic_uint64(void)
+{
+	pg_atomic_uint64 var;
+	uint64 expected;
+	int i;
+
+	pg_atomic_init_u64(&var, 0);
+
+	if (pg_atomic_read_u64(&var) != 0)
+		elog(ERROR, "atomic_read_u64() #1 wrong");
+
+	pg_atomic_write_u64(&var, 3);
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "atomic_read_u64() #2 wrong");
+
+	if (pg_atomic_fetch_add_u64(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u64() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u64(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u64() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u64(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u64() #1 wrong");
+
+	if (pg_atomic_add_fetch_u64(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u64() #0 wrong");
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u64(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u64() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 100; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u64(&var, &expected, 1))
+			break;
+	}
+	if (i == 100)
+		elog(ERROR, "atomic_compare_exchange_u64() never succeeded");
+	if (pg_atomic_read_u64(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u64() didn't set value properly");
+
+	pg_atomic_write_u64(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u64(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u64() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u64(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u64() #2 wrong");
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u64()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u64(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #1 wrong");
+
+	if (pg_atomic_fetch_and_u64(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #2 wrong: is "UINT64_FORMAT,
+			 pg_atomic_read_u64(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u64(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #3 wrong");
+}
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+
+PG_FUNCTION_INFO_V1(test_atomic_ops);
+Datum
+test_atomic_ops(PG_FUNCTION_ARGS)
+{
+	/*
+	 * Can't run the test under the semaphore emulation, it doesn't handle
+	 * checking edge cases well.
+	 */
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+	test_atomic_flag();
+#endif
+
+	test_atomic_uint32();
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+	test_atomic_uint64();
+#endif
+
+	PG_RETURN_BOOL(true);
+}
diff --git a/src/test/regress/sql/lock.sql b/src/test/regress/sql/lock.sql
index dda212f..567e8bc 100644
--- a/src/test/regress/sql/lock.sql
+++ b/src/test/regress/sql/lock.sql
@@ -64,3 +64,8 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+
+
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
-- 
2.0.0.rc2.4.g1dc51c6.dirty

0002-Add-implementation-of-spinlocks-using-the-atomics.h-.patchtext/x-patch; charset=us-asciiDownload
>From 7a02b4d4ce1b744b244f7e4f0e2716f379301b39 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 18 Sep 2014 12:45:41 +0200
Subject: [PATCH 2/5] Add implementation of spinlocks using the atomics.h API.

Use the new atomics.h API to provide s_lock.h spinlock functions if no
os/architecture specific implementation exists. Eventually we might
want to replace the entirety of s_lock.h with the new API, but given
how critical it is, it seems better to do it in smaller steps.

As the s_lock.h ARM implementation is the same as the atomics
implementation, remove it.

The FORCE_ATOMICS_BASED_SPINLOCKS define can be used to force the
usage of the atomics.h based spinlocks.
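
In essence, when s_lock.h has no test-and-set of its own but atomics.h
does, the fallback just maps the spinlock macros onto the flag API,
roughly like this (a simplified sketch; the s_lock.h hunk below has the
authoritative definitions, including TAS_SPIN and SPIN_DELAY):

    typedef pg_atomic_flag slock_t;
    #define TAS(lock)           (pg_atomic_test_set_flag(lock) ? 0 : 1)
    #define S_UNLOCK(lock)      pg_atomic_clear_flag(lock)
    #define S_INIT_LOCK(lock)   pg_atomic_init_flag(lock)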
---
 src/include/pg_config_manual.h |  6 ++++++
 src/include/storage/s_lock.h   | 44 +++++++++++++++++-------------------------
 2 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 195fb26..4bc591a 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -73,6 +73,12 @@
 #define NUM_ATOMICS_SEMAPHORES		64
 
 /*
+ * Force usage of atomics.h based spinlocks even if s_lock.h has its own
+ * support for the target platform.
+ */
+#undef FORCE_ATOMICS_BASED_SPINLOCKS
+
+/*
  * Define this if you want to allow the lo_import and lo_export SQL
  * functions to be executed by ordinary users.  By default these
  * functions are only available to the Postgres superuser.  CAUTION:
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index e7ae330..f49d17f 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -98,6 +98,8 @@
 
 #ifdef HAVE_SPINLOCKS	/* skip spinlocks if requested */
 
+#if !defined(FORCE_ATOMICS_BASED_SPINLOCKS)
+
 #if defined(__GNUC__) || defined(__INTEL_COMPILER)
 /*************************************************************************
  * All the gcc inlines
@@ -302,32 +304,6 @@ tas(volatile slock_t *lock)
 #endif /* __INTEL_COMPILER */
 #endif	 /* __ia64__ || __ia64 */
 
-/*
- * On ARM and ARM64, we use __sync_lock_test_and_set(int *, int) if available.
- *
- * We use the int-width variant of the builtin because it works on more chips
- * than other widths.
- */
-#if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC_INT_ATOMICS
-#define HAS_TEST_AND_SET
-
-#define TAS(lock) tas(lock)
-
-typedef int slock_t;
-
-static __inline__ int
-tas(volatile slock_t *lock)
-{
-	return __sync_lock_test_and_set(lock, 1);
-}
-
-#define S_UNLOCK(lock) __sync_lock_release(lock)
-
-#endif	 /* HAVE_GCC_INT_ATOMICS */
-#endif	 /* __arm__ || __arm || __aarch64__ || __aarch64 */
-
-
 /* S/390 and S/390x Linux (32- and 64-bit zSeries) */
 #if defined(__s390__) || defined(__s390x__)
 #define HAS_TEST_AND_SET
@@ -870,6 +846,22 @@ spin_delay(void)
 
 #endif	/* !defined(HAS_TEST_AND_SET) */
 
+#endif /* !defined(FORCE_ATOMICS_BASED_SPINLOCKS) */
+
+
+#if !defined(HAS_TEST_AND_SET) && defined(HAVE_ATOMICS)
+#include "port/atomics.h"
+
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+#define HAS_TEST_AND_SET
+typedef pg_atomic_flag slock_t;
+#define TAS(lock)			(pg_atomic_test_set_flag(lock) ? 0 : 1)
+#define TAS_SPIN(lock)		(pg_atomic_unlocked_test_flag(lock) ? TAS(lock) : 1)
+#define S_UNLOCK(lock)		pg_atomic_clear_flag(lock)
+#define S_INIT_LOCK(lock)	pg_atomic_init_flag(lock)
+#define SPIN_DELAY()		pg_spin_delay()
+#endif /* !PG_HAVE_ATOMIC_FLAG_SIMULATION */
+#endif /* !defined(HAS_TEST_AND_SET) && defined(HAVE_ATOMICS) */
 
 /* Blow up if we didn't have any way to do spinlocks */
 #ifndef HAS_TEST_AND_SET
-- 
2.0.0.rc2.4.g1dc51c6.dirty

0003-XXX-Hack-ShmemAlloc-to-align-all-allocations-to-cach.patch
>From 304d6ff56f53e586e161dde9d62789facce74853 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 4 Feb 2014 13:34:34 +0100
Subject: [PATCH 3/5] XXX: Hack ShmemAlloc() to align all allocations to
 cacheline boundaries

---
 src/backend/storage/ipc/shmem.c | 4 ++--
 src/include/pg_config_manual.h  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 2ea2216..193b6dc 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -184,8 +184,8 @@ ShmemAlloc(Size size)
 
 	newStart = shmemseghdr->freeoffset;
 
-	/* extra alignment for large requests, since they are probably buffers */
-	if (size >= BLCKSZ)
+	/* align all requests to cacheline boundaries */
+	if (size >= BUFSIZ)
 		newStart = BUFFERALIGN(newStart);
 
 	newFree = newStart + size;
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 4bc591a..6576956 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -123,7 +123,7 @@
  * is aligned on a larger-than-MAXALIGN boundary.  Ideally this should be
  * a platform-dependent value, but for now we just hard-wire it.
  */
-#define ALIGNOF_BUFFER	32
+#define ALIGNOF_BUFFER	64
 
 /*
  * Disable UNIX sockets for certain operating systems.
-- 
2.0.0.rc2.4.g1dc51c6.dirty

0004-Convert-the-PGPROC-lwWaitLink-list-into-a-dlist-inst.patch
>From 5e456f71e5e83b6ccb8ad471b6327592242f0cd4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 31 Jan 2014 10:53:19 +0100
Subject: [PATCH 4/5] Convert the PGPROC->lwWaitLink list into a dlist instead
 of open coding it.

Besides being shorter and much easier to read, it:
* changes the logic in LWLockRelease() to release all shared lockers
  when waking up any.
* adds a pg_write_barrier() in the wakeup paths between dequeuing a
  waiter and unsetting ->lwWaiting (see the sketch below). On a weakly
  ordered machine the previous coding could lead to problems.
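
Roughly, each wakeup path then does the following (simplified from the
lwlock.c hunks below):

    dlist_delete(&waiter->lwWaitLink);
    pg_write_barrier();     /* dequeue must be visible before clearing lwWaiting */
    waiter->lwWaiting = false;
    PGSemaphoreUnlock(&waiter->sem);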
---
 src/backend/access/transam/twophase.c |   1 -
 src/backend/storage/lmgr/lwlock.c     | 151 +++++++++++++---------------------
 src/backend/storage/lmgr/proc.c       |   2 -
 src/include/storage/lwlock.h          |   5 +-
 src/include/storage/proc.h            |   3 +-
 5 files changed, 60 insertions(+), 102 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d5409a6..6401943 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -389,7 +389,6 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 	proc->roleId = owner;
 	proc->lwWaiting = false;
 	proc->lwWaitMode = 0;
-	proc->lwWaitLink = NULL;
 	proc->waitLock = NULL;
 	proc->waitProcLock = NULL;
 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 7c96da5..055174d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -35,6 +35,7 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "replication/slot.h"
+#include "storage/barrier.h"
 #include "storage/ipc.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -115,9 +116,9 @@ inline static void
 PRINT_LWDEBUG(const char *where, const volatile LWLock *lock)
 {
 	if (Trace_lwlocks)
-		elog(LOG, "%s(%s %d): excl %d shared %d head %p rOK %d",
+		elog(LOG, "%s(%s %d): excl %d shared %d rOK %d",
 			 where, T_NAME(lock), T_ID(lock),
-			 (int) lock->exclusive, lock->shared, lock->head,
+			 (int) lock->exclusive, lock->shared,
 			 (int) lock->releaseOK);
 }
 
@@ -479,8 +480,7 @@ LWLockInitialize(LWLock *lock, int tranche_id)
 	lock->exclusive = 0;
 	lock->shared = 0;
 	lock->tranche = tranche_id;
-	lock->head = NULL;
-	lock->tail = NULL;
+	dlist_init(&lock->waiters);
 }
 
 
@@ -620,12 +620,7 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 
 		proc->lwWaiting = true;
 		proc->lwWaitMode = mode;
-		proc->lwWaitLink = NULL;
-		if (lock->head == NULL)
-			lock->head = proc;
-		else
-			lock->tail->lwWaitLink = proc;
-		lock->tail = proc;
+		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
 
 		/* Can release the mutex now */
 		SpinLockRelease(&lock->mutex);
@@ -841,12 +836,7 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 
 		proc->lwWaiting = true;
 		proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
-		proc->lwWaitLink = NULL;
-		if (lock->head == NULL)
-			lock->head = proc;
-		else
-			lock->tail->lwWaitLink = proc;
-		lock->tail = proc;
+		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
 
 		/* Can release the mutex now */
 		SpinLockRelease(&lock->mutex);
@@ -1000,13 +990,8 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		 */
 		proc->lwWaiting = true;
 		proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
-		proc->lwWaitLink = NULL;
-
 		/* waiters are added to the front of the queue */
-		proc->lwWaitLink = lock->head;
-		if (lock->head == NULL)
-			lock->tail = proc;
-		lock->head = proc;
+		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
 
 		/* Can release the mutex now */
 		SpinLockRelease(&lock->mutex);
@@ -1082,9 +1067,10 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
 {
 	volatile LWLock *lock = l;
 	volatile uint64 *valp = valptr;
-	PGPROC	   *head;
-	PGPROC	   *proc;
-	PGPROC	   *next;
+	dlist_head	wakeup;
+	dlist_mutable_iter iter;
+
+	dlist_init(&wakeup);
 
 	/* Acquire mutex.  Time spent holding mutex should be short! */
 	SpinLockAcquire(&lock->mutex);
@@ -1099,24 +1085,16 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
 	 * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
 	 * up. They are always in the front of the queue.
 	 */
-	head = lock->head;
-
-	if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
+	dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
 	{
-		proc = head;
-		next = proc->lwWaitLink;
-		while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
-		{
-			proc = next;
-			next = next->lwWaitLink;
-		}
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
 
-		/* proc is now the last PGPROC to be released */
-		lock->head = next;
-		proc->lwWaitLink = NULL;
+		if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
+			break;
+
+		dlist_delete(&waiter->lwWaitLink);
+		dlist_push_tail(&wakeup, &waiter->lwWaitLink);
 	}
-	else
-		head = NULL;
 
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
@@ -1124,13 +1102,13 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
-	while (head != NULL)
+	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
 	{
-		proc = head;
-		head = proc->lwWaitLink;
-		proc->lwWaitLink = NULL;
-		proc->lwWaiting = false;
-		PGSemaphoreUnlock(&proc->sem);
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
+		dlist_delete(&waiter->lwWaitLink);
+		pg_write_barrier();
+		waiter->lwWaiting = false;
+		PGSemaphoreUnlock(&waiter->sem);
 	}
 }
 
@@ -1142,10 +1120,12 @@ void
 LWLockRelease(LWLock *l)
 {
 	volatile LWLock *lock = l;
-	PGPROC	   *head;
-	PGPROC	   *proc;
+	dlist_head	wakeup;
+	dlist_mutable_iter iter;
 	int			i;
 
+	dlist_init(&wakeup);
+
 	PRINT_LWDEBUG("LWLockRelease", lock);
 
 	/*
@@ -1181,58 +1161,39 @@ LWLockRelease(LWLock *l)
 	 * if someone has already awakened waiters that haven't yet acquired the
 	 * lock.
 	 */
-	head = lock->head;
-	if (head != NULL)
+	if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
 	{
-		if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
+		/*
+		 * Remove the to-be-awakened PGPROCs from the queue.
+		 */
+		bool		releaseOK = true;
+		bool		wokeup_somebody = false;
+
+		dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
 		{
-			/*
-			 * Remove the to-be-awakened PGPROCs from the queue.
-			 */
-			bool		releaseOK = true;
+			PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
 
-			proc = head;
+			if (wokeup_somebody && waiter->lwWaitMode == LW_EXCLUSIVE)
+				continue;
 
-			/*
-			 * First wake up any backends that want to be woken up without
-			 * acquiring the lock.
-			 */
-			while (proc->lwWaitMode == LW_WAIT_UNTIL_FREE && proc->lwWaitLink)
-				proc = proc->lwWaitLink;
+			dlist_delete(&waiter->lwWaitLink);
+			dlist_push_tail(&wakeup, &waiter->lwWaitLink);
 
 			/*
-			 * If the front waiter wants exclusive lock, awaken him only.
-			 * Otherwise awaken as many waiters as want shared access.
+			 * Prevent additional wakeups until retryer gets to
+			 * run. Backends that are just waiting for the lock to become
+			 * free don't retry automatically.
 			 */
-			if (proc->lwWaitMode != LW_EXCLUSIVE)
+			if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
 			{
-				while (proc->lwWaitLink != NULL &&
-					   proc->lwWaitLink->lwWaitMode != LW_EXCLUSIVE)
-				{
-					if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
-						releaseOK = false;
-					proc = proc->lwWaitLink;
-				}
-			}
-			/* proc is now the last PGPROC to be released */
-			lock->head = proc->lwWaitLink;
-			proc->lwWaitLink = NULL;
-
-			/*
-			 * Prevent additional wakeups until retryer gets to run. Backends
-			 * that are just waiting for the lock to become free don't retry
-			 * automatically.
-			 */
-			if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
 				releaseOK = false;
+				wokeup_somebody = true;
+			}
 
-			lock->releaseOK = releaseOK;
-		}
-		else
-		{
-			/* lock is still held, can't awaken anything */
-			head = NULL;
+			if(waiter->lwWaitMode == LW_EXCLUSIVE)
+				break;
 		}
+		lock->releaseOK = releaseOK;
 	}
 
 	/* We are done updating shared state of the lock itself. */
@@ -1243,14 +1204,14 @@ LWLockRelease(LWLock *l)
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
-	while (head != NULL)
+	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
 	{
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
 		LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
-		proc = head;
-		head = proc->lwWaitLink;
-		proc->lwWaitLink = NULL;
-		proc->lwWaiting = false;
-		PGSemaphoreUnlock(&proc->sem);
+		dlist_delete(&waiter->lwWaitLink);
+		pg_write_barrier();
+		waiter->lwWaiting = false;
+		PGSemaphoreUnlock(&waiter->sem);
 	}
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ea88a24..a4789fc 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -372,7 +372,6 @@ InitProcess(void)
 		MyPgXact->vacuumFlags |= PROC_IS_AUTOVACUUM;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
-	MyProc->lwWaitLink = NULL;
 	MyProc->waitLock = NULL;
 	MyProc->waitProcLock = NULL;
 #ifdef USE_ASSERT_CHECKING
@@ -535,7 +534,6 @@ InitAuxiliaryProcess(void)
 	MyPgXact->vacuumFlags = 0;
 	MyProc->lwWaiting = false;
 	MyProc->lwWaitMode = 0;
-	MyProc->lwWaitLink = NULL;
 	MyProc->waitLock = NULL;
 	MyProc->waitProcLock = NULL;
 #ifdef USE_ASSERT_CHECKING
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..1bca57b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -14,6 +14,7 @@
 #ifndef LWLOCK_H
 #define LWLOCK_H
 
+#include "lib/ilist.h"
 #include "storage/s_lock.h"
 
 struct PGPROC;
@@ -50,9 +51,7 @@ typedef struct LWLock
 	char		exclusive;		/* # of exclusive holders (0 or 1) */
 	int			shared;			/* # of shared holders (0..MaxBackends) */
 	int			tranche;		/* tranche ID */
-	struct PGPROC *head;		/* head of list of waiting PGPROCs */
-	struct PGPROC *tail;		/* tail of list of waiting PGPROCs */
-	/* tail is undefined when head is NULL */
+	dlist_head	waiters;		/* list of waiting PGPROCs */
 } LWLock;
 
 /*
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..31f1b04 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -15,6 +15,7 @@
 #define _PROC_H_
 
 #include "access/xlogdefs.h"
+#include "lib/ilist.h"
 #include "storage/latch.h"
 #include "storage/lock.h"
 #include "storage/pg_sema.h"
@@ -104,7 +105,7 @@ struct PGPROC
 	/* Info about LWLock the process is currently waiting for, if any. */
 	bool		lwWaiting;		/* true if waiting for an LW lock */
 	uint8		lwWaitMode;		/* lwlock mode being waited for */
-	struct PGPROC *lwWaitLink;	/* next waiter for same LW lock */
+	dlist_node	lwWaitLink;		/* next waiter for same LW lock */
 
 	/* Info about lock the process is currently waiting for, if any. */
 	/* waitLock and waitProcLock are NULL if not currently waiting. */
-- 
2.0.0.rc2.4.g1dc51c6.dirty

0005-Wait-free-LW_SHARED-lwlock-acquiration.patch (text/x-patch; charset=us-ascii)
>From cd4b4494f17cb8e11dc1f8e2055ae6e2ba406040 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 18 Sep 2014 16:14:16 +0200
Subject: [PATCH 5/5] Wait-free LW_SHARED lwlock acquisition

---
 src/backend/storage/lmgr/lwlock.c | 896 +++++++++++++++++++++++++++-----------
 src/include/storage/lwlock.h      |   9 +-
 src/include/storage/s_lock.h      |   2 +
 3 files changed, 654 insertions(+), 253 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 055174d..d3d0f73 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -24,6 +24,82 @@
  * IDENTIFICATION
  *	  src/backend/storage/lmgr/lwlock.c
  *
+ * NOTES:
+ *
+ * This used to be a pretty straightforward reader-writer lock
+ * implementation, in which the internal state was protected by a
+ * spinlock. Unfortunately the overhead of taking the spinlock proved to be
+ * too high for workloads/locks that were locked in shared mode very
+ * frequently.
+ * Thus a new implementation was devised that provides wait-free shared lock
+ * acquisition for locks that aren't exclusively locked.
+ *
+ * The basic idea is to have a single atomic variable 'lockcount' instead of
+ * the formerly separate shared and exclusive counters and to use an atomic
+ * increment to acquire the lock. That's fairly easy to do for rw-spinlocks,
+ * but a lot harder for something like LWLocks that want to wait in the OS.
+ *
+ * For exclusive lock acquisition we use an atomic compare-and-exchange on the
+ * lockcount variable, swapping in EXCLUSIVE_LOCK (1 << 30, i.e. 0x40000000)
+ * if and only if the current value of lockcount is 0. If the swap was not
+ * successful, we have to wait.
+ *
+ * For shared lock acquisition we use an atomic add (lock xadd) to the
+ * lockcount variable to add 1. If the value we read back is at least
+ * EXCLUSIVE_LOCK we know that somebody actually has an exclusive lock, and we
+ * back out by atomically decrementing by 1 again. If so, we have to wait for
+ * the exclusive locker to release the lock.
+ *
+ * To release the lock we use an atomic decrement. If the
+ * new value is zero (we get that atomically), we know we have to release
+ * waiters.
+ *
+ * The attentive reader might have noticed that naively doing the
+ * above has two glaring race conditions:
+ *
+ * 1) too-quick-for-queueing: We try to lock using the atomic operations and
+ * notice that we have to wait. Unfortunately until we have finished queuing,
+ * the former locker very well might have already finished its work. That's
+ * problematic because we're now stuck waiting inside the OS.
+ *
+ * 2) spurious failed locks: Due to the logic of backing out of shared
+ * locks after we unconditionally added a 1 to lockcount, we might have
+ * prevented another exclusive locker from getting the lock:
+ *   1) Session A: LWLockAcquire(LW_EXCLUSIVE) - success
+ *   2) Session B: LWLockAcquire(LW_SHARED) - lockcount += 1
+ *   3) Session B: LWLockAcquire(LW_SHARED) - oops, bigger than EXCLUSIVE_LOCK
+ *   4) Session A: LWLockRelease()
+ *   5) Session C: LWLockAcquire(LW_EXCLUSIVE) - check if lockcount = 0, no. wait.
+ *   6) Session B: LWLockAcquire(LW_SHARED) - lockcount -= 1
+ *   7) Session B: LWLockAcquire(LW_SHARED) - wait
+ *
+ * So we'd now have both B) and C) waiting on a lock that nobody is holding
+ * anymore. Not good.
+ *
+ * To mitigate those races we use a phased attempt at locking:
+ *   Phase 1: Try to do it atomically; if we succeed, nice
+ *   Phase 2: Add ourselves to the waitqueue of the lock
+ *   Phase 3: Try to grab the lock again; if we succeed, remove ourselves from
+ *            the queue
+ *   Phase 4: Sleep till wakeup, goto Phase 1
+ *
+ * This protects us against both problems from above:
+ * 1) Nobody can release too quickly, before we're queued, since after Phase 2
+ *    we're already queued.
+ * 2) If somebody spuriously got blocked from acquiring the lock, they will
+ *    get queued in Phase 2 and we can wake them up if necessary, or they will
+ *    have gotten the lock in Phase 3.
+ *
+ * The above algorithm only works for LWLockAcquire, not directly for
+ * LWLockConditionalAcquire where we don't want to wait. In that case we just
+ * need to retry acquiring the lock until we're sure we didn't disturb anybody
+ * in doing so.
+ *
+ * TODO:
+ * - decide if we need a spinlock fallback
+ * - expand documentation
+ * - make LWLOCK_STATS do something sensible again
+ * - make LOCK_DEBUG output nicer
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -35,7 +111,6 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "replication/slot.h"
-#include "storage/barrier.h"
 #include "storage/ipc.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -50,6 +125,10 @@
 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
 
+#define EXCLUSIVE_LOCK (((uint32) 1) << (31 - 1))
+/* must be greater than MAX_BACKENDS */
+#define SHARED_LOCK_MASK (~EXCLUSIVE_LOCK)
+
 /*
  * This is indexed by tranche ID and stores metadata for all tranches known
  * to the current backend.
@@ -80,8 +159,14 @@ static LWLockTranche MainLWLockTranche;
  */
 #define MAX_SIMUL_LWLOCKS	100
 
+typedef struct LWLockHandle
+{
+	LWLock *lock;
+	LWLockMode	mode;
+} LWLockHandle;
+
 static int	num_held_lwlocks = 0;
-static LWLock *held_lwlocks[MAX_SIMUL_LWLOCKS];
+static LWLockHandle held_lwlocks[MAX_SIMUL_LWLOCKS];
 
 static int	lock_addin_request = 0;
 static bool lock_addin_request_allowed = true;
@@ -100,8 +185,11 @@ typedef struct lwlock_stats
 {
 	lwlock_stats_key key;
 	int			sh_acquire_count;
+	int			sh_attempt_backout;
 	int			ex_acquire_count;
+	int			ex_race;
 	int			block_count;
+	int			dequeue_self_count;
 	int			spin_delay_count;
 }	lwlock_stats;
 
@@ -113,23 +201,30 @@ static lwlock_stats lwlock_stats_dummy;
 bool		Trace_lwlocks = false;
 
 inline static void
-PRINT_LWDEBUG(const char *where, const volatile LWLock *lock)
+PRINT_LWDEBUG(const char *where, volatile LWLock *lock, LWLockMode mode)
 {
 	if (Trace_lwlocks)
-		elog(LOG, "%s(%s %d): excl %d shared %d rOK %d",
+	{
+		uint32 lockcount = pg_atomic_read_u32(&lock->lockcount);
+
+		elog(LOG, "%d: %s(%s %d): excl %u shared %u waiters %u rOK %d\n",
+			 MyProcPid,
 			 where, T_NAME(lock), T_ID(lock),
-			 (int) lock->exclusive, lock->shared,
+			 lockcount >= EXCLUSIVE_LOCK,
+			 lockcount & SHARED_LOCK_MASK,
+			 pg_atomic_read_u32(&lock->nwaiters),
 			 (int) lock->releaseOK);
+	}
 }
 
 inline static void
 LOG_LWDEBUG(const char *where, const char *name, int index, const char *msg)
 {
 	if (Trace_lwlocks)
-		elog(LOG, "%s(%s %d): %s", where, name, index, msg);
+		elog(LOG, "%d: %s(%s %d): %s\n", MyProcPid, where, name, index, msg);
 }
 #else							/* not LOCK_DEBUG */
-#define PRINT_LWDEBUG(a,b)
+#define PRINT_LWDEBUG(a,b,c)
 #define LOG_LWDEBUG(a,b,c,d)
 #endif   /* LOCK_DEBUG */
 
@@ -192,11 +287,12 @@ print_lwlock_stats(int code, Datum arg)
 	while ((lwstats = (lwlock_stats *) hash_seq_search(&scan)) != NULL)
 	{
 		fprintf(stderr,
-			  "PID %d lwlock %s %d: shacq %u exacq %u blk %u spindelay %u\n",
+				"PID %d lwlock %s %d: shacq %u exacq %u blk %u spindelay %u, backout %u, ex race %u, dequeue self %u\n",
 				MyProcPid, LWLockTrancheArray[lwstats->key.tranche]->name,
 				lwstats->key.instance, lwstats->sh_acquire_count,
 				lwstats->ex_acquire_count, lwstats->block_count,
-				lwstats->spin_delay_count);
+				lwstats->spin_delay_count, lwstats->sh_attempt_backout,
+				lwstats->ex_race, lwstats->dequeue_self_count);
 	}
 
 	LWLockRelease(&MainLWLockArray[0].lock);
@@ -224,8 +320,11 @@ get_lwlock_stats_entry(LWLock *lock)
 	if (!found)
 	{
 		lwstats->sh_acquire_count = 0;
+		lwstats->sh_attempt_backout = 0;
 		lwstats->ex_acquire_count = 0;
+		lwstats->ex_race = 0;
 		lwstats->block_count = 0;
+		lwstats->dequeue_self_count = 0;
 		lwstats->spin_delay_count = 0;
 	}
 	return lwstats;
@@ -477,12 +576,302 @@ LWLockInitialize(LWLock *lock, int tranche_id)
 {
 	SpinLockInit(&lock->mutex);
 	lock->releaseOK = true;
-	lock->exclusive = 0;
-	lock->shared = 0;
+	pg_atomic_init_u32(&lock->lockcount, 0);
+	pg_atomic_init_u32(&lock->nwaiters, 0);
 	lock->tranche = tranche_id;
 	dlist_init(&lock->waiters);
 }
 
+/*
+ * Internal function handling the atomic manipulation of lock->lockcount.
+ *
+ * 'double_check' = true means that we try to check more carefully
+ * against causing somebody else to spuriously believe the lock is
+ * already taken, although we're just about to back out of it.
+ */
+static inline bool
+LWLockAttemptLock(LWLock* l, LWLockMode mode, bool double_check, bool *potentially_spurious)
+{
+	bool		mustwait;
+	uint32		oldstate;
+#ifdef LWLOCK_STATS
+	lwlock_stats *lwstats;
+	lwstats = get_lwlock_stats_entry(l);
+#endif
+
+	Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+
+	*potentially_spurious = false;
+
+	if (mode == LW_EXCLUSIVE)
+	{
+		uint32 expected;
+
+		/* check without CAS first; it's way cheaper, frequently locked otherwise */
+		expected = pg_atomic_read_u32(&l->lockcount);
+
+		Assert(expected < EXCLUSIVE_LOCK + (1 << 16));
+
+		if (expected != 0)
+			mustwait = true;
+		else if (!pg_atomic_compare_exchange_u32(&l->lockcount,
+												 &expected, EXCLUSIVE_LOCK))
+		{
+			/*
+			 * ok, no can do. Between the pg_atomic_read() above and the
+			 * CAS somebody else acquired the lock.
+			 */
+			mustwait = true;
+			Assert(expected < EXCLUSIVE_LOCK + (1 << 16));
+		}
+		else
+		{
+			/* yipeyyahee */
+			mustwait = false;
+#ifdef LOCK_DEBUG
+			l->owner = MyProc;
+#endif
+			Assert(expected == 0);
+		}
+	}
+	else
+	{
+		/*
+		 * If requested by caller, do an unlocked check first.  This is useful
+		 * if potentially spurious results have a noticeable cost.
+		 */
+		if (double_check)
+		{
+			if (pg_atomic_read_u32(&l->lockcount) >= EXCLUSIVE_LOCK)
+			{
+				mustwait = true;
+				goto out;
+			}
+		}
+
+		/*
+		 * Acquire the share lock unconditionally using an atomic addition. We
+		 * might have to back out again if it turns out somebody else has an
+		 * exclusive lock.
+		 */
+		oldstate = pg_atomic_fetch_add_u32(&l->lockcount, 1);
+
+		if (oldstate >= EXCLUSIVE_LOCK)
+		{
+			/*
+			 * Ok, somebody else holds the lock exclusively. We need to back
+			 * away from the shared lock, since we don't actually hold it right
+			 * now.  Since there's a window between lockcount += 1 and lockcount
+			 * -= 1, the previous exclusive locker could have released and
+			 * another exclusive locker could have seen our +1. We need to
+			 * signal that to the upper layers so they can deal with the race
+			 * condition.
+			 */
+
+			/*
+			 * FIXME: check return value if (double_check), it's not
+			 * spurious if still exclusively locked.
+			 */
+			pg_atomic_fetch_sub_u32(&l->lockcount, 1);
+
+
+			mustwait = true;
+			*potentially_spurious = true;
+#ifdef LWLOCK_STATS
+			lwstats->sh_attempt_backout++;
+#endif
+		}
+		else
+		{
+			/* yipeyyahee */
+			mustwait = false;
+		}
+	}
+
+out:
+	return mustwait;
+}
+
+/*
+ * Wakeup all the lockers that currently have a chance to run.
+ */
+static void
+LWLockWakeup(LWLock *l, LWLockMode mode)
+{
+	volatile LWLock *lock = l;
+	bool		releaseOK;
+	bool		wokeup_somebody = false;
+	dlist_head	wakeup;
+	dlist_mutable_iter iter;
+#ifdef LWLOCK_STATS
+	lwlock_stats   *lwstats;
+	lwstats = get_lwlock_stats_entry(l);
+#endif
+
+	dlist_init(&wakeup);
+
+	/* remove the to-be-awakened PGPROCs from the queue */
+	releaseOK = true;
+
+	/* Acquire mutex.  Time spent holding mutex should be short! */
+#ifdef LWLOCK_STATS
+	lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
+#else
+	SpinLockAcquire(&lock->mutex);
+#endif
+
+	/*
+	 * We're still waiting for backends to get scheduled, don't wake them up
+	 * again.
+	 */
+	if (!lock->releaseOK)
+	{
+		SpinLockRelease(&lock->mutex);
+		PRINT_LWDEBUG("LWLockRelease skip releaseok", lock, mode);
+		return;
+	}
+
+	dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
+	{
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
+
+		if (wokeup_somebody && waiter->lwWaitMode == LW_EXCLUSIVE)
+			continue;
+
+		dlist_delete(&waiter->lwWaitLink);
+		dlist_push_tail(&wakeup, &waiter->lwWaitLink);
+
+		if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
+		{
+			/*
+			 * Prevent additional wakeups until retryer gets to run. Backends
+			 * that are just waiting for the lock to become free don't retry
+			 * automatically.
+			 */
+			releaseOK = false;
+			/*
+			 * Don't wakeup (further) exclusive locks.
+			 */
+			wokeup_somebody = true;
+		}
+
+		/*
+		 * Once we've woken up an exclusive lock, there's no point in waking
+		 * up anybody else.
+		 */
+		if(waiter->lwWaitMode == LW_EXCLUSIVE)
+			break;
+	}
+	lock->releaseOK = releaseOK;
+
+
+	/* We are done updating shared state of the lock queue. */
+	SpinLockRelease(&lock->mutex);
+
+	/*
+	 * Awaken any waiters I removed from the queue.
+	 */
+	dlist_foreach_modify(iter, &wakeup)
+	{
+		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
+
+		LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l),  "release waiter");
+		dlist_delete(&waiter->lwWaitLink);
+		pg_write_barrier();
+		waiter->lwWaiting = false;
+		PGSemaphoreUnlock(&waiter->sem);
+	}
+}
+
+/*
+ * Add ourselves to the end of the queue. Mode can be LW_WAIT_UNTIL_FREE here!
+ */
+static inline void
+LWLockQueueSelf(LWLock *l, LWLockMode mode)
+{
+	volatile LWLock *lock = l;
+#ifdef LWLOCK_STATS
+	lwlock_stats   *lwstats;
+	lwstats = get_lwlock_stats_entry(l);
+#endif
+
+	/*
+	 * If we don't have a PGPROC structure, there's no way to wait. This
+	 * should never occur, since MyProc should only be null during shared
+	 * memory initialization.
+	 */
+	if (MyProc == NULL)
+		elog(PANIC, "cannot wait without a PGPROC structure");
+
+	pg_atomic_fetch_add_u32(&lock->nwaiters, 1);
+
+#ifdef LWLOCK_STATS
+	lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
+#else
+	SpinLockAcquire(&lock->mutex);
+#endif
+
+	if (MyProc->lwWaiting)
+		elog(PANIC, "queueing for lock while waiting on another one");
+
+	MyProc->lwWaiting = true;
+	MyProc->lwWaitMode = mode;
+
+	/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
+	if (mode == LW_WAIT_UNTIL_FREE)
+		dlist_push_head((dlist_head *) &lock->waiters, &MyProc->lwWaitLink);
+	else
+		dlist_push_tail((dlist_head *) &lock->waiters, &MyProc->lwWaitLink);
+
+	/* Can release the mutex now */
+	SpinLockRelease(&lock->mutex);
+}
+
+/*
+ * Remove ourselves from the waitlist.  This is used if we queued ourselves
+ * because we thought we needed to sleep but, after further checking, we
+ * discovered that we don't actually need to do so. Somebody else might have
+ * already woken us up though, in that case return false.
+ */
+static inline bool
+LWLockDequeueSelf(LWLock *l)
+{
+	volatile LWLock *lock = l;
+	bool	found = false;
+	dlist_mutable_iter iter;
+
+#ifdef LWLOCK_STATS
+	lwlock_stats *lwstats;
+	lwstats = get_lwlock_stats_entry(l);
+#endif
+
+#ifdef LWLOCK_STATS
+	lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
+#else
+	SpinLockAcquire(&lock->mutex);
+#endif
+
+	/* need to iterate, somebody else could have unqueued us */
+	dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
+	{
+		PGPROC *proc = dlist_container(PGPROC, lwWaitLink, iter.cur);
+		if (proc == MyProc)
+		{
+			found = true;
+			dlist_delete(&proc->lwWaitLink);
+			break;
+		}
+	}
+
+	/* clear waiting state again, nice for debugging */
+	if (found)
+		MyProc->lwWaiting = false;
+
+	SpinLockRelease(&lock->mutex);
+
+	pg_atomic_fetch_sub_u32(&lock->nwaiters, 1);
+	return found;
+}
 
 /*
  * LWLockAcquire - acquire a lightweight lock in the specified mode
@@ -515,14 +904,17 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 {
 	volatile LWLock *lock = l;
 	PGPROC	   *proc = MyProc;
-	bool		retry = false;
 	bool		result = true;
 	int			extraWaits = 0;
+	bool		potentially_spurious;
+
 #ifdef LWLOCK_STATS
 	lwlock_stats *lwstats;
 #endif
 
-	PRINT_LWDEBUG("LWLockAcquire", lock);
+	AssertArg(mode == LW_SHARED || mode == LW_EXCLUSIVE);
+
+	PRINT_LWDEBUG("LWLockAcquire", lock, mode);
 
 #ifdef LWLOCK_STATS
 	lwstats = get_lwlock_stats_entry(l);
@@ -572,58 +964,77 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	{
 		bool		mustwait;
 
-		/* Acquire mutex.  Time spent holding mutex should be short! */
-#ifdef LWLOCK_STATS
-		lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
-#else
-		SpinLockAcquire(&lock->mutex);
-#endif
+		/*
+		 * try to grab the lock the first time, we're not in the waitqueue yet.
+		 */
+		mustwait = LWLockAttemptLock(l, mode, false, &potentially_spurious);
+
+		if (!mustwait)
+		{
+			LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "success");
+			break;				/* got the lock */
+		}
+
+		/*
+		 * Ok, at this point we couldn't grab the lock on the first try. We
+		 * cannot simply queue ourselves to the end of the list and wait to be
+		 * woken up because by now the lock could long have been released.
+		 * Instead add us to the queue and try to grab the lock again. If we
+		 * succeed we need to revert the queuing and be happy, otherwise we
+		 * recheck the lock. If we still couldn't grab it, we know that the
+		 * other lock will see our queue entries when releasing since they
+		 * existed before we checked for the lock.
+		 * FIXME: add note referring to overall notes
+		 */
 
-		/* If retrying, allow LWLockRelease to release waiters again */
-		if (retry)
-			lock->releaseOK = true;
+		/* add to the queue */
+		LWLockQueueSelf(l, mode);
 
-		/* If I can get the lock, do so quickly. */
-		if (mode == LW_EXCLUSIVE)
+		/* we're now guaranteed to be woken up if necessary */
+		mustwait = LWLockAttemptLock(l, mode, false, &potentially_spurious);
+
+		/* ok, grabbed the lock the second time round, need to undo queueing */
+		if (!mustwait)
 		{
-			if (lock->exclusive == 0 && lock->shared == 0)
+#ifdef LWLOCK_STATS
+			lwstats->dequeue_self_count++;
+#endif
+			if (!LWLockDequeueSelf(l))
 			{
-				lock->exclusive++;
-				mustwait = false;
+				/*
+				 * Somebody else dequeued us and has or will wake us up. Wait
+				 * for the correct wakeup, otherwise our ->lwWaiting would get
+				 * reset at some inconvenient point later, and releaseOk
+				 * wouldn't be managed correctly.
+				 */
+				for (;;)
+				{
+					PGSemaphoreLock(&proc->sem, false);
+					if (!proc->lwWaiting)
+						break;
+					extraWaits++;
+				}
+				/*
+				 * Reset releaseOk - if somebody woke us they'll have set it
+				 * to false.
+				 */
+				SpinLockAcquire(&lock->mutex);
+				lock->releaseOK = true;
+				SpinLockRelease(&lock->mutex);
 			}
-			else
-				mustwait = true;
+			PRINT_LWDEBUG("LWLockAcquire success: undo queue", lock, mode);
+			break;
 		}
 		else
 		{
-			if (lock->exclusive == 0)
-			{
-				lock->shared++;
-				mustwait = false;
-			}
-			else
-				mustwait = true;
+			PRINT_LWDEBUG("LWLockAcquire waiting 4", lock, mode);
 		}
 
-		if (!mustwait)
-			break;				/* got the lock */
-
 		/*
-		 * Add myself to wait queue.
-		 *
-		 * If we don't have a PGPROC structure, there's no way to wait. This
-		 * should never occur, since MyProc should only be null during shared
-		 * memory initialization.
+		 * NB: There's no need to deal with spurious lock attempts
+		 * here. Anyone we prevented from acquiring the lock will
+		 * enqueue themselves using the same protocol we used here.
 		 */
-		if (proc == NULL)
-			elog(PANIC, "cannot wait without a PGPROC structure");
-
-		proc->lwWaiting = true;
-		proc->lwWaitMode = mode;
-		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
-
-		/* Can release the mutex now */
-		SpinLockRelease(&lock->mutex);
 
 		/*
 		 * Wait until awakened.
@@ -658,8 +1069,14 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 
 		LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "awakened");
 
+		/* not waiting anymore */
+		pg_atomic_fetch_sub_u32(&lock->nwaiters, 1);
+
 		/* Now loop back and try to acquire lock again. */
-		retry = true;
+		SpinLockAcquire(&lock->mutex);
+		lock->releaseOK = true;
+		SpinLockRelease(&lock->mutex);
+
 		result = false;
 	}
 
@@ -667,13 +1084,11 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	if (valptr)
 		*valptr = val;
 
-	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
-
 	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
 
 	/* Add lock to list of locks held by this backend */
-	held_lwlocks[num_held_lwlocks++] = l;
+	held_lwlocks[num_held_lwlocks].lock = l;
+	held_lwlocks[num_held_lwlocks++].mode = mode;
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -694,10 +1109,12 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 bool
 LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
 {
-	volatile LWLock *lock = l;
 	bool		mustwait;
+	bool		potentially_spurious;
 
-	PRINT_LWDEBUG("LWLockConditionalAcquire", lock);
+	AssertArg(mode == LW_SHARED || mode == LW_EXCLUSIVE);
+
+	PRINT_LWDEBUG("LWLockConditionalAcquire", l, mode);
 
 	/* Ensure we will have room to remember the lock */
 	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
@@ -710,33 +1127,12 @@ LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
 	 */
 	HOLD_INTERRUPTS();
 
-	/* Acquire mutex.  Time spent holding mutex should be short! */
-	SpinLockAcquire(&lock->mutex);
-
-	/* If I can get the lock, do so quickly. */
-	if (mode == LW_EXCLUSIVE)
-	{
-		if (lock->exclusive == 0 && lock->shared == 0)
-		{
-			lock->exclusive++;
-			mustwait = false;
-		}
-		else
-			mustwait = true;
-	}
-	else
-	{
-		if (lock->exclusive == 0)
-		{
-			lock->shared++;
-			mustwait = false;
-		}
-		else
-			mustwait = true;
-	}
-
-	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
+retry:
+	/*
+	 * passing 'true' to check more carefully to avoid potential
+	 * spurious acquisitions
+	 */
+	mustwait = LWLockAttemptLock(l, mode, true, &potentially_spurious);
 
 	if (mustwait)
 	{
@@ -744,14 +1140,29 @@ LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
 		RESUME_INTERRUPTS();
 		LOG_LWDEBUG("LWLockConditionalAcquire", T_NAME(l), T_ID(l), "failed");
 		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(l), T_ID(l), mode);
+
+		/*
+		 * We ran into an exclusive lock and might have blocked another
+		 * exclusive lock from taking a shot because it took a time to back
+		 * off. Retry till we are either sure we didn't block somebody (because
+		 * somebody else certainly has the lock) or till we got it.
+		 *
+		 * We cannot rely on the two-step lock-acquisition protocol as in
+		 * LWLockAcquire because we're not using it.
+		 */
+		if (potentially_spurious)
+		{
+			SPIN_DELAY();
+			goto retry;
+		}
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = l;
+		held_lwlocks[num_held_lwlocks].lock = l;
+		held_lwlocks[num_held_lwlocks++].mode = mode;
 		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(l), T_ID(l), mode);
 	}
-
 	return !mustwait;
 }
 
@@ -776,11 +1187,15 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 	PGPROC	   *proc = MyProc;
 	bool		mustwait;
 	int			extraWaits = 0;
+	bool		potentially_spurious_first;
+	bool		potentially_spurious_second;
 #ifdef LWLOCK_STATS
 	lwlock_stats *lwstats;
 #endif
 
-	PRINT_LWDEBUG("LWLockAcquireOrWait", lock);
+	Assert(mode == LW_SHARED || mode == LW_EXCLUSIVE);
+
+	PRINT_LWDEBUG("LWLockAcquireOrWait", lock, mode);
 
 #ifdef LWLOCK_STATS
 	lwstats = get_lwlock_stats_entry(l);
@@ -797,79 +1212,57 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 	 */
 	HOLD_INTERRUPTS();
 
-	/* Acquire mutex.  Time spent holding mutex should be short! */
-	SpinLockAcquire(&lock->mutex);
-
-	/* If I can get the lock, do so quickly. */
-	if (mode == LW_EXCLUSIVE)
-	{
-		if (lock->exclusive == 0 && lock->shared == 0)
-		{
-			lock->exclusive++;
-			mustwait = false;
-		}
-		else
-			mustwait = true;
-	}
-	else
-	{
-		if (lock->exclusive == 0)
-		{
-			lock->shared++;
-			mustwait = false;
-		}
-		else
-			mustwait = true;
-	}
+	/*
+	 * NB: We're using nearly the same twice-in-a-row lock acquisition
+	 * protocol as LWLockAcquire(). Check its comments for details.
+	 */
+	mustwait = LWLockAttemptLock(l, mode, false, &potentially_spurious_first);
 
 	if (mustwait)
 	{
-		/*
-		 * Add myself to wait queue.
-		 *
-		 * If we don't have a PGPROC structure, there's no way to wait.  This
-		 * should never occur, since MyProc should only be null during shared
-		 * memory initialization.
-		 */
-		if (proc == NULL)
-			elog(PANIC, "cannot wait without a PGPROC structure");
-
-		proc->lwWaiting = true;
-		proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
-		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
+		LWLockQueueSelf(l, LW_WAIT_UNTIL_FREE);
 
-		/* Can release the mutex now */
-		SpinLockRelease(&lock->mutex);
+		mustwait = LWLockAttemptLock(l, mode, false, &potentially_spurious_second);
 
-		/*
-		 * Wait until awakened.  Like in LWLockAcquire, be prepared for bogus
-		 * wakups, because we share the semaphore with ProcWaitForSignal.
-		 */
-		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "waiting");
+		if (mustwait)
+		{
+			/*
+			 * Wait until awakened.  Like in LWLockAcquire, be prepared for bogus
+			 * wakups, because we share the semaphore with ProcWaitForSignal.
+			 */
+			LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "waiting");
 
 #ifdef LWLOCK_STATS
-		lwstats->block_count++;
+			lwstats->block_count++;
 #endif
+			TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
+			for (;;)
+			{
+				/* "false" means cannot accept cancel/die interrupt here. */
+				PGSemaphoreLock(&proc->sem, false);
+				if (!proc->lwWaiting)
+					break;
+				extraWaits++;
+			}
+			pg_atomic_fetch_sub_u32(&lock->nwaiters, 1);
 
-		for (;;)
+			TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
+		}
+		else
 		{
-			/* "false" means cannot accept cancel/die interrupt here. */
-			PGSemaphoreLock(&proc->sem, false);
-			if (!proc->lwWaiting)
-				break;
-			extraWaits++;
+			/* got lock in the second attempt, undo queueing */
+			if (!LWLockDequeueSelf(l))
+			{
+				for (;;)
+				{
+					PGSemaphoreLock(&proc->sem, false);
+					if (!proc->lwWaiting)
+						break;
+					extraWaits++;
+				}
+			}
 		}
-
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
-
-		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "awakened");
-	}
-	else
-	{
-		/* We are done updating shared state of the lock itself. */
-		SpinLockRelease(&lock->mutex);
 	}
 
 	/*
@@ -887,8 +1280,10 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 	}
 	else
 	{
+		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "succeeded");
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = l;
+		held_lwlocks[num_held_lwlocks].lock = l;
+		held_lwlocks[num_held_lwlocks++].mode = mode;
 		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(l), T_ID(l), mode);
 	}
 
@@ -925,7 +1320,7 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 	lwlock_stats *lwstats;
 #endif
 
-	PRINT_LWDEBUG("LWLockWaitForVar", lock);
+	PRINT_LWDEBUG("LWLockWaitForVar", lock, LW_EXCLUSIVE);
 
 #ifdef LWLOCK_STATS
 	lwstats = get_lwlock_stats_entry(l);
@@ -938,7 +1333,7 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 	 * barrier here as far as the current usage is concerned.  But that might
 	 * not be safe in general.
 	 */
-	if (lock->exclusive == 0)
+	if (pg_atomic_read_u32(&lock->lockcount) == 0)
 		return true;
 
 	/*
@@ -956,21 +1351,16 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		bool		mustwait;
 		uint64		value;
 
-		/* Acquire mutex.  Time spent holding mutex should be short! */
-#ifdef LWLOCK_STATS
-		lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
-#else
-		SpinLockAcquire(&lock->mutex);
-#endif
+		mustwait = pg_atomic_read_u32(&lock->lockcount) != 0;
 
-		/* Is the lock now free, and if not, does the value match? */
-		if (lock->exclusive == 0)
-		{
-			result = true;
-			mustwait = false;
-		}
-		else
+		if (mustwait)
 		{
+			/*
+			 * Perform comparison using spinlock as we can't rely on atomic 64
+			 * bit reads/stores.
+			 */
+			SpinLockAcquire(&lock->mutex);
+
 			value = *valp;
 			if (value != oldval)
 			{
@@ -980,21 +1370,62 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 			}
 			else
 				mustwait = true;
+			SpinLockRelease(&lock->mutex);
 		}
+		else
+			mustwait = false;
 
 		if (!mustwait)
 			break;				/* the lock was free or value didn't match */
 
 		/*
-		 * Add myself to wait queue.
+		 * Add myself to wait queue. Note that this is racy, somebody else
+		 * could wakeup before we're finished queuing.
+		 * NB: We're using nearly the same twice-in-a-row lock acquisition
+		 * protocol as LWLockAcquire(). Check its comments for details.
 		 */
-		proc->lwWaiting = true;
-		proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
-		/* waiters are added to the front of the queue */
-		dlist_push_head((dlist_head *) &lock->waiters, &proc->lwWaitLink);
+		LWLockQueueSelf(l, LW_WAIT_UNTIL_FREE);
 
-		/* Can release the mutex now */
-		SpinLockRelease(&lock->mutex);
+		/*
+		 * We're now guaranteed to be woken up if necessary. Recheck the
+		 * lock's state.
+		 */
+		pg_read_barrier();
+		mustwait = pg_atomic_read_u32(&lock->lockcount) != 0;
+
+		/* ok, grabbed the lock the second time round, need to undo queueing */
+		if (!mustwait)
+		{
+#ifdef LWLOCK_STATS
+			lwstats->dequeue_self_count++;
+#endif
+			if (!LWLockDequeueSelf(l))
+			{
+				/*
+				 * Somebody else dequeued us and has or will wake us up. Wait
+				 * for the correct wakeup, otherwise our ->lwWaiting would get
+				 * reset at some inconvenient point later.
+				 */
+				for (;;)
+				{
+					PGSemaphoreLock(&proc->sem, false);
+					if (!proc->lwWaiting)
+						break;
+					extraWaits++;
+				}
+			}
+			PRINT_LWDEBUG("LWLockWaitForVar undo queue", lock, LW_EXCLUSIVE);
+			break;
+		}
+		else
+		{
+			PRINT_LWDEBUG("LWLockWaitForVar waiting 4", lock, LW_EXCLUSIVE);
+		}
+
+		/*
+		 * NB: Just as in LWLockAcquireCommon() there's no need to deal with
+		 * spurious lock attempts here.
+		 */
 
 		/*
 		 * Wait until awakened.
@@ -1028,13 +1459,11 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
 
 		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(l), T_ID(l), "awakened");
+		pg_atomic_fetch_sub_u32(&lock->nwaiters, 1);
 
 		/* Now loop back and check the status of the lock again. */
 	}
 
-	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
-
 	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
 
 	/*
@@ -1075,8 +1504,7 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
 	/* Acquire mutex.  Time spent holding mutex should be short! */
 	SpinLockAcquire(&lock->mutex);
 
-	/* we should hold the lock */
-	Assert(lock->exclusive == 1);
+	Assert(pg_atomic_read_u32(&lock->lockcount) >= EXCLUSIVE_LOCK);
 
 	/* Update the lock's value */
 	*valp = val;
@@ -1102,7 +1530,7 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
-	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
+	dlist_foreach_modify(iter, &wakeup)
 	{
 		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
 		dlist_delete(&waiter->lwWaitLink);
@@ -1120,22 +1548,23 @@ void
 LWLockRelease(LWLock *l)
 {
 	volatile LWLock *lock = l;
-	dlist_head	wakeup;
-	dlist_mutable_iter iter;
+	LWLockMode	mode;
+	uint32		lockcount;
+	bool		check_waiters;
+	bool		have_waiters = false;
 	int			i;
 
-	dlist_init(&wakeup);
-
-	PRINT_LWDEBUG("LWLockRelease", lock);
-
 	/*
 	 * Remove lock from list of locks held.  Usually, but not always, it will
 	 * be the latest-acquired lock; so search array backwards.
 	 */
 	for (i = num_held_lwlocks; --i >= 0;)
 	{
-		if (l == held_lwlocks[i])
+		if (l == held_lwlocks[i].lock)
+		{
+			mode = held_lwlocks[i].mode;
 			break;
+		}
 	}
 	if (i < 0)
 		elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
@@ -1143,75 +1572,40 @@ LWLockRelease(LWLock *l)
 	for (; i < num_held_lwlocks; i++)
 		held_lwlocks[i] = held_lwlocks[i + 1];
 
-	/* Acquire mutex.  Time spent holding mutex should be short! */
-	SpinLockAcquire(&lock->mutex);
+	PRINT_LWDEBUG("LWLockRelease", lock, mode);
 
-	/* Release my hold on lock */
-	if (lock->exclusive > 0)
-		lock->exclusive--;
+	/* Release my hold on lock, both are a full barrier */
+	if (mode == LW_EXCLUSIVE)
+		lockcount = pg_atomic_sub_fetch_u32(&lock->lockcount, EXCLUSIVE_LOCK);
 	else
-	{
-		Assert(lock->shared > 0);
-		lock->shared--;
-	}
+		lockcount = pg_atomic_sub_fetch_u32(&lock->lockcount, 1);
+
+	/* nobody else can have that kind of lock */
+	Assert(lockcount < EXCLUSIVE_LOCK);
 
 	/*
-	 * See if I need to awaken any waiters.  If I released a non-last shared
-	 * hold, there cannot be anything to do.  Also, do not awaken any waiters
-	 * if someone has already awakened waiters that haven't yet acquired the
-	 * lock.
+	 * Anybody we need to wakeup needs to have started queueing before
+	 * we removed ourselves from the queue and the __sync_ operations
+	 * above are full barriers.
 	 */
-	if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
-	{
-		/*
-		 * Remove the to-be-awakened PGPROCs from the queue.
-		 */
-		bool		releaseOK = true;
-		bool		wokeup_somebody = false;
-
-		dlist_foreach_modify(iter, (dlist_head *) &lock->waiters)
-		{
-			PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
-
-			if (wokeup_somebody && waiter->lwWaitMode == LW_EXCLUSIVE)
-				continue;
 
-			dlist_delete(&waiter->lwWaitLink);
-			dlist_push_tail(&wakeup, &waiter->lwWaitLink);
-
-			/*
-			 * Prevent additional wakeups until retryer gets to
-			 * run. Backends that are just waiting for the lock to become
-			 * free don't retry automatically.
-			 */
-			if (waiter->lwWaitMode != LW_WAIT_UNTIL_FREE)
-			{
-				releaseOK = false;
-				wokeup_somebody = true;
-			}
-
-			if(waiter->lwWaitMode == LW_EXCLUSIVE)
-				break;
-		}
-		lock->releaseOK = releaseOK;
-	}
+	if (pg_atomic_read_u32(&lock->nwaiters) > 0)
+		have_waiters = true;
 
-	/* We are done updating shared state of the lock itself. */
-	SpinLockRelease(&lock->mutex);
-
-	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(l), T_ID(l));
+	/* grant permission to run, even if a spurious share lock increases lockcount */
+	if (mode == LW_EXCLUSIVE && have_waiters)
+		check_waiters = true;
+	/* nobody has this locked anymore, potential exclusive lockers get a chance */
+	else if (lockcount == 0 && have_waiters)
+		check_waiters = true;
+	/* nobody queued or not free */
+	else
+		check_waiters = false;
 
-	/*
-	 * Awaken any waiters I removed from the queue.
-	 */
-	dlist_foreach_modify(iter, (dlist_head *) &wakeup)
+	if (check_waiters)
 	{
-		PGPROC *waiter = dlist_container(PGPROC, lwWaitLink, iter.cur);
-		LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
-		dlist_delete(&waiter->lwWaitLink);
-		pg_write_barrier();
-		waiter->lwWaiting = false;
-		PGSemaphoreUnlock(&waiter->sem);
+		PRINT_LWDEBUG("LWLockRelease releasing", lock, mode);
+		LWLockWakeup(l, mode);
 	}
 
 	/*
@@ -1237,7 +1631,7 @@ LWLockReleaseAll(void)
 	{
 		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */
 
-		LWLockRelease(held_lwlocks[num_held_lwlocks - 1]);
+		LWLockRelease(held_lwlocks[num_held_lwlocks - 1].lock);
 	}
 }
 
@@ -1245,8 +1639,8 @@ LWLockReleaseAll(void)
 /*
  * LWLockHeldByMe - test whether my process currently holds a lock
  *
- * This is meant as debug support only.  We do not distinguish whether the
- * lock is held shared or exclusive.
+ * This is meant as debug support only.  We currently do not distinguish
+ * whether the lock is held shared or exclusive.
  */
 bool
 LWLockHeldByMe(LWLock *l)
@@ -1255,7 +1649,7 @@ LWLockHeldByMe(LWLock *l)
 
 	for (i = 0; i < num_held_lwlocks; i++)
 	{
-		if (held_lwlocks[i] == l)
+		if (held_lwlocks[i].lock == l)
 			return true;
 	}
 	return false;
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1bca57b..f0abb35 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -16,6 +16,7 @@
 
 #include "lib/ilist.h"
 #include "storage/s_lock.h"
+#include "port/atomics.h"
 
 struct PGPROC;
 
@@ -48,10 +49,14 @@ typedef struct LWLock
 {
 	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
 	bool		releaseOK;		/* T if ok to release waiters */
-	char		exclusive;		/* # of exclusive holders (0 or 1) */
-	int			shared;			/* # of shared holders (0..MaxBackends) */
+
+	pg_atomic_uint32 lockcount;	/* state of exclusive/nonexclusive lockers */
+	pg_atomic_uint32 nwaiters;	/* number of waiters */
 	int			tranche;		/* tranche ID */
 	dlist_head	waiters;		/* list of waiting PGPROCs */
+#ifdef LOCK_DEBUG
+	struct PGPROC *owner;
+#endif
 } LWLock;
 
 /*
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index f49d17f..f217bd6 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -906,6 +906,8 @@ extern int	tas_sema(volatile slock_t *lock);
 #define S_LOCK_FREE(lock)	(*(lock) == 0)
 #endif	 /* S_LOCK_FREE */
 
+#include "storage/barrier.h"
+
 #if !defined(S_UNLOCK)
 /*
  * Our default implementation of S_UNLOCK is essentially *(lock) = 0.  This
-- 
2.0.0.rc2.4.g1dc51c6.dirty
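
(For readers skimming the patch: the lockcount fast path described in the
NOTES section above boils down to roughly the following untested,
standalone sketch. It uses C11 <stdatomic.h> instead of the patch's
pg_atomic_* API and omits the wait queue and releaseOK handling entirely;
only the EXCLUSIVE_LOCK constant mirrors the patch.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define EXCLUSIVE_LOCK (((uint32_t) 1) << 30)

typedef struct { _Atomic uint32_t lockcount; } demo_lwlock;

/* shared: optimistic increment, back out if an exclusive holder exists */
static bool
demo_try_shared(demo_lwlock *lock)
{
	uint32_t oldstate = atomic_fetch_add(&lock->lockcount, 1);

	if (oldstate >= EXCLUSIVE_LOCK)
	{
		/* exclusively held; undo our increment and report failure */
		atomic_fetch_sub(&lock->lockcount, 1);
		return false;		/* the real code now queues itself and retries */
	}
	return true;
}

/* exclusive: compare-and-exchange from 0 to EXCLUSIVE_LOCK */
static bool
demo_try_exclusive(demo_lwlock *lock)
{
	uint32_t expected = 0;

	return atomic_compare_exchange_strong(&lock->lockcount,
										  &expected, EXCLUSIVE_LOCK);
}

The patch layers the queue-then-recheck protocol from the NOTES on top of
exactly these two primitives.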

In reply to: Andres Freund (#94)
1 attachment(s)
Re: better atomics - v0.6

23.09.2014, 00:01, Andres Freund wrote:

I've finally managed to incorporate (all the?) feedback I got for
0.5. Imo the current version looks pretty good.

Most notable changes:
* Lots of comment improvements
* code moved out of storage/ into port/
* separated the s_lock.h changes into its own commit
* merged i386/amd64 into one file
* fixed lots of the little details Amit noticed
* fixed lots of XXX/FIXMEs
* rebased to today's master
* tested various gcc/msvc versions
* extended the regression tests
* ...

The patches:
0001: The actual atomics API

I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm
animal dingo) with this patch but regression tests failed due to:

/export/home/os/postgresql/src/test/regress/regress.so: symbol
pg_write_barrier_impl: referenced symbol not found

which turns out to be caused by a leftover PG_ prefix in ifdefs for
HAVE_GCC__ATOMIC_INT64_CAS. Removing the PG_ prefix fixed the build and
regression tests. Attached a patch to strip the invalid prefix.

0002: Implement s_lock.h support ontop the atomics API. Existing
implementations continue to be used unless
FORCE_ATOMICS_BASED_SPINLOCKS is defined

Applied this and built PG with and without FORCE_ATOMICS_BASED_SPINLOCKS
- both builds passed regression tests.

0003-0005: Not proposed for review here. Just included because code
actually using the atomics make testing them easier.

I'll look at these patches later.

/ Oskari

Attachments:

0001-atomics-fix-ifdefs-for-gcc-atomics.patch (text/x-diff)
>From 207b02cd7ee6d38b92b51195a951713639f0d738 Mon Sep 17 00:00:00 2001
From: Oskari Saarenmaa <os@ohmu.fi>
Date: Tue, 23 Sep 2014 13:28:51 +0300
Subject: [PATCH] atomics: fix ifdefs for gcc atomics

---
 src/include/port/atomics/generic-gcc.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/include/port/atomics/generic-gcc.h b/src/include/port/atomics/generic-gcc.h
index 9b0c636..f82af01 100644
--- a/src/include/port/atomics/generic-gcc.h
+++ b/src/include/port/atomics/generic-gcc.h
@@ -40,19 +40,19 @@
  * definitions where possible, and use this only as a fallback.
  */
 #if !defined(pg_memory_barrier_impl)
-#	if defined(HAVE_GCC__PG_ATOMIC_INT64_CAS)
+#	if defined(HAVE_GCC__ATOMIC_INT64_CAS)
 #		define pg_memory_barrier_impl()		__atomic_thread_fence(__ATOMIC_SEQ_CST)
 #	elif (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
 #		define pg_memory_barrier_impl()		__sync_synchronize()
 #	endif
 #endif /* !defined(pg_memory_barrier_impl) */
 
-#if !defined(pg_read_barrier_impl) && defined(HAVE_GCC__PG_ATOMIC_INT64_CAS)
+#if !defined(pg_read_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
 /* acquire semantics include read barrier semantics */
 #		define pg_read_barrier_impl()		__atomic_thread_fence(__ATOMIC_ACQUIRE)
 #endif
 
-#if !defined(pg_write_barrier_impl) && defined(HAVE_GCC__PG_ATOMIC_INT64_CAS)
+#if !defined(pg_write_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
 /* release semantics include write barrier semantics */
 #		define pg_write_barrier_impl()		__atomic_thread_fence(__ATOMIC_RELEASE)
 #endif
@@ -101,7 +101,7 @@ typedef struct pg_atomic_uint64
 	volatile uint64 value;
 } pg_atomic_uint64;
 
-#endif /* defined(HAVE_GCC__PG_ATOMIC_INT64_CAS) || defined(HAVE_GCC__SYNC_INT64_CAS) */
+#endif /* defined(HAVE_GCC__ATOMIC_INT64_CAS) || defined(HAVE_GCC__SYNC_INT64_CAS) */
 
 /*
  * Implementation follows. Inlined or directly included from atomics.c
-- 
1.8.4.1

#96Andres Freund
andres@2ndquadrant.com
In reply to: Oskari Saarenmaa (#95)
Re: better atomics - v0.6

On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote:

23.09.2014, 00:01, Andres Freund wrote:

I've finally managed to incorporate (all the?) feedback I got for
0.5. Imo the current version looks pretty good.

Most notable changes:
* Lots of comment improvements
* code moved out of storage/ into port/
* separated the s_lock.h changes into its own commit
* merged i386/amd64 into one file
* fixed lots of the little details Amit noticed
* fixed lots of XXX/FIXMEs
* rebased to today's master
* tested various gcc/msvc versions
* extended the regression tests
* ...

The patches:
0001: The actual atomics API

I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal
dingo) with this patch but regression tests failed due to:

/export/home/os/postgresql/src/test/regress/regress.so: symbol
pg_write_barrier_impl: referenced symbol not found

which turns out to be caused by a leftover PG_ prefix in ifdefs for
HAVE_GCC__ATOMIC_INT64_CAS. Removing the PG_ prefix fixed the build and
regression tests. Attached a patch to strip the invalid prefix.

Gah. Stupid last minute changes... Thanks for diagnosing. Will
integrate.

0002: Implement s_lock.h support ontop the atomics API. Existing
implementations continue to be used unless
FORCE_ATOMICS_BASED_SPINLOCKS is defined

Applied this and built PG with and without FORCE_ATOMICS_BASED_SPINLOCKS -
both builds passed regression tests.

Cool.

0003-0005: Not proposed for review here. Just included because code
actually using the atomics make testing them easier.

I'll look at these patches later.

Cool. Although I'm not proposing them to be integrated as-is...

Greetings,

Andres Freund


#97Andres Freund
andres@2ndquadrant.com
In reply to: Oskari Saarenmaa (#95)
Re: better atomics - v0.6

Hi,

On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote:

23.09.2014, 00:01, Andres Freund wrote:

I've finally managed to incorporate (all the?) feedback I got for
0.5. Imo the current version looks pretty good.

Most notable changes:
* Lots of comment improvements
* code moved out of storage/ into port/
* separated the s_lock.h changes into its own commit
* merged i386/amd64 into one file
* fixed lots of the little details Amit noticed
* fixed lots of XXX/FIXMEs
* rebased to today's master
* tested various gcc/msvc versions
* extended the regression tests
* ...

The patches:
0001: The actual atomics API

I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal
dingo) with this patch but regression tests failed due to:

Btw, if you could try sun studio it'd be great. I wrote the support for
it blindly, and I'd be surprised if I got it right on the first try.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


In reply to: Andres Freund (#97)
1 attachment(s)
Re: better atomics - v0.6

23.09.2014, 15:18, Andres Freund wrote:

On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote:

23.09.2014, 00:01, Andres Freund wrote:

The patches:
0001: The actual atomics API

I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal
dingo) with this patch but regression tests failed due to:

Btw, if you could try sun studio it'd be great. I wrote the support for
it blindly, and I'd be surprised if I got it right on the first try.

I just installed Solaris Studio 12.3 and tried compiling this:

"../../../../src/include/port/atomics/generic-sunpro.h", line 54: return
value type mismatch
"../../../../src/include/port/atomics/generic-sunpro.h", line 77: return
value type mismatch
"../../../../src/include/port/atomics/generic-sunpro.h", line 79:
#if-less #endif
"../../../../src/include/port/atomics/generic-sunpro.h", line 81:
#if-less #endif

atomic_add_64 and atomic_add_32 don't return anything (the
atomic_add_*_nv variants return the new value) and there were a few
extra #endifs. Regression tests pass after applying the attached patch
which defines PG_HAS_ATOMIC_ADD_FETCH_U32.

There doesn't seem to be a fallback for defining pg_atomic_fetch_add
based on pg_atomic_add_fetch, so pg_atomic_fetch_add now gets implemented
using pg_atomic_compare_exchange.
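
For what it's worth, a fallback along these lines would be enough - an
untested sketch against the Solaris <atomic.h> API, not something from the
patches (atomic_add_32_nv() returns the new value, so subtracting the
addend again recovers the old one):

#include <atomic.h>		/* Solaris: atomic_add_32_nv() */
#include <stdint.h>

static inline uint32_t
demo_fetch_add_u32(volatile uint32_t *ptr, int32_t add_)
{
	/* new value minus the addend == value before the addition */
	return atomic_add_32_nv(ptr, add_) - add_;
}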

Also, it's not possible to compile PG with FORCE_ATOMICS_BASED_SPINLOCKS
with these patches:

"../../../../src/include/storage/s_lock.h", line 868: #error: PostgreSQL
does not have native spinlock support on this platform....

atomics/generic.h would implement atomic flags using operations exposed
by atomics/generic-sunpro.h, but atomics/fallback.h is included before
it and it defines functions for flag operations which s_lock.h doesn't
want to use.

/ Oskari

Attachments:

0001-atomics-fix-atomic-add-for-sunpro-and-drop-invalid-e.patch (text/x-patch)
>From 42a5bbbab0c8f42c6014ebe12c9963b371168866 Mon Sep 17 00:00:00 2001
From: Oskari Saarenmaa <os@ohmu.fi>
Date: Tue, 23 Sep 2014 23:53:46 +0300
Subject: [PATCH] atomics: fix atomic add for sunpro and drop invalid #endifs

Solaris Studio compiler has an atomic add operation that returns the new
value, the one with no _nv suffix doesn't return anything.
---
 src/include/port/atomics/generic-sunpro.h | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/src/include/port/atomics/generic-sunpro.h b/src/include/port/atomics/generic-sunpro.h
index 10ac70d..fedc099 100644
--- a/src/include/port/atomics/generic-sunpro.h
+++ b/src/include/port/atomics/generic-sunpro.h
@@ -47,14 +47,12 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 	return ret;
 }
 
-#define PG_HAS_ATOMIC_FETCH_ADD_U32
+#define PG_HAS_ATOMIC_ADD_FETCH_U32
 STATIC_IF_INLINE uint32
-pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 {
-	return atomic_add_32(&ptr->value, add_);
+	return atomic_add_32_nv(&ptr->value, add_);
 }
-#endif
-
 
 #define PG_HAS_ATOMIC_COMPARE_EXCHANGE_U64
 static inline bool
@@ -70,12 +68,11 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 	return ret;
 }
 
-#define PG_HAS_ATOMIC_FETCH_ADD_U64
+#define PG_HAS_ATOMIC_ADD_FETCH_U64
 STATIC_IF_INLINE uint64
-pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+pg_atomic_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
 {
-	return atomic_add_64(&ptr->value, add_);
+	return atomic_add_64_nv(&ptr->value, add_);
 }
-#endif
 
 #endif
-- 
1.8.4.1

#99Andres Freund
andres@2ndquadrant.com
In reply to: Oskari Saarenmaa (#98)
Re: better atomics - v0.6

On 2014-09-24 00:27:25 +0300, Oskari Saarenmaa wrote:

23.09.2014, 15:18, Andres Freund wrote:

On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote:

23.09.2014, 00:01, Andres Freund wrote:

The patches:
0001: The actual atomics API

I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal
dingo) with this patch but regression tests failed due to:

Btw, if you could try sun studio it'd be great. I wrote the support for
it blindly, and I'd be surprised if I got it right on the first try.

I just installed Solaris Studio 12.3 and tried compiling this:

Cool.

"../../../../src/include/port/atomics/generic-sunpro.h", line 54: return
value type mismatch
"../../../../src/include/port/atomics/generic-sunpro.h", line 77: return
value type mismatch
"../../../../src/include/port/atomics/generic-sunpro.h", line 79: #if-less
#endif
"../../../../src/include/port/atomics/generic-sunpro.h", line 81: #if-less
#endif

atomic_add_64 and atomic_add_32 don't return anything (the atomic_add_*_nv
variants return the new value) and there were a few extra #endifs.
Regression tests pass after applying the attached patch which defines
PG_HAS_ATOMIC_ADD_FETCH_U32.

Thanks for the fixes!

Hm. I think then it's better to simply not implement addition and rely
on cmpxchg.

Also, it's not possible to compile PG with FORCE_ATOMICS_BASED_SPINLOCKS
with these patches:

"../../../../src/include/storage/s_lock.h", line 868: #error: PostgreSQL
does not have native spinlock support on this platform....

atomics/generic.h would implement atomic flags using operations exposed by
atomics/generic-sunpro.h, but atomics/fallback.h is included before it and
it defines functions for flag operations which s_lock.h doesn't want to use.

Right. It should check not just for flag support, but also for 32bit
atomics. Easily fixable.

I'll send a new version with your fixes incorporated tomorrow.

Thanks for looking into this! Very helpful.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#100Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#94)
Re: better atomics - v0.6

On 09/23/2014 12:01 AM, Andres Freund wrote:

Hi,

I've finally managed to incorporate (all the?) feedback I got for
0.5. Imo the current version looks pretty good.

Thanks! I agree it looks good. Some random comments after a quick
read-through:

There are some spurious whitespace changes in spin.c.

+ * These files can provide the full set of atomics or can do pretty much
+ * nothing if all the compilers commonly used on these platforms provide
+ * useable generics. It will often make sense to define memory barrier
+ * semantics in those, since e.g. the compiler intrinsics for x86 memory
+ * barriers can't know we don't need read/write barriers anything more than a
+ * compiler barrier.

I didn't understand the latter sentence.

+ * Don't add an inline assembly of the actual atomic operations if all the
+ * common implementations of your platform provide intrinsics. Intrinsics are
+ * much easier to understand and potentially support more architectures.

What about uncommon implementations, then? I think we need to provide
inline assembly if any supported implementation on the platform does not
provide intrinsics, otherwise we fall back to emulation. Or is that
exactly what you're envisioning we do?

I believe such a situation hasn't come up on the currently supported
platforms, so we don't necessarily have to have a solution for that, but
the comment doesn't feel quite right either.

+#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \
+	Assert(TYPEALIGN((uintptr_t)(ptr), bndr))

Would be better to call this AssertAlignment, to make it clear that this
is an assertion, not something like CHECK_FOR_INTERRUPTS(). And perhaps
move this to c.h where other assertions are - this seems generally useful.

+ * Spinloop delay -

Spurious dash.

+/*
+ * pg_fetch_add_until_u32 - saturated addition to variable
+ *
+ * Returns the the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
+							  uint32 until)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
+}
+

This was a surprise to me, I don't recall discussion of a
"fetch-add-until" operation, and hadn't actually ever heard of it
before. None of the subsequent patches seem to use it either. Can we
just leave this out?

s/the the/the/

+#ifndef PG_HAS_ATOMIC_WRITE_U64
+#define PG_HAS_ATOMIC_WRITE_U64
+static inline void
+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	/*
+	 * 64 bit writes aren't safe on all platforms. In the generic
+	 * implementation implement them as an atomic exchange. That might store a
+	 * 0, but only if the prev. value also was a 0.
+	 */
+	pg_atomic_exchange_u64_impl(ptr, val);
+}
+#endif

Why is 0 special?

+ * fail or suceed, but always return the old value

s/suceed/succeed/. Add a full stop to end.

+	/*
+	 * Can't run the test under the semaphore emulation, it doesn't handle
+	 * checking edge cases well.
+	 */
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+	test_atomic_flag();
+#endif

Huh? Is the semaphore emulation broken, then?

- Heikki


#101Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#100)
Re: better atomics - v0.6

On 2014-09-24 14:59:19 +0300, Heikki Linnakangas wrote:

On 09/23/2014 12:01 AM, Andres Freund wrote:

Hi,

I've finally managed to incorporate (all the?) feedback I got for
0.5. Imo the current version looks pretty good.

Thanks! I agree it looks good. Some random comments after a quick
read-through:

There are some spurious whitespace changes in spin.c.

Huh. Unsure how they crept in. Will fix.

+ * These files can provide the full set of atomics or can do pretty much
+ * nothing if all the compilers commonly used on these platforms provide
+ * useable generics. It will often make sense to define memory barrier
+ * semantics in those, since e.g. the compiler intrinsics for x86 memory
+ * barriers can't know we don't need read/write barriers anything more than a
+ * compiler barrier.

I didn't understand the latter sentence.

Hm. The background here is that postgres doesn't need memory barriers to
affect sse et al registers - which is why read/write barriers currently
can be defined to be empty on TSO architectures like x86. But e.g. gcc's
barrier intrinsics don't assume that, so they do an mfence or lock xaddl
for read/write barriers. Which isn't cheap.
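
Concretely, on x86 something like the following would suffice for the
read/write barriers - an untested illustration, not the patch's actual
definitions:

/* compiler barrier only; emits no instruction at all */
#define demo_x86_read_barrier()		__asm__ __volatile__("" ::: "memory")
#define demo_x86_write_barrier()	__asm__ __volatile__("" ::: "memory")

/* what the generic fallback does instead; mfence on x86 */
#define demo_full_barrier()			__sync_synchronize()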

Will try to find a way to rephrase.

+ * Don't add an inline assembly of the actual atomic operations if all the
+ * common implementations of your platform provide intrinsics. Intrinsics are
+ * much easier to understand and potentially support more architectures.

What about uncommon implementations, then? I think we need to provide inline
assembly if any supported implementation on the platform does not provide
intrinsics, otherwise we fall back to emulation. Or is that exactly what
you're envisioning we do?

Yes, that's what I'm envisioning. E.g. old versions of Solaris Studio
don't provide compiler intrinsics and also don't provide reliable ways
to get barrier semantics. It doesn't seem worthwhile to add an inline
assembly version for that special case.

I believe such a situation hasn't come up on the currently supported
platforms, so we don't necessarily have to have a solution for that, but the
comment doesn't feel quite right either.

It's the reason I added an inline assembly version of atomics for x86
gcc...

+#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \
+	Assert(TYPEALIGN((uintptr_t)(ptr), bndr))

Would be better to call this AssertAlignment, to make it clear that this is
an assertion, not something like CHECK_FOR_INTERRUPTS(). And perhaps move
this to c.h where other assertions are - this seems generally useful.

Sounds good.

+ * Spinloop delay -

Spurious dash.

Probably should rather add a bit more explanation...

+/*
+ * pg_fetch_add_until_u32 - saturated addition to variable
+ *
+ * Returns the the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
+							  uint32 until)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
+}
+

This was a surprise to me, I don't recall discussion of a "fetch-add-until"
operation, and hadn't actually ever heard of it before.

It was included from the first version on, and I'd mentioned it a couple
times.

None of the subsequent patches seem to use it either. Can we just
leave this out?

I have a further patch (lockless buffer header manipulation) that uses
it to manipulate the usagecount. I can add it back later if you feel
strongly about it.

s/the the/the/

+#ifndef PG_HAS_ATOMIC_WRITE_U64
+#define PG_HAS_ATOMIC_WRITE_U64
+static inline void
+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	/*
+	 * 64 bit writes aren't safe on all platforms. In the generic
+	 * implementation implement them as an atomic exchange. That might store a
+	 * 0, but only if the prev. value also was a 0.
+	 */
+	pg_atomic_exchange_u64_impl(ptr, val);
+}
+#endif

Why is 0 special?

That bit should actually be in the read_u64 implementation, not write...

If you do a compare and exchange you have to provide an 'expected'
value. So, if you do a compare and exchange and the previous value is 0,
the value will be written superfluously - but the actual value won't change.
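
In other words, the generic 64 bit read can be sketched roughly like
this (a sketch only - the function name is made up for illustration,
and it assumes a u64 compare-exchange _impl along the lines of the 32
bit one):

static inline uint64
pg_atomic_read_u64_via_cas(volatile pg_atomic_uint64 *ptr)
{
	uint64 expected = 0;

	/*
	 * Compare-exchange with expected == 0: if the stored value is 0 the
	 * CAS "succeeds" and writes 0 back (superfluous, but the value does
	 * not change); otherwise it fails and copies the current value into
	 * 'expected'. Either way 'expected' ends up holding the current
	 * value, read atomically.
	 */
	pg_atomic_compare_exchange_u64_impl(ptr, &expected, 0);

	return expected;
}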

+	/*
+	 * Can't run the test under the semaphore emulation, it doesn't handle
+	 * checking edge cases well.
+	 */
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+	test_atomic_flag();
+#endif

Huh? Is the semaphore emulation broken, then?

No, it's not. The semaphore emulation doesn't implement
pg_atomic_unlocked_test_flag() (as there's no sane way to do that with
semaphores that I know of). Instead it always returns true. Which
disturbs the test. I guess I'll expand that comment...
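
For illustration, the kind of check that falls over looks roughly like
this (a hypothetical excerpt, not the exact code in regress.c):

	pg_atomic_flag flag;

	pg_atomic_init_flag(&flag);
	if (!pg_atomic_unlocked_test_flag(&flag))
		elog(ERROR, "flag is set right after init");

	if (!pg_atomic_test_set_flag(&flag))
		elog(ERROR, "flag could not be set");

	/*
	 * Under the semaphore emulation pg_atomic_unlocked_test_flag()
	 * unconditionally returns true ("not set"), so this check errors
	 * out spuriously.
	 */
	if (pg_atomic_unlocked_test_flag(&flag))
		elog(ERROR, "flag reads as unset although it was just set");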

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#102Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#101)
Re: better atomics - v0.6

On 09/24/2014 03:37 PM, Andres Freund wrote:

+/*
+ * pg_fetch_add_until_u32 - saturated addition to variable
+ *
+ * Returns the the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
+							  uint32 until)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
+}
+

This was a surprise to me, I don't recall discussion of a "fetch-add-until"
operation, and hadn't actually ever heard of it before.

It was included from the first version on, and I'd mentioned it a couple
times.

There doesn't seem to be any hardware implementations of that in the
patch. Is there any architecture that has an instruction or compiler
intrinsic for that?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#103Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#102)
Re: better atomics - v0.6

On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:

On 09/24/2014 03:37 PM, Andres Freund wrote:

+/*
+ * pg_fetch_add_until_u32 - saturated addition to variable
+ *
+ * Returns the the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
+							  uint32 until)
+{
+	CHECK_POINTER_ALIGNMENT(ptr, 4);
+	return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
+}
+

This was a surprise to me, I don't recall discussion of a "fetch-add-until"
operation, and hadn't actually ever heard of it before.

It was included from the first version on, and I'd mentioned it a couple
times.

There doesn't seem to be any hardware implementations of that in the patch.
Is there any architecture that has an instruction or compiler intrinsic for
that?

You can implement it rather efficiently on ll/sc architectures. But I
don't really think it matters. I prefer add_until (I've seen it named
saturated add before as well) to live in the atomics code, rather than
reimplement it in atomics employing code. I guess you see that
differently?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#104Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#103)
Re: better atomics - v0.6

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:

There doesn't seem to be any hardware implementations of that in the patch.
Is there any architecture that has an instruction or compiler intrinsic for
that?

You can implement it rather efficiently on ll/sc architectures. But I
don't really think it matters. I prefer add_until (I've seen it named
saturated add before as well) to live in the atomics code, rather than
reimplement it in atomics employing code. I guess you see that
differently?

I think the question is more like "what in the world happened to confining
ourselves to a small set of atomics". I doubt either that this exists
natively anywhere, or that it's so useful that we should expect platforms
to have efficient implementations.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#105Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#104)
Re: better atomics - v0.6

On 2014-09-24 12:44:09 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:

There doesn't seem to be any hardware implementations of that in the patch.
Is there any architecture that has an instruction or compiler intrinsic for
that?

You can implement it rather efficiently on ll/sc architectures. But I
don't really think it matters. I prefer add_until (I've seen it named
saturated add before as well) to live in the atomics code, rather than
reimplement it in atomics employing code. I guess you see that
differently?

I think the question is more like "what in the world happened to confining
ourselves to a small set of atomics".

I fail to see why the existence of a wrapper around compare-exchange
(which is one of the primitives we'd agreed upon) runs counter to
the agreement that we'll only rely on a limited number of atomics on the
hardware level?
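
For reference, such a wrapper can be built generically on top of the
32 bit compare-exchange, roughly like this (a sketch that assumes add_
is positive; the _impl helper names follow the patch's conventions, but
this isn't the exact code from generic.h):

static inline uint32
pg_atomic_fetch_add_until_u32_generic(volatile pg_atomic_uint32 *ptr,
									  int32 add_, uint32 until)
{
	uint32 old = pg_atomic_read_u32_impl(ptr);

	for (;;)
	{
		uint32 newval;

		if (old >= until)
			return old;			/* already saturated, nothing to do */

		newval = old + add_;
		if (newval > until)
			newval = until;		/* clamp the result at the limit */

		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, newval))
			return newval;		/* value after the saturated addition */

		/* CAS failed; 'old' now holds the current value, so retry */
	}
}

Only the compare-exchange touches the hardware; everything else is
ordinary C, which is the sense in which it's "just a wrapper".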

I doubt either that this exists
natively anywhere, or that it's so useful that we should expect platforms
to have efficient implementations.

It's useful for my work to get rid of most LockBufHdr() calls (to
manipulate usagecount locklessly). That's why I added it. We can delay
it till that patch is ready, but I don't really see the benefit.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#106Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#105)
Re: better atomics - v0.6

On 09/24/2014 07:57 PM, Andres Freund wrote:

On 2014-09-24 12:44:09 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:

There doesn't seem to be any hardware implementations of that in the patch.
Is there any architecture that has an instruction or compiler intrinsic for
that?

You can implement it rather efficiently on ll/sc architectures. But I
don't really think it matters. I prefer add_until (I've seen it named
saturated add before as well) to live in the atomics code, rather than
reimplement it in atomics employing code. I guess you see that
differently?

I think the question is more like "what in the world happened to confining
ourselves to a small set of atomics".

I fail to see why the existence of a wrapper around compare-exchange
(which is one of the primitives we'd agreed upon) runs counter to
the agreement that we'll only rely on a limited number of atomics on the
hardware level?

It might be a useful function, but if there's no hardware implementation
for it, it doesn't belong in atomics.h. We don't want to turn it into a
general library of useful little functions.

I doubt either that this exists
natively anywhere, or that it's so useful that we should expect platforms
to have efficient implementations.

Googling around, ARM seems to have a QADD instruction that does that.
But AFAICS it's not an atomic instruction that you could use for
synchronization purposes, just a regular instruction.

It's useful for my work to get rid of most LockBufHdr() calls (to
manipulate usagecount locklessly). That's why I added it. We can delay
it till that patch is ready, but I don't really see the benefit.

Yeah, please leave it out for now, we can argue about it later. Even if
we want it in the future, it would be just dead, untested code now.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#107Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#106)
Re: better atomics - v0.6

On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote:

On 09/24/2014 07:57 PM, Andres Freund wrote:

On 2014-09-24 12:44:09 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:

There doesn't seem to be any hardware implementations of that in the patch.
Is there any architecture that has an instruction or compiler intrinsic for
that?

You can implement it rather efficiently on ll/sc architectures. But I
don't really think it matters. I prefer add_until (I've seen it named
saturated add before as well) to live in the atomics code, rather than
reimplement it in atomics employing code. I guess you see that
differently?

I think the question is more like "what in the world happened to confining
ourselves to a small set of atomics".

I fail to see why the existence of a wrapper around compare-exchange
(which is one of the primitives we'd agreed upon) runs counter to
the agreement that we'll only rely on a limited number of atomics on the
hardware level?

It might be a useful function, but if there's no hardware implementation for
it, it doesn't belong in atomics.h. We don't want to turn it into a general
library of useful little functions.

Uh. It belongs there because it *atomically* does a saturated add?
That's precisely what atomics.h is about, so I don't see why it'd belong
anywhere else.

I doubt either that this exists
natively anywhere, or that it's so useful that we should expect platforms
to have efficient implementations.

Googling around, ARM seems to have a QADD instruction that does that. But
AFAICS it's not an atomic instruction that you could use for synchronization
purposes, just a regular instruction.

On ARM - and many other LL/SC architectures - you can implement it
atomically using LL/SC. In pseudocode:

while (true)
{
	load_linked(mem, reg);
	if (reg + add < limit)
		reg = reg + add;
	else
		reg = limit;
	if (store_conditional(mem, reg))
		break;
}

That's how pretty much *all* atomic instructions work on such architectures.

It's useful for my work to get rid of most LockBufHdr() calls (to
manipulate usagecount locklessly). That's why I added it. We can delay
it till that patch is ready, but I don't really see the benefit.

Yeah, please leave it out for now, we can argue about it later. Even if we
want it in the future, it would be just dead, untested code now.

Ok, will remove it for now.

I won't repost a version with it removed, as removing a function as the
only change doesn't seem to warrant reposting it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#108Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#106)
Re: better atomics - v0.6

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 09/24/2014 07:57 PM, Andres Freund wrote:

On 2014-09-24 12:44:09 -0400, Tom Lane wrote:

I think the question is more like "what in the world happened to confining
ourselves to a small set of atomics".

I fail to see why the existence of a wrapper around compare-exchange
(which is one of the primitives we'd agreed upon) runs counter to
the agreement that we'll only rely on a limited number of atomics on the
hardware level?

It might be a useful function, but if there's no hardware implementation
for it, it doesn't belong in atomics.h. We don't want to turn it into a
general library of useful little functions.

Note that the spinlock code separates s_lock.h (hardware implementations)
from spin.h (a hardware-independent abstraction layer). Perhaps there's
room for a similar separation here. I tend to agree with Heikki that
wrappers around compare-exchange ought not be conflated with
compare-exchange itself, even if there might theoretically be
architectures where the wrapper function could be implemented directly.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#109Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#108)
Re: better atomics - v0.6

On 2014-09-24 14:28:18 -0400, Tom Lane wrote:

Note that the spinlock code separates s_lock.h (hardware implementations)
from spin.h (a hardware-independent abstraction layer). Perhaps there's
room for a similar separation here.

Luckily that separation exists ;). The hardware dependent bits are in
different files from the platform independent functionality on top of
them (atomics/generic.h).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#110Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#107)
1 attachment(s)
Re: better atomics - v0.6

On 2014-09-24 20:27:39 +0200, Andres Freund wrote:

On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote:
I won't repost a version with it removed, as removing a function as the
only change doesn't seem to warrant reposting it.

Besides the removal and addressing Heikki's earlier comments, I've fixed
(thanks Alvaro!) some minor additional issues:
* removal of two remaining non-ASCII copyright signs. The only non-ASCII
character that remains is in Alvaro's name in the commit message.
* additional comments for STATIC_IF_INLINE/STATIC_IF_INLINE_DECLARE
* there was a leftover HAVE_GCC_INT_ATOMICS reference - it's now split
into different configure tests and thus has a different name. A later
patch removes that reference entirely, which is why I'd missed it...

Heikki has marked the patch as 'ready for committer' in the commitfest
(conditional on his remarks being addressed) and I agree. There seems to
be little benefit in waiting further. There *definitely* will be some
platform dependent issues, but that won't change by waiting longer.

I plan to commit this quite soon unless somebody protests really
quickly.

Sorting out the issues on platforms I don't have access to based on
buildfarm feedback will take a while given how infrequently some of the
respective animals run... But that's just life.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-a-basic-atomic-ops-API-abstracting-away-platform.patch (text/x-patch; charset=iso-8859-1)
From b64d92f1a5602c55ee8b27a7ac474f03b7aee340 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 25 Sep 2014 23:49:05 +0200
Subject: [PATCH] Add a basic atomic ops API abstracting away
 platform/architecture details.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Several upcoming performance/scalability improvements require atomic
operations. This new API avoids the need to splatter compiler and
architecture dependent code over all the locations employing atomic
ops.

For several of the potential usages it'd be problematic to maintain
both, a atomics using implementation and one using spinlocks or
similar. In all likelihood one of the implementations would not get
tested regularly under concurrency. To avoid that scenario the new API
provides a automatic fallback of atomic operations to spinlocks. All
properties of atomic operations are maintained. This fallback -
obviously - isn't as fast as just using atomic ops, but it's not bad
either. For one of the future users the atomics ontop spinlocks
implementation was actually slightly faster than the old purely
spinlock using implementation. That's important because it reduces the
fear of regressing older platforms when improving the scalability for
new ones.

The API, loosely modeled after the C11 atomics support, currently
provides 'atomic flags' and 32 bit unsigned integers. If the platform
efficiently supports atomic 64 bit unsigned integers those are also
provided.

To implement atomics support for a platform/architecture/compiler for
a type of atomics 32bit compare and exchange needs to be
implemented. If available and more efficient native support for flags,
32 bit atomic addition, and corresponding 64 bit operations may also
be provided. Additional useful atomic operations are implemented
generically ontop of these.

The implementation for various versions of gcc, msvc and sun studio have
been tested. Additional existing stub implementations for
* Intel icc
* HUPX acc
* IBM xlc
are included but have never been tested. These will likely require
fixes based on buildfarm and user feedback.

As atomic operations also require barriers for some operations the
existing barrier support has been moved into the atomics code.

Author: Andres Freund with contributions from Oskari Saarenmaa
Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera
Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com,
    20131015123303.GH5300@awork2.anarazel.de,
    20131028205522.GI20248@awork2.anarazel.de
---
 config/c-compiler.m4                             |  98 +++++
 configure                                        | 274 ++++++++++--
 configure.in                                     |  34 +-
 src/backend/main/main.c                          |   1 +
 src/backend/port/Makefile                        |   2 +-
 src/backend/port/atomics.c                       | 127 ++++++
 src/backend/storage/lmgr/spin.c                  |   7 +-
 src/include/c.h                                  |  21 +-
 src/include/pg_config.h.in                       |  24 +-
 src/include/pg_config.h.win32                    |   3 +
 src/include/pg_config_manual.h                   |   8 +
 src/include/port/atomics.h                       | 531 +++++++++++++++++++++++
 src/include/port/atomics/arch-arm.h              |  25 ++
 src/include/port/atomics/arch-hppa.h             |  17 +
 src/include/port/atomics/arch-ia64.h             |  26 ++
 src/include/port/atomics/arch-ppc.h              |  26 ++
 src/include/port/atomics/arch-x86.h              | 241 ++++++++++
 src/include/port/atomics/fallback.h              | 132 ++++++
 src/include/port/atomics/generic-acc.h           |  99 +++++
 src/include/port/atomics/generic-gcc.h           | 236 ++++++++++
 src/include/port/atomics/generic-msvc.h          | 103 +++++
 src/include/port/atomics/generic-sunpro.h        |  74 ++++
 src/include/port/atomics/generic-xlc.h           | 103 +++++
 src/include/port/atomics/generic.h               | 387 +++++++++++++++++
 src/include/storage/barrier.h                    | 155 +------
 src/include/storage/s_lock.h                     |   9 +-
 src/test/regress/expected/lock.out               |   8 +
 src/test/regress/input/create_function_1.source  |   5 +
 src/test/regress/output/create_function_1.source |   4 +
 src/test/regress/regress.c                       | 239 ++++++++++
 src/test/regress/sql/lock.sql                    |   5 +
 31 files changed, 2816 insertions(+), 208 deletions(-)
 create mode 100644 src/backend/port/atomics.c
 create mode 100644 src/include/port/atomics.h
 create mode 100644 src/include/port/atomics/arch-arm.h
 create mode 100644 src/include/port/atomics/arch-hppa.h
 create mode 100644 src/include/port/atomics/arch-ia64.h
 create mode 100644 src/include/port/atomics/arch-ppc.h
 create mode 100644 src/include/port/atomics/arch-x86.h
 create mode 100644 src/include/port/atomics/fallback.h
 create mode 100644 src/include/port/atomics/generic-acc.h
 create mode 100644 src/include/port/atomics/generic-gcc.h
 create mode 100644 src/include/port/atomics/generic-msvc.h
 create mode 100644 src/include/port/atomics/generic-sunpro.h
 create mode 100644 src/include/port/atomics/generic-xlc.h
 create mode 100644 src/include/port/atomics/generic.h

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 802f553..2cf74bb 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -300,3 +300,101 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_PROG_CC_LDFLAGS_OPT
+
+# PGAC_HAVE_GCC__SYNC_CHAR_TAS
+# -------------------------
+# Check if the C compiler understands __sync_lock_test_and_set(char),
+# and define HAVE_GCC__SYNC_CHAR_TAS
+#
+# NB: There are platforms where test_and_set is available but compare_and_swap
+# is not, so test this separately.
+# NB: Some platforms only do 32bit tas, others only do 8bit tas. Test both.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_CHAR_TAS],
+[AC_CACHE_CHECK(for builtin __sync char locking functions, pgac_cv_gcc_sync_char_tas,
+[AC_TRY_LINK([],
+  [char lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);],
+  [pgac_cv_gcc_sync_char_tas="yes"],
+  [pgac_cv_gcc_sync_char_tas="no"])])
+if test x"$pgac_cv_gcc_sync_char_tas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_CHAR_TAS, 1, [Define to 1 if you have __sync_lock_test_and_set(char *) and friends.])
+fi])# PGAC_HAVE_GCC__SYNC_CHAR_TAS
+
+# PGAC_HAVE_GCC__SYNC_INT32_TAS
+# -------------------------
+# Check if the C compiler understands __sync_lock_test_and_set(),
+# and define HAVE_GCC__SYNC_INT32_TAS
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_TAS],
+[AC_CACHE_CHECK(for builtin __sync int32 locking functions, pgac_cv_gcc_sync_int32_tas,
+[AC_TRY_LINK([],
+  [int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);],
+  [pgac_cv_gcc_sync_int32_tas="yes"],
+  [pgac_cv_gcc_sync_int32_tas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_TAS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_TAS
+
+# PGAC_HAVE_GCC__SYNC_INT32_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 32bit
+# types, and define HAVE_GCC__SYNC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __sync int32 atomic operations, pgac_cv_gcc_sync_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);],
+  [pgac_cv_gcc_sync_int32_cas="yes"],
+  [pgac_cv_gcc_sync_int32_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT32_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int *, int, int).])
+fi])# PGAC_HAVE_GCC__SYNC_INT32_CAS
+
+# PGAC_HAVE_GCC__SYNC_INT64_CAS
+# -------------------------
+# Check if the C compiler understands __sync_compare_and_swap() for 64bit
+# types, and define HAVE_GCC__SYNC_INT64_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT64_CAS],
+[AC_CACHE_CHECK(for builtin __sync int64 atomic operations, pgac_cv_gcc_sync_int64_cas,
+[AC_TRY_LINK([],
+  [PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);],
+  [pgac_cv_gcc_sync_int64_cas="yes"],
+  [pgac_cv_gcc_sync_int64_cas="no"])])
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__SYNC_INT64_CAS, 1, [Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64).])
+fi])# PGAC_HAVE_GCC__SYNC_INT64_CAS
+
+# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+# -------------------------
+# Check if the C compiler understands __atomic_compare_exchange_n() for 32bit
+# types, and define HAVE_GCC__ATOMIC_INT32_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT32_CAS],
+[AC_CACHE_CHECK(for builtin __atomic int32 atomic operations, pgac_cv_gcc_atomic_int32_cas,
+[AC_TRY_LINK([],
+  [int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);],
+  [pgac_cv_gcc_atomic_int32_cas="yes"],
+  [pgac_cv_gcc_atomic_int32_cas="no"])])
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT32_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+
+# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
+# -------------------------
+# Check if the C compiler understands __atomic_compare_exchange_n() for 64bit
+# types, and define HAVE_GCC__ATOMIC_INT64_CAS if so.
+AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT64_CAS],
+[AC_CACHE_CHECK(for builtin __atomic int64 atomic operations, pgac_cv_gcc_atomic_int64_cas,
+[AC_TRY_LINK([],
+  [PG_INT64_TYPE val = 0;
+   PG_INT64_TYPE expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);],
+  [pgac_cv_gcc_atomic_int64_cas="yes"],
+  [pgac_cv_gcc_atomic_int64_cas="no"])])
+if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then
+  AC_DEFINE(HAVE_GCC__ATOMIC_INT64_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int64 *, int *, int64).])
+fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
diff --git a/configure b/configure
index dac8e49..f0580ce 100755
--- a/configure
+++ b/configure
@@ -802,6 +802,7 @@ enable_nls
 with_pgport
 enable_rpath
 enable_spinlocks
+enable_atomics
 enable_debug
 enable_profiling
 enable_coverage
@@ -1470,6 +1471,7 @@ Optional Features:
   --disable-rpath         do not embed shared library search path in
                           executables
   --disable-spinlocks     do not use spinlocks
+  --disable-atomics       do not use atomic operations
   --enable-debug          build with debugging symbols (-g)
   --enable-profiling      build with profiling enabled
   --enable-coverage       build with coverage testing instrumentation
@@ -3146,6 +3148,33 @@ fi
 
 
 #
+# Atomic operations
+#
+
+
+# Check whether --enable-atomics was given.
+if test "${enable_atomics+set}" = set; then :
+  enableval=$enable_atomics;
+  case $enableval in
+    yes)
+      :
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --enable-atomics option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  enable_atomics=yes
+
+fi
+
+
+
+#
 # --enable-debug adds -g to compiler flags
 #
 
@@ -8349,6 +8378,17 @@ $as_echo "$as_me: WARNING:
 *** Not using spinlocks will cause poor performance." >&2;}
 fi
 
+if test "$enable_atomics" = yes; then
+
+$as_echo "#define HAVE_ATOMICS 1" >>confdefs.h
+
+else
+  { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
+*** Not using atomic operations will cause poor performance." >&5
+$as_echo "$as_me: WARNING:
+*** Not using atomic operations will cause poor performance." >&2;}
+fi
+
 if test "$with_gssapi" = yes ; then
   if test "$PORTNAME" != "win32"; then
     { $as_echo "$as_me:${as_lineno-$LINENO}: checking for library containing gss_init_sec_context" >&5
@@ -9123,7 +9163,7 @@ fi
 done
 
 
-for ac_header in crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
@@ -12154,40 +12194,6 @@ fi
 done
 
 
-{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin locking functions" >&5
-$as_echo_n "checking for builtin locking functions... " >&6; }
-if ${pgac_cv_gcc_int_atomics+:} false; then :
-  $as_echo_n "(cached) " >&6
-else
-  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
-/* end confdefs.h.  */
-
-int
-main ()
-{
-int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);
-  ;
-  return 0;
-}
-_ACEOF
-if ac_fn_c_try_link "$LINENO"; then :
-  pgac_cv_gcc_int_atomics="yes"
-else
-  pgac_cv_gcc_int_atomics="no"
-fi
-rm -f core conftest.err conftest.$ac_objext \
-    conftest$ac_exeext conftest.$ac_ext
-fi
-{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_int_atomics" >&5
-$as_echo "$pgac_cv_gcc_int_atomics" >&6; }
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-
-$as_echo "#define HAVE_GCC_INT_ATOMICS 1" >>confdefs.h
-
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -13711,6 +13717,204 @@ _ACEOF
 fi
 
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync char locking functions" >&5
+$as_echo_n "checking for builtin __sync char locking functions... " >&6; }
+if ${pgac_cv_gcc_sync_char_tas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+char lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_char_tas="yes"
+else
+  pgac_cv_gcc_sync_char_tas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_char_tas" >&5
+$as_echo "$pgac_cv_gcc_sync_char_tas" >&6; }
+if test x"$pgac_cv_gcc_sync_char_tas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_CHAR_TAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync int32 locking functions" >&5
+$as_echo_n "checking for builtin __sync int32 locking functions... " >&6; }
+if ${pgac_cv_gcc_sync_int32_tas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int lock = 0;
+   __sync_lock_test_and_set(&lock, 1);
+   __sync_lock_release(&lock);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int32_tas="yes"
+else
+  pgac_cv_gcc_sync_int32_tas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int32_tas" >&5
+$as_echo "$pgac_cv_gcc_sync_int32_tas" >&6; }
+if test x"$pgac_cv_gcc_sync_int32_tas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT32_TAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync int32 atomic operations" >&5
+$as_echo_n "checking for builtin __sync int32 atomic operations... " >&6; }
+if ${pgac_cv_gcc_sync_int32_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int val = 0;
+   __sync_val_compare_and_swap(&val, 0, 37);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int32_cas="yes"
+else
+  pgac_cv_gcc_sync_int32_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int32_cas" >&5
+$as_echo "$pgac_cv_gcc_sync_int32_cas" >&6; }
+if test x"$pgac_cv_gcc_sync_int32_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT32_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __sync int64 atomic operations" >&5
+$as_echo_n "checking for builtin __sync int64 atomic operations... " >&6; }
+if ${pgac_cv_gcc_sync_int64_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+PG_INT64_TYPE lock = 0;
+   __sync_val_compare_and_swap(&lock, 0, (PG_INT64_TYPE) 37);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_sync_int64_cas="yes"
+else
+  pgac_cv_gcc_sync_int64_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_sync_int64_cas" >&5
+$as_echo "$pgac_cv_gcc_sync_int64_cas" >&6; }
+if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__SYNC_INT64_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __atomic int32 atomic operations" >&5
+$as_echo_n "checking for builtin __atomic int32 atomic operations... " >&6; }
+if ${pgac_cv_gcc_atomic_int32_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int val = 0;
+   int expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_atomic_int32_cas="yes"
+else
+  pgac_cv_gcc_atomic_int32_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_atomic_int32_cas" >&5
+$as_echo "$pgac_cv_gcc_atomic_int32_cas" >&6; }
+if test x"$pgac_cv_gcc_atomic_int32_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__ATOMIC_INT32_CAS 1" >>confdefs.h
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for builtin __atomic int64 atomic operations" >&5
+$as_echo_n "checking for builtin __atomic int64 atomic operations... " >&6; }
+if ${pgac_cv_gcc_atomic_int64_cas+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+PG_INT64_TYPE val = 0;
+   PG_INT64_TYPE expect = 0;
+   __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_gcc_atomic_int64_cas="yes"
+else
+  pgac_cv_gcc_atomic_int64_cas="no"
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_gcc_atomic_int64_cas" >&5
+$as_echo "$pgac_cv_gcc_atomic_int64_cas" >&6; }
+if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then
+
+$as_echo "#define HAVE_GCC__ATOMIC_INT64_CAS 1" >>confdefs.h
+
+fi
 
 if test "$PORTNAME" != "win32"
 then
diff --git a/configure.in b/configure.in
index 1392277..527b076 100644
--- a/configure.in
+++ b/configure.in
@@ -179,6 +179,12 @@ PGAC_ARG_BOOL(enable, spinlocks, yes,
               [do not use spinlocks])
 
 #
+# Atomic operations
+#
+PGAC_ARG_BOOL(enable, atomics, yes,
+              [do not use atomic operations])
+
+#
 # --enable-debug adds -g to compiler flags
 #
 PGAC_ARG_BOOL(enable, debug, no,
@@ -936,6 +942,13 @@ else
 *** Not using spinlocks will cause poor performance.])
 fi
 
+if test "$enable_atomics" = yes; then
+  AC_DEFINE(HAVE_ATOMICS, 1, [Define to 1 if you want to use atomics if available.])
+else
+  AC_MSG_WARN([
+*** Not using atomic operations will cause poor performance.])
+fi
+
 if test "$with_gssapi" = yes ; then
   if test "$PORTNAME" != "win32"; then
     AC_SEARCH_LIBS(gss_init_sec_context, [gssapi_krb5 gss 'gssapi -lkrb5 -lcrypto'], [],
@@ -1003,7 +1016,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
@@ -1467,17 +1480,6 @@ fi
 AC_CHECK_FUNCS([strtoll strtoq], [break])
 AC_CHECK_FUNCS([strtoull strtouq], [break])
 
-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
-  [int lock = 0;
-   __sync_lock_test_and_set(&lock, 1);
-   __sync_lock_release(&lock);],
-  [pgac_cv_gcc_int_atomics="yes"],
-  [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
-  AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-
 # Lastly, restore full LIBS list and check for readline/libedit symbols
 LIBS="$LIBS_including_readline"
 
@@ -1746,6 +1748,14 @@ AC_CHECK_TYPES([int8, uint8, int64, uint64], [], [],
 # C, but is missing on some old platforms.
 AC_CHECK_TYPES(sig_atomic_t, [], [], [#include <signal.h>])
 
+# Check for various atomic operations now that we have checked how to declare
+# 64bit integers.
+PGAC_HAVE_GCC__SYNC_CHAR_TAS
+PGAC_HAVE_GCC__SYNC_INT32_TAS
+PGAC_HAVE_GCC__SYNC_INT32_CAS
+PGAC_HAVE_GCC__SYNC_INT64_CAS
+PGAC_HAVE_GCC__ATOMIC_INT32_CAS
+PGAC_HAVE_GCC__ATOMIC_INT64_CAS
 
 if test "$PORTNAME" != "win32"
 then
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 413c982..c51b391 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -30,6 +30,7 @@
 #include "common/username.h"
 #include "postmaster/postmaster.h"
 #include "storage/barrier.h"
+#include "storage/s_lock.h"
 #include "storage/spin.h"
 #include "tcop/tcopprot.h"
 #include "utils/help_config.h"
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index 15ea873..c6b1d20 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -21,7 +21,7 @@ subdir = src/backend/port
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
+OBJS = atomics.o dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
 
 ifeq ($(PORTNAME), darwin)
 SUBDIRS += darwin
diff --git a/src/backend/port/atomics.c b/src/backend/port/atomics.c
new file mode 100644
index 0000000..32255b0
--- /dev/null
+++ b/src/backend/port/atomics.c
@@ -0,0 +1,127 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.c
+ *	   Non-Inline parts of the atomics implementation
+ *
+ * Portions Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/port/atomics.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/*
+ * We want the functions below to be inline; but if the compiler doesn't
+ * support that, fall back on providing them as regular functions.	See
+ * STATIC_IF_INLINE in c.h.
+ */
+#define ATOMICS_INCLUDE_DEFINITIONS
+
+#include "port/atomics.h"
+#include "storage/spin.h"
+
+#ifdef PG_HAVE_MEMORY_BARRIER_EMULATION
+void
+pg_spinlock_barrier(void)
+{
+	S_LOCK(&dummy_spinlock);
+	S_UNLOCK(&dummy_spinlock);
+}
+#endif
+
+
+#ifdef PG_HAVE_ATOMIC_FLAG_SIMULATION
+
+void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	StaticAssertStmt(sizeof(ptr->sema) >= sizeof(slock_t),
+					 "size mismatch of atomic_flag vs slock_t");
+
+#ifndef HAVE_SPINLOCKS
+	/*
+	 * NB: If we're using semaphore based TAS emulation, be careful to use a
+	 * separate set of semaphores. Otherwise we'd get in trouble if a atomic
+	 * var would be manipulated while spinlock is held.
+	 */
+	s_init_lock_sema((slock_t *) &ptr->sema, true);
+#else
+	SpinLockInit((slock_t *) &ptr->sema);
+#endif
+}
+
+bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return TAS((slock_t *) &ptr->sema);
+}
+
+void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	S_UNLOCK((slock_t *) &ptr->sema);
+}
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+#ifdef PG_HAVE_ATOMIC_U32_SIMULATION
+void
+pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
+{
+	StaticAssertStmt(sizeof(ptr->sema) >= sizeof(slock_t),
+					 "size mismatch of atomic_flag vs slock_t");
+
+	/*
+	 * If we're using semaphore based atomic flags, be careful about nested
+	 * usage of atomics while a spinlock is held.
+	 */
+#ifndef HAVE_SPINLOCKS
+	s_init_lock_sema((slock_t *) &ptr->sema, true);
+#else
+	SpinLockInit((slock_t *) &ptr->sema);
+#endif
+	ptr->value = val_;
+}
+
+bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool ret;
+	/*
+	 * Do atomic op under a spinlock. It might look like we could just skip
+	 * the cmpxchg if the lock isn't available, but that'd just emulate a
+	 * 'weak' compare and swap. I.e. one that allows spurious failures. Since
+	 * several algorithms rely on a strong variant and that is efficiently
+	 * implementable on most major architectures let's emulate it here as
+	 * well.
+	 */
+	SpinLockAcquire((slock_t *) &ptr->sema);
+
+	/* perform compare/exchange logic*/
+	ret = ptr->value == *expected;
+	*expected = ptr->value;
+	if (ret)
+		ptr->value = newval;
+
+	/* and release lock */
+	SpinLockRelease((slock_t *) &ptr->sema);
+
+	return ret;
+}
+
+uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 oldval;
+	SpinLockAcquire((slock_t *) &ptr->sema);
+	oldval = ptr->value;
+	ptr->value += add_;
+	SpinLockRelease((slock_t *) &ptr->sema);
+	return oldval;
+}
+
+#endif /* PG_HAVE_ATOMIC_U32_SIMULATION */
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index e5a9d35..de3d922 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -67,7 +67,7 @@ SpinlockSemas(void)
 int
 SpinlockSemas(void)
 {
-	return NUM_SPINLOCK_SEMAPHORES;
+	return NUM_SPINLOCK_SEMAPHORES + NUM_ATOMICS_SEMAPHORES;
 }
 
 /*
@@ -77,8 +77,9 @@ extern void
 SpinlockSemaInit(PGSemaphore spinsemas)
 {
 	int			i;
+	int			nsemas = SpinlockSemas();
 
-	for (i = 0; i < NUM_SPINLOCK_SEMAPHORES; ++i)
+	for (i = 0; i < nsemas; ++i)
 		PGSemaphoreCreate(&spinsemas[i]);
 	SpinlockSemaArray = spinsemas;
 }
@@ -88,7 +89,7 @@ SpinlockSemaInit(PGSemaphore spinsemas)
  */
 
 void
-s_init_lock_sema(volatile slock_t *lock)
+s_init_lock_sema(volatile slock_t *lock, bool nested)
 {
 	static int	counter = 0;
 
diff --git a/src/include/c.h b/src/include/c.h
index 2ceaaf6..ce38d78 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -582,6 +582,7 @@ typedef NameData *Name;
 #define AssertMacro(condition)	((void)true)
 #define AssertArg(condition)
 #define AssertState(condition)
+#define AssertPointerAlignment(ptr, bndr)	((void)true)
 #define Trap(condition, errorType)
 #define TrapMacro(condition, errorType) (true)
 
@@ -592,6 +593,7 @@ typedef NameData *Name;
 #define AssertMacro(p)	((void) assert(p))
 #define AssertArg(condition) assert(condition)
 #define AssertState(condition) assert(condition)
+#define AssertPointerAlignment(ptr, bndr)	((void)true)
 #else							/* USE_ASSERT_CHECKING && !FRONTEND */
 
 /*
@@ -628,8 +630,15 @@ typedef NameData *Name;
 
 #define AssertState(condition) \
 		Trap(!(condition), "BadState")
-#endif   /* USE_ASSERT_CHECKING && !FRONTEND */
 
+/*
+ * Check that `ptr' is `bndr' aligned.
+ */
+#define AssertPointerAlignment(ptr, bndr) \
+	Trap(TYPEALIGN(bndr, (uintptr_t)(ptr)) != (uintptr_t)(ptr), \
+		 "UnalignedPointer")
+
+#endif   /* USE_ASSERT_CHECKING && !FRONTEND */
 
 /*
  * Macros to support compile-time assertion checks.
@@ -856,12 +865,22 @@ typedef NameData *Name;
  * The header must also declare the functions' prototypes, protected by
  * !PG_USE_INLINE.
  */
+
+/* declarations which are only visible when not inlining and in the .c file */
 #ifdef PG_USE_INLINE
 #define STATIC_IF_INLINE static inline
 #else
 #define STATIC_IF_INLINE
 #endif   /* PG_USE_INLINE */
 
+/* declarations which are marked inline when inlining, extern otherwise */
+#ifdef PG_USE_INLINE
+#define STATIC_IF_INLINE_DECLARE static inline
+#else
+#define STATIC_IF_INLINE_DECLARE extern
+#endif   /* PG_USE_INLINE */
+
+
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
  * ----------------------------------------------------------------
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 1ce37c8..ddcf4b0 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -87,6 +87,12 @@
 /* Define to 1 if you have the `append_history' function. */
 #undef HAVE_APPEND_HISTORY
 
+/* Define to 1 if you want to use atomics if available. */
+#undef HAVE_ATOMICS
+
+/* Define to 1 if you have the <atomic.h> header file. */
+#undef HAVE_ATOMIC_H
+
 /* Define to 1 if you have the `cbrt' function. */
 #undef HAVE_CBRT
 
@@ -173,8 +179,24 @@
 /* Define to 1 if your compiler understands __FUNCTION__. */
 #undef HAVE_FUNCNAME__FUNCTION
 
+/* Define to 1 if you have __atomic_compare_exchange_n(int *, int *, int). */
+#undef HAVE_GCC__ATOMIC_INT32_CAS
+
+/* Define to 1 if you have __atomic_compare_exchange_n(int64 *, int *, int64).
+   */
+#undef HAVE_GCC__ATOMIC_INT64_CAS
+
+/* Define to 1 if you have __sync_lock_test_and_set(char *) and friends. */
+#undef HAVE_GCC__SYNC_CHAR_TAS
+
+/* Define to 1 if you have __sync_compare_and_swap(int *, int, int). */
+#undef HAVE_GCC__SYNC_INT32_CAS
+
 /* Define to 1 if you have __sync_lock_test_and_set(int *) and friends. */
-#undef HAVE_GCC_INT_ATOMICS
+#undef HAVE_GCC__SYNC_INT32_TAS
+
+/* Define to 1 if you have __sync_compare_and_swap(int64 *, int64, int64). */
+#undef HAVE_GCC__SYNC_INT64_CAS
 
 /* Define to 1 if you have the `getaddrinfo' function. */
 #undef HAVE_GETADDRINFO
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index bce67a2..05941e6 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -334,6 +334,9 @@
 /* Define to 1 if you have spinlocks. */
 #define HAVE_SPINLOCKS 1
 
+/* Define to 1 if you have atomics. */
+#define HAVE_ATOMICS 1
+
 /* Define to 1 if you have the `srandom' function. */
 /* #undef HAVE_SRANDOM */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index d78f38e..195fb26 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -65,6 +65,14 @@
 #define NUM_SPINLOCK_SEMAPHORES		1024
 
 /*
+ * When we have neither spinlocks nor atomic operations support we're
+ * implementing atomic operations on top of spinlock on top of semaphores. To
+ * be safe against atomic operations while holding a spinlock separate
+ * semaphores have to be used.
+ */
+#define NUM_ATOMICS_SEMAPHORES		64
+
+/*
  * Define this if you want to allow the lo_import and lo_export SQL
  * functions to be executed by ordinary users.  By default these
  * functions are only available to the Postgres superuser.  CAUTION:
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
new file mode 100644
index 0000000..6a68aa2
--- /dev/null
+++ b/src/include/port/atomics.h
@@ -0,0 +1,531 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomics.h
+ *	  Atomic operations.
+ *
+ * Hardware and compiler dependent functions for manipulating memory
+ * atomically and dealing with cache coherency. Used to implement locking
+ * facilities and lockless algorithms/data structures.
+ *
+ * To bring up postgres on a platform/compiler at the very least
+ * implementations for the following operations should be provided:
+ * * pg_compiler_barrier(), pg_write_barrier(), pg_read_barrier()
+ * * pg_atomic_compare_exchange_u32(), pg_atomic_fetch_add_u32()
+ * * pg_atomic_test_set_flag(), pg_atomic_init_flag(), pg_atomic_clear_flag()
+ *
+ * There exist generic, hardware independent, implementations for several
+ * compilers which might be sufficient, although possibly not optimal, for a
+ * new platform. If no such generic implementation is available spinlocks (or
+ * even OS provided semaphores) will be used to implement the API.
+ *
+ * Implement the _u64 variantes if and only if your platform can use them
+ * efficiently (and obviously correctly).
+ *
+ * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
+ * whenever possible. Writing correct code using these facilities is hard.
+ *
+ * For an introduction to using memory barriers within the PostgreSQL backend,
+ * see src/backend/storage/lmgr/README.barrier
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/atomics.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ATOMICS_H
+#define ATOMICS_H
+
+#define INSIDE_ATOMICS_H
+
+#include <limits.h>
+
+/*
+ * First a set of architecture specific files is included.
+ *
+ * These files can provide the full set of atomics or can do pretty much
+ * nothing if all the compilers commonly used on these platforms provide
+ * useable generics.
+ *
+ * Don't add an inline assembly of the actual atomic operations if all the
+ * common implementations of your platform provide intrinsics. Intrinsics are
+ * much easier to understand and potentially support more architectures.
+ *
+ * It will often make sense to define memory barrier semantics here, since
+ * e.g. generic compiler intrinsics for x86 memory barriers can't know that
+ * postgres doesn't need x86 read/write barriers do anything more than a
+ * compiler barrier.
+ *
+ */
+#if defined(__arm__) || defined(__arm) || \
+	defined(__aarch64__) || defined(__aarch64)
+#	include "port/atomics/arch-arm.h"
+#elif defined(__i386__) || defined(__i386) || defined(__x86_64__)
+#	include "port/atomics/arch-x86.h"
+#elif defined(__ia64__) || defined(__ia64)
+#	include "port/atomics/arch-ia64.h"
+#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+#	include "port/atomics/arch-ppc.h"
+#elif defined(__hppa) || defined(__hppa__)
+#	include "port/atomics/arch-hppa.h"
+#endif
+
+/*
+ * Compiler specific, but architecture independent implementations.
+ *
+ * Provide architecture independent implementations of the atomic
+ * facilities. At the very least compiler barriers should be provided, but a
+ * full implementation of
+ * * pg_compiler_barrier(), pg_write_barrier(), pg_read_barrier()
+ * * pg_atomic_compare_exchange_u32(), pg_atomic_fetch_add_u32()
+ * using compiler intrinsics are a good idea.
+ */
+/* gcc or compatible, including clang and icc */
+#if defined(__GNUC__) || defined(__INTEL_COMPILER)
+#	include "port/atomics/generic-gcc.h"
+#elif defined(WIN32_ONLY_COMPILER)
+#	include "port/atomics/generic-msvc.h"
+#elif defined(__hpux) && defined(__ia64) && !defined(__GNUC__)
+#	include "port/atomics/generic-acc.h"
+#elif defined(__SUNPRO_C) && !defined(__GNUC__)
+#	include "port/atomics/generic-sunpro.h"
+#elif (defined(__IBMC__) || defined(__IBMCPP__)) && !defined(__GNUC__)
+#	include "port/atomics/generic-xlc.h"
+#else
+/*
+ * Unsupported compiler, we'll likely use slower fallbacks... At least
+ * compiler barriers should really be provided.
+ */
+#endif
+
+/*
+ * Provide a full fallback of the pg_*_barrier(), pg_atomic**_flag and
+ * pg_atomic_*_u32 APIs for platforms without sufficient spinlock and/or
+ * atomics support. In the case of spinlock backed atomics the emulation is
+ * expected to be efficient, although less so than native atomics support.
+ */
+#include "port/atomics/fallback.h"
+
+/*
+ * Provide additional operations using supported infrastructure. These are
+ * expected to be efficient if the underlying atomic operations are efficient.
+ */
+#include "port/atomics/generic.h"
+
+/*
+ * Provide declarations for all functions here - on most platforms static
+ * inlines are used and these aren't necessary, but when static inline is
+ * unsupported these will be external functions.
+ */
+STATIC_IF_INLINE_DECLARE void pg_atomic_init_flag(volatile pg_atomic_flag *ptr);
+STATIC_IF_INLINE_DECLARE bool pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
+STATIC_IF_INLINE_DECLARE bool pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
+STATIC_IF_INLINE_DECLARE void pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
+
+STATIC_IF_INLINE_DECLARE void pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr);
+STATIC_IF_INLINE_DECLARE void pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval);
+STATIC_IF_INLINE_DECLARE bool pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+															 uint32 *expected, uint32 newval);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
+STATIC_IF_INLINE_DECLARE uint32 pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+STATIC_IF_INLINE_DECLARE void pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr);
+STATIC_IF_INLINE_DECLARE void pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval);
+STATIC_IF_INLINE_DECLARE bool pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+															 uint64 *expected, uint64 newval);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, int64 add_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, int64 sub_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_);
+STATIC_IF_INLINE_DECLARE uint64 pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_);
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+
+/*
+ * pg_compiler_barrier - prevent the compiler from moving code across
+ *
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ */
+#define pg_compiler_barrier()	pg_compiler_barrier_impl()
+
+/*
+ * pg_memory_barrier - prevent the CPU from reordering memory access
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ */
+#define pg_memory_barrier()	pg_memory_barrier_impl()
+
+/*
+ * pg_(read|write)_barrier - prevent the CPU from reordering memory access
+ *
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier.	Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores.  Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier.  In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ */
+#define pg_read_barrier()	pg_read_barrier_impl()
+#define pg_write_barrier()	pg_write_barrier_impl()
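+
+/*
+ * Illustrative use of the barrier pairing (the shared struct and its fields
+ * here are hypothetical): a writer publishing data for another backend does
+ *
+ *		shared->data = value;
+ *		pg_write_barrier();
+ *		shared->ready = true;
+ *
+ * while the reader, after seeing 'ready' set, issues pg_read_barrier()
+ * before reading 'data', so it cannot observe the flag without the data.
+ */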
+
+/*
+ * Spinloop delay - Allow CPU to relax in busy loops
+ */
+#define pg_spin_delay()	pg_spin_delay_impl()
+
+/*
+ * The following functions are wrapper functions around the platform specific
+ * implementation of the atomic operations performing common checks.
+ */
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+/*
+ * pg_atomic_init_flag - initialize atomic flag.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE void
+pg_atomic_init_flag(volatile pg_atomic_flag *ptr)
+{
+	AssertPointerAlignment(ptr, sizeof(*ptr));
+
+	pg_atomic_init_flag_impl(ptr);
+}
+
+/*
+ * pg_atomic_test_set_flag - TAS()
+ *
+ * Returns true if the flag has successfully been set, false otherwise.
+ *
+ * Acquire (including read barrier) semantics.
+ */
+STATIC_IF_INLINE bool
+pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr)
+{
+	AssertPointerAlignment(ptr, sizeof(*ptr));
+
+	return pg_atomic_test_set_flag_impl(ptr);
+}
+
+/*
+ * pg_atomic_unlocked_test_flag - Check if the lock is free
+ *
+ * Returns true if the flag currently is not set, false otherwise.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE bool
+pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr)
+{
+	AssertPointerAlignment(ptr, sizeof(*ptr));
+
+	return pg_atomic_unlocked_test_flag_impl(ptr);
+}
+
+/*
+ * pg_atomic_clear_flag - release lock set by TAS()
+ *
+ * Release (including write barrier) semantics.
+ */
+STATIC_IF_INLINE void
+pg_atomic_clear_flag(volatile pg_atomic_flag *ptr)
+{
+	AssertPointerAlignment(ptr, sizeof(*ptr));
+
+	pg_atomic_clear_flag_impl(ptr);
+}
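+
+/*
+ * Taken together, the flag operations can be used like a simple TAS lock
+ * (sketch only, 'lock' being a hypothetical pg_atomic_flag in shared memory):
+ *
+ *		while (!pg_atomic_test_set_flag(&lock))
+ *			pg_spin_delay();
+ *		... critical section ...
+ *		pg_atomic_clear_flag(&lock);
+ */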
+
+
+/*
+ * pg_atomic_init_u32 - initialize atomic variable
+ *
+ * Has to be done before any concurrent usage.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE void
+pg_atomic_init_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	AssertPointerAlignment(ptr, 4);
+
+	pg_atomic_init_u32_impl(ptr, val);
+}
+
+/*
+ * pg_atomic_read_u32 - unlocked read from atomic variable.
+ *
+ * The read is guaranteed to return a value as it has been written by this or
+ * another process at some point in the past. There's however no cache
+ * coherency interaction guaranteeing the value hasn't since been written to
+ * again.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_read_u32_impl(ptr);
+}
+
+/*
+ * pg_atomic_write_u32 - unlocked write to atomic variable.
+ *
+ * The write is guaranteed to succeed as a whole, i.e. it's not possible to
+ * observe a partial write for any reader.
+ *
+ * No barrier semantics.
+ */
+STATIC_IF_INLINE void
+pg_atomic_write_u32(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	AssertPointerAlignment(ptr, 4);
+
+	pg_atomic_write_u32_impl(ptr, val);
+}
+
+/*
+ * pg_atomic_exchange_u32 - exchange newval with current value
+ *
+ * Returns the old value of 'ptr' before the swap.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_exchange_u32(volatile pg_atomic_uint32 *ptr, uint32 newval)
+{
+	AssertPointerAlignment(ptr, 4);
+
+	return pg_atomic_exchange_u32_impl(ptr, newval);
+}
+
+/*
+ * pg_atomic_compare_exchange_u32 - CAS operation
+ *
+ * Atomically compare the current value of ptr with *expected and store newval
+ * iff ptr and *expected have the same value. The current value of *ptr will
+ * always be stored in *expected.
+ *
+ * Return true if values have been exchanged, false otherwise.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
+							   uint32 *expected, uint32 newval)
+{
+	AssertPointerAlignment(ptr, 4);
+	AssertPointerAlignment(expected, 4);
+
+	return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
+}
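+
+/*
+ * A typical caller loops until the exchange succeeds, e.g. (sketch only,
+ * 'var' being a hypothetical pg_atomic_uint32, doubling its value):
+ *
+ *		uint32 expected = pg_atomic_read_u32(&var);
+ *		while (!pg_atomic_compare_exchange_u32(&var, &expected, expected * 2))
+ *			;
+ *
+ * On failure 'expected' has been updated to the current value, so each
+ * iteration simply recomputes the desired value and retries.
+ */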
+
+/*
+ * pg_atomic_fetch_add_u32 - atomically add to variable
+ *
+ * Returns the value of ptr before the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_fetch_add_u32_impl(ptr, add_);
+}
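+
+/*
+ * For example, a shared counter ('nrequests' being a hypothetical
+ * pg_atomic_uint32 in shared memory) can be bumped without a lock:
+ *
+ *		uint32 old = pg_atomic_fetch_add_u32(&nrequests, 1);
+ *
+ * 'old' contains the pre-increment value; use pg_atomic_add_fetch_u32() if
+ * the post-increment value is wanted instead.
+ */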
+
+/*
+ * pg_atomic_fetch_sub_u32 - atomically subtract from variable
+ *
+ * Returns the value of ptr before the arithmetic operation. Note that
+ * sub_ may not be INT_MIN due to platform limitations.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	AssertPointerAlignment(ptr, 4);
+	Assert(sub_ != INT_MIN);
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_);
+}
+
+/*
+ * pg_atomic_fetch_and_u32 - atomically bit-and and_ with variable
+ *
+ * Returns the value of ptr before the bitwise operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_and_u32(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_fetch_and_u32_impl(ptr, and_);
+}
+
+/*
+ * pg_atomic_fetch_or_u32 - atomically bit-or or_ with variable
+ *
+ * Returns the value of ptr before the bitwise operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_fetch_or_u32(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_fetch_or_u32_impl(ptr, or_);
+}
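+
+/*
+ * fetch_and/fetch_or are handy for atomically clearing and setting flag
+ * bits (sketch with a hypothetical state variable and flag bit):
+ *
+ *		pg_atomic_fetch_or_u32(&state, MYFLAG_BIT);		(sets the bit)
+ *		pg_atomic_fetch_and_u32(&state, ~MYFLAG_BIT);	(clears the bit)
+ *
+ * The returned old value can be used to check whether the bit changed.
+ */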
+
+/*
+ * pg_atomic_add_fetch_u32 - atomically add to variable
+ *
+ * Returns the value of ptr after the arithmetic operation.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_add_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	AssertPointerAlignment(ptr, 4);
+	return pg_atomic_add_fetch_u32_impl(ptr, add_);
+}
+
+/*
+ * pg_atomic_sub_fetch_u32 - atomically subtract from variable
+ *
+ * Returns the value of ptr after the arithmetic operation. Note that sub_
+ * may not be INT_MIN due to platform limitations.
+ *
+ * Full barrier semantics.
+ */
+STATIC_IF_INLINE uint32
+pg_atomic_sub_fetch_u32(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	AssertPointerAlignment(ptr, 4);
+	Assert(sub_ != INT_MIN);
+	return pg_atomic_sub_fetch_u32_impl(ptr, sub_);
+}
+
+/* ----
+ * The 64 bit operations have the same semantics as their 32bit counterparts
+ * if they are available. Check the corresponding 32bit function for
+ * documentation.
+ * ----
+ */
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+STATIC_IF_INLINE void
+pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	AssertPointerAlignment(ptr, 8);
+
+	pg_atomic_init_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_read_u64(volatile pg_atomic_uint64 *ptr)
+{
+	AssertPointerAlignment(ptr, 8);
+	return pg_atomic_read_u64_impl(ptr);
+}
+
+STATIC_IF_INLINE void
+pg_atomic_write_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	AssertPointerAlignment(ptr, 8);
+	pg_atomic_write_u64_impl(ptr, val);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_exchange_u64(volatile pg_atomic_uint64 *ptr, uint64 newval)
+{
+	AssertPointerAlignment(ptr, 8);
+
+	return pg_atomic_exchange_u64_impl(ptr, newval);
+}
+
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
+							   uint64 *expected, uint64 newval)
+{
+	AssertPointerAlignment(ptr, 8);
+	AssertPointerAlignment(expected, 8);
+	return pg_atomic_compare_exchange_u64_impl(ptr, expected, newval);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_add_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	AssertPointerAlignment(ptr, 8);
+	return pg_atomic_fetch_add_u64_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_sub_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	AssertPointerAlignment(ptr, 8);
+	Assert(sub_ != -INT64CONST(0x7FFFFFFFFFFFFFFF) - 1);
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_and_u64(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	AssertPointerAlignment(ptr, 8);
+	return pg_atomic_fetch_and_u64_impl(ptr, and_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_fetch_or_u64(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	AssertPointerAlignment(ptr, 8);
+	return pg_atomic_fetch_or_u64_impl(ptr, or_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_add_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	AssertPointerAlignment(ptr, 8);
+	return pg_atomic_add_fetch_u64_impl(ptr, add_);
+}
+
+STATIC_IF_INLINE uint64
+pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	AssertPointerAlignment(ptr, 8);
+	Assert(sub_ != -INT64CONST(0x7FFFFFFFFFFFFFFF) - 1);
+	return pg_atomic_sub_fetch_u64_impl(ptr, sub_);
+}
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
+
+#undef INSIDE_ATOMICS_H
+
+#endif /* ATOMICS_H */
diff --git a/src/include/port/atomics/arch-arm.h b/src/include/port/atomics/arch-arm.h
new file mode 100644
index 0000000..ec7665f
--- /dev/null
+++ b/src/include/port/atomics/arch-arm.h
@@ -0,0 +1,25 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-arm.h
+ *	  Atomic operations considerations specific to ARM
+ *
+ * Portions Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-arm.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/*
+ * 64 bit atomics on arm are implemented using kernel fallbacks and might be
+ * slow, so they are disabled entirely for now.
+ * XXX: We might want to change that at some point for AARCH64
+ */
+#define PG_DISABLE_64_BIT_ATOMICS
diff --git a/src/include/port/atomics/arch-hppa.h b/src/include/port/atomics/arch-hppa.h
new file mode 100644
index 0000000..0c6d31a
--- /dev/null
+++ b/src/include/port/atomics/arch-hppa.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-hppa.h
+ *	  Atomic operations considerations specific to HPPA
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-hppa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* HPPA doesn't do either read or write reordering */
+#define pg_memory_barrier_impl()		pg_compiler_barrier_impl()
diff --git a/src/include/port/atomics/arch-ia64.h b/src/include/port/atomics/arch-ia64.h
new file mode 100644
index 0000000..92a088e
--- /dev/null
+++ b/src/include/port/atomics/arch-ia64.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-ia64.h
+ *	  Atomic operations considerations specific to intel itanium
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-ia64.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#if defined(__INTEL_COMPILER)
+#	define pg_memory_barrier_impl()		__mf()
+#elif defined(__GNUC__)
+#	define pg_memory_barrier_impl()		__asm__ __volatile__ ("mf" : : : "memory")
+#elif defined(__hpux)
+#	define pg_memory_barrier_impl()		_Asm_mf()
+#endif
diff --git a/src/include/port/atomics/arch-ppc.h b/src/include/port/atomics/arch-ppc.h
new file mode 100644
index 0000000..250b6c4
--- /dev/null
+++ b/src/include/port/atomics/arch-ppc.h
@@ -0,0 +1,26 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-ppc.h
+ *	  Atomic operations considerations specific to PowerPC
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-ppc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(__GNUC__)
+
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
+#define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#endif
diff --git a/src/include/port/atomics/arch-x86.h b/src/include/port/atomics/arch-x86.h
new file mode 100644
index 0000000..b2127ad
--- /dev/null
+++ b/src/include/port/atomics/arch-x86.h
@@ -0,0 +1,241 @@
+/*-------------------------------------------------------------------------
+ *
+ * arch-x86.h
+ *	  Atomic operations considerations specific to intel x86
+ *
+ * Note that we actually require a 486 upwards because the 386 doesn't have
+ * support for xadd and cmpxchg. Given that the 386 isn't supported anywhere
+ * anymore, that's luckily not much of a restriction.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * src/include/port/atomics/arch-x86.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Both 32 and 64 bit x86 do not allow loads to be reordered with other loads,
+ * or stores to be reordered with other stores, but a load can be performed
+ * before a subsequent store.
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered.  In those cases we'd need
+ * the read and write barriers to be lfence and sfence.  But since we don't
+ * do those things, a compiler barrier should be enough.
+ *
+ * "lock; addl" has worked for longer than "mfence". It's also rumored to be
+ * faster in many scenarios.
+ */
+
+#if defined(__INTEL_COMPILER)
+#define pg_memory_barrier_impl()		_mm_mfence()
+#elif defined(__GNUC__) && (defined(__i386__) || defined(__i386))
+#define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory", "cc")
+#elif defined(__GNUC__) && defined(__x86_64__)
+#define pg_memory_barrier_impl()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory", "cc")
+#endif
+
+#define pg_read_barrier_impl()		pg_compiler_barrier_impl()
+#define pg_write_barrier_impl()		pg_compiler_barrier_impl()
+
+/*
+ * Provide implementations for atomics using inline assembly on x86 gcc. It's
+ * nice to support older gcc versions, and the compare/exchange implementation
+ * here is actually more efficient than the __sync variant.
+ */
+#if defined(HAVE_ATOMICS)
+
+#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
+
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef struct pg_atomic_flag
+{
+	volatile char value;
+} pg_atomic_flag;
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+/*
+ * It's too complicated to write inline asm for 64bit types on 32bit systems,
+ * and the 486 can't do it anyway.
+ */
+#ifdef __x86_64__
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+#endif
+
+#endif /* defined(__GNUC__) && !defined(__INTEL_COMPILER) */
+#endif /* defined(HAVE_ATOMICS) */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if !defined(PG_HAVE_SPIN_DELAY)
+/*
+ * This sequence is equivalent to the PAUSE instruction ("rep" is
+ * ignored by old IA32 processors if the following instruction is
+ * not a string operation); the IA-32 Architecture Software
+ * Developer's Manual, Vol. 3, Section 7.7.2 describes why using
+ * PAUSE in the inner loop of a spin lock is necessary for good
+ * performance:
+ *
+ *     The PAUSE instruction improves the performance of IA-32
+ *     processors supporting Hyper-Threading Technology when
+ *     executing spin-wait loops and other routines where one
+ *     thread is accessing a shared lock or semaphore in a tight
+ *     polling loop. When executing a spin-wait loop, the
+ *     processor can suffer a severe performance penalty when
+ *     exiting the loop because it detects a possible memory order
+ *     violation and flushes the core processor's pipeline. The
+ *     PAUSE instruction provides a hint to the processor that the
+ *     code sequence is a spin-wait loop. The processor uses this
+ *     hint to avoid the memory order violation and prevent the
+ *     pipeline flush. In addition, the PAUSE instruction
+ *     de-pipelines the spin-wait loop to prevent it from
+ *     consuming execution resources excessively.
+ */
+#if defined(__INTEL_COMPILER)
+#define PG_HAVE_SPIN_DELAY
+static inline void
+pg_spin_delay_impl(void)
+{
+	_mm_pause();
+}
+#elif defined(__GNUC__)
+#define PG_HAVE_SPIN_DELAY
+static __inline__ void
+pg_spin_delay_impl(void)
+{
+	__asm__ __volatile__(
+		" rep; nop			\n");
+}
+#elif defined(WIN32_ONLY_COMPILER) && defined(__x86_64__)
+#define PG_HAVE_SPIN_DELAY
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	_mm_pause();
+}
+#elif defined(WIN32_ONLY_COMPILER)
+#define PG_HAVE_SPIN_DELAY
+static __forceinline void
+pg_spin_delay_impl(void)
+{
+	/* See comment for gcc code. Same code, MASM syntax */
+	__asm rep nop;
+}
+#endif
+#endif /* !defined(PG_HAVE_SPIN_DELAY) */
+
+
+#if defined(HAVE_ATOMICS)
+
+/* inline assembly implementation for gcc */
+#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
+
+#define PG_HAVE_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	register char _res = 1;
+
+	__asm__ __volatile__(
+		"	lock			\n"
+		"	xchgb	%0,%1	\n"
+:		"+q"(_res), "+m"(ptr->value)
+:
+:		"memory");
+	return _res == 0;
+}
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	char	ret;
+
+	/*
+	 * Perform cmpxchg and use the zero flag which it implicitly sets when
+	 * equal to measure the success.
+	 */
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	cmpxchgl	%4,%5	\n"
+		"   setz		%2		\n"
+:		"=a" (*expected), "=m"(ptr->value), "=r" (ret)
+:		"a" (*expected), "r" (newval), "m"(ptr->value)
+:		"memory", "cc");
+	return (bool) ret;
+}
+
+#define PG_HAVE_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 res;
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	xaddl	%0,%1		\n"
+:		"=q"(res), "=m"(ptr->value)
+:		"0" (add_), "m"(ptr->value)
+:		"memory", "cc");
+	return res;
+}
+
+#ifdef __x86_64__
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	char	ret;
+
+	/*
+	 * Perform cmpxchg and use the zero flag which it implicitly sets when
+	 * equal to measure the success.
+	 */
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	cmpxchgq	%4,%5	\n"
+		"   setz		%2		\n"
+:		"=a" (*expected), "=m"(ptr->value), "=r" (ret)
+:		"a" (*expected), "r" (newval), "m"(ptr->value)
+:		"memory", "cc");
+	return (bool) ret;
+}
+
+#define PG_HAVE_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	uint64 res;
+	__asm__ __volatile__(
+		"	lock				\n"
+		"	xaddq	%0,%1		\n"
+:		"=q"(res), "=m"(ptr->value)
+:		"0" (add_), "m"(ptr->value)
+:		"memory", "cc");
+	return res;
+}
+
+#endif /* __x86_64__ */
+
+#endif /* defined(__GNUC__) && !defined(__INTEL_COMPILER) */
+
+#endif /* HAVE_ATOMICS */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/fallback.h b/src/include/port/atomics/fallback.h
new file mode 100644
index 0000000..fb97814
--- /dev/null
+++ b/src/include/port/atomics/fallback.h
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * fallback.h
+ *    Fallback for platforms without spinlock and/or atomics support. Slower
+ *    than native atomics support, but not unusably slow.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/atomics/fallback.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#	error "should be included via atomics.h"
+#endif
+
+#ifndef pg_memory_barrier_impl
+/*
+ * If we have no memory barrier implementation for this architecture, we
+ * fall back to acquiring and releasing a spinlock.  This might, in turn,
+ * fall back to the semaphore-based spinlock implementation, which will be
+ * amazingly slow.
+ *
+ * It's not self-evident that every possible legal implementation of a
+ * spinlock acquire-and-release would be equivalent to a full memory barrier.
+ * For example, I'm not sure that Itanium's acq and rel add up to a full
+ * fence.  But all of our actual implementations seem OK in this regard.
+ */
+#define PG_HAVE_MEMORY_BARRIER_EMULATION
+
+extern void pg_spinlock_barrier(void);
+#define pg_memory_barrier_impl pg_spinlock_barrier
+#endif
+
+/*
+ * If we have no native atomics implementation for this platform, fall back to
+ * providing the atomics API using a spinlock to protect the internal state.
+ * Possibly the spinlock implementation uses semaphores internally...
+ *
+ * We have to be a bit careful here, as it's not guaranteed that atomic
+ * variables are mapped to the same address in every process (e.g. dynamic
+ * shared memory segments). We can't just hash the address and use that to map
+ * to a spinlock. Instead assign a spinlock on initialization of the atomic
+ * variable.
+ */
+#if !defined(PG_HAVE_ATOMIC_FLAG_SUPPORT) && !defined(PG_HAVE_ATOMIC_U32_SUPPORT)
+
+#define PG_HAVE_ATOMIC_FLAG_SIMULATION
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+
+typedef struct pg_atomic_flag
+{
+	/*
+	 * To avoid circular includes we can't use s_lock as a type here. Instead
+	 * just reserve enough space for all spinlock types. Some platforms would
+	 * be content with just one byte instead of 4, but that's not too much
+	 * waste.
+	 */
+#if defined(__hppa) || defined(__hppa__)	/* HP PA-RISC, GCC and HP compilers */
+	int			sema[4];
+#else
+	int			sema;
+#endif
+} pg_atomic_flag;
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SUPPORT */
+
+#if !defined(PG_HAVE_ATOMIC_U32_SUPPORT)
+
+#define PG_HAVE_ATOMIC_U32_SIMULATION
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	/* Check pg_atomic_flag's definition above for an explanation */
+#if defined(__hppa) || defined(__hppa__)	/* HP PA-RISC, GCC and HP compilers */
+	int			sema[4];
+#else
+	int			sema;
+#endif
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#endif /* PG_HAVE_ATOMIC_U32_SUPPORT */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#ifdef PG_HAVE_ATOMIC_FLAG_SIMULATION
+
+#define PG_HAVE_ATOMIC_INIT_FLAG
+extern void pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAVE_ATOMIC_TEST_SET_FLAG
+extern bool pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAVE_ATOMIC_CLEAR_FLAG
+extern void pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr);
+
+#define PG_HAVE_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/*
+	 * Can't do this efficiently in the semaphore based implementation - we'd
+	 * have to try to acquire the semaphore - so always return true. That's
+	 * correct, because this is only an unlocked test anyway. Do this in the
+	 * header so compilers can optimize the test away.
+	 */
+	return true;
+}
+
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+#ifdef PG_HAVE_ATOMIC_U32_SIMULATION
+
+#define PG_HAVE_ATOMIC_INIT_U32
+extern void pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_);
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+extern bool pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+												uint32 *expected, uint32 newval);
+
+#define PG_HAVE_ATOMIC_FETCH_ADD_U32
+extern uint32 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_);
+
+#endif /* PG_HAVE_ATOMIC_U32_SIMULATION */
+
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic-acc.h b/src/include/port/atomics/generic-acc.h
new file mode 100644
index 0000000..627e2df
--- /dev/null
+++ b/src/include/port/atomics/generic-acc.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-acc.h
+ *	  Atomic operations support when using HP's aCC on HP-UX
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * inline assembly for Itanium-based HP-UX:
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/Itanium/inline_assem_ERS.pdf
+ * * Implementing Spinlocks on the Intel (R) Itanium (R) Architecture and PA-RISC
+ *   http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf
+ *
+ * Itanium only supports a small set of numbers (-16, -8, -4, -1, 1, 4, 8, 16)
+ * for atomic add/sub, so we just implement everything but compare_exchange
+ * via the compare_exchange fallbacks in atomics/generic.h.
+ *
+ * src/include/port/atomics/generic-acc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <machine/sys/inline.h>
+
+/* IA64 always has 32/64 bit atomics */
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#define pg_compiler_barrier_impl()	_Asm_sched_fence()
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define MINOR_FENCE (_Asm_fence) (_UP_CALL_FENCE | _UP_SYS_FENCE | \
+								 _DOWN_CALL_FENCE | _DOWN_SYS_FENCE )
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	/*
+	 * We want a barrier, not just release/acquire semantics.
+	 */
+	_Asm_mf();
+	/*
+	 * Notes:
+	 * DOWN_MEM_FENCE | _UP_MEM_FENCE prevents reordering by the compiler
+	 */
+	current =  _Asm_cmpxchg(_SZ_W, /* word */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+STATIC_IF_INLINE bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	_Asm_mov_to_ar(_AREG_CCV, *expected, MINOR_FENCE);
+	_Asm_mf();
+	current =  _Asm_cmpxchg(_SZ_D, /* doubleword */
+							_SEM_REL,
+							&ptr->value,
+							newval, _LDHINT_NONE,
+							_DOWN_MEM_FENCE | _UP_MEM_FENCE);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#undef MINOR_FENCE
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic-gcc.h b/src/include/port/atomics/generic-gcc.h
new file mode 100644
index 0000000..23fa2a6
--- /dev/null
+++ b/src/include/port/atomics/generic-gcc.h
@@ -0,0 +1,236 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-gcc.h
+ *	  Atomic operations, implemented using gcc (or compatible) intrinsics.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Legacy __sync Built-in Functions for Atomic Memory Access
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fsync-Builtins.html
+ * * Built-in functions for memory model aware atomic operations
+ *   http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fatomic-Builtins.html
+ *
+ * src/include/port/atomics/generic-gcc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/*
+ * icc provides all the same intrinsics but doesn't understand gcc's inline asm
+ */
+#if defined(__INTEL_COMPILER)
+/* NB: Yes, __memory_barrier() is actually just a compiler barrier */
+#define pg_compiler_barrier_impl()	__memory_barrier()
+#else
+#define pg_compiler_barrier_impl()	__asm__ __volatile__("" ::: "memory")
+#endif
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory barrier
+ * out of this compiler built-in.  But we prefer to rely on platform specific
+ * definitions where possible, and use this only as a fallback.
+ */
+#if !defined(pg_memory_barrier_impl)
+#	if defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#		define pg_memory_barrier_impl()		__atomic_thread_fence(__ATOMIC_SEQ_CST)
+#	elif (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
+#		define pg_memory_barrier_impl()		__sync_synchronize()
+#	endif
+#endif /* !defined(pg_memory_barrier_impl) */
+
+#if !defined(pg_read_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+/* acquire semantics include read barrier semantics */
+#		define pg_read_barrier_impl()		__atomic_thread_fence(__ATOMIC_ACQUIRE)
+#endif
+
+#if !defined(pg_write_barrier_impl) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+/* release semantics include write barrier semantics */
+#		define pg_write_barrier_impl()		__atomic_thread_fence(__ATOMIC_RELEASE)
+#endif
+
+#ifdef HAVE_ATOMICS
+
+/* generic gcc based atomic flag implementation */
+#if !defined(PG_HAVE_ATOMIC_FLAG_SUPPORT) \
+	&& (defined(HAVE_GCC__SYNC_INT32_TAS) || defined(HAVE_GCC__SYNC_CHAR_TAS))
+
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef struct pg_atomic_flag
+{
+	/* some platforms only have an 8 bit wide TAS */
+#ifdef HAVE_GCC__SYNC_CHAR_TAS
+	volatile char value;
+#else
+	/* but an int works on more platforms */
+	volatile int value;
+#endif
+} pg_atomic_flag;
+
+#endif /* !ATOMIC_FLAG_SUPPORT && SYNC_INT32_TAS */
+
+/* generic gcc based atomic uint32 implementation */
+#if !defined(PG_HAVE_ATOMIC_U32_SUPPORT) \
+	&& (defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS))
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#endif /* defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS) */
+
+/* generic gcc based atomic uint64 implementation */
+#if !defined(PG_HAVE_ATOMIC_U64_SUPPORT) \
+	&& !defined(PG_DISABLE_64_BIT_ATOMICS) \
+	&& (defined(HAVE_GCC__ATOMIC_INT64_CAS) || defined(HAVE_GCC__SYNC_INT64_CAS))
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#endif /* defined(HAVE_GCC__ATOMIC_INT64_CAS) || defined(HAVE_GCC__SYNC_INT64_CAS) */
+
+/*
+ * Implementation follows. Inlined or directly included from atomics.c
+ */
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#if !defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) && \
+	(defined(HAVE_GCC__SYNC_CHAR_TAS) || defined(HAVE_GCC__SYNC_INT32_TAS))
+#define PG_HAVE_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* NB: only an acquire barrier, not a full one */
+	/* some platforms only support storing a 1 here */
+	return __sync_lock_test_and_set(&ptr->value, 1) == 0;
+}
+#endif /* !defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) && defined(HAVE_GCC__SYNC_*_TAS) */
+
+#ifndef PG_HAVE_ATOMIC_UNLOCKED_TEST_FLAG
+#define PG_HAVE_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return ptr->value == 0;
+}
+#endif
+
+#ifndef PG_HAVE_ATOMIC_CLEAR_FLAG
+#define PG_HAVE_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/*
+	 * XXX: It would be nicer to use __sync_lock_release here, but gcc insists
+	 * on making that an atomic op which is far too expensive and a stronger
+	 * guarantee than what we actually need.
+	 */
+	pg_write_barrier_impl();
+	ptr->value = 0;
+}
+#endif
+
+#ifndef PG_HAVE_ATOMIC_INIT_FLAG
+#define PG_HAVE_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_clear_flag_impl(ptr);
+}
+#endif
+
+/* prefer __atomic, it has a better API */
+#if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	/* FIXME: we can probably use a lower consistency model */
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_ADD_U32) && defined(HAVE_GCC__SYNC_INT32_CAS)
+#define PG_HAVE_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return __sync_fetch_and_add(&ptr->value, add_);
+}
+#endif
+
+
+#if !defined(PG_DISABLE_64_BIT_ATOMICS)
+
+#if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64) && defined(HAVE_GCC__ATOMIC_INT64_CAS)
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
+									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_ADD_U64) && defined(HAVE_GCC__SYNC_INT64_CAS)
+#define PG_HAVE_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return __sync_fetch_and_add(&ptr->value, add_);
+}
+#endif
+
+#endif /* !defined(PG_DISABLE_64_BIT_ATOMICS) */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
+
+#endif /* defined(HAVE_ATOMICS) */
+
diff --git a/src/include/port/atomics/generic-msvc.h b/src/include/port/atomics/generic-msvc.h
new file mode 100644
index 0000000..747daa0
--- /dev/null
+++ b/src/include/port/atomics/generic-msvc.h
@@ -0,0 +1,103 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-msvc.h
+ *	  Atomic operations support when using MSVC
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Interlocked Variable Access
+ *   http://msdn.microsoft.com/en-us/library/ms684122%28VS.85%29.aspx
+ *
+ * src/include/port/atomics/generic-msvc.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <intrin.h>
+#include <windows.h>
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#error "should be included via atomics.h"
+#endif
+
+/* Should work on both MSVC and Borland. */
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier_impl()	_ReadWriteBarrier()
+
+#ifndef pg_memory_barrier_impl
+#define pg_memory_barrier_impl()	MemoryBarrier()
+#endif
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+	current = InterlockedCompareExchange(&ptr->value, newval, *expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#define PG_HAVE_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return InterlockedExchangeAdd(&ptr->value, add_);
+}
+
+/*
+ * The non-intrinsic versions are only available on Vista and upwards, so use
+ * the intrinsic version. It is only supported on >486, but we require XP as a
+ * minimum baseline, which doesn't run on a 486, so we don't need to add
+ * checks for that case.
+ */
+#pragma intrinsic(_InterlockedCompareExchange64)
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+	current = _InterlockedCompareExchange64(&ptr->value, newval, *expected);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+/* Only implemented on itanium and 64bit builds */
+#ifdef _WIN64
+#pragma intrinsic(_InterlockedExchangeAdd64)
+
+#define PG_HAVE_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return _InterlockedExchangeAdd64(&ptr->value, add_);
+}
+#endif /* _WIN64 */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic-sunpro.h b/src/include/port/atomics/generic-sunpro.h
new file mode 100644
index 0000000..b71b523
--- /dev/null
+++ b/src/include/port/atomics/generic-sunpro.h
@@ -0,0 +1,74 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-sunpro.h
+ *	  Atomic operations for solaris' CC
+ *
+ * Portions Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * manpage for atomic_cas(3C)
+ *   http://www.unix.com/man-page/opensolaris/3c/atomic_cas/
+ *   http://docs.oracle.com/cd/E23824_01/html/821-1465/atomic-cas-3c.html
+ *
+ * src/include/port/atomics/generic-sunpro.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+/* Older versions of the compiler don't have atomic.h... */
+#ifdef HAVE_ATOMIC_H
+
+#include <atomic.h>
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#endif /* HAVE_ATOMIC_H */
+
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#ifdef HAVE_ATOMIC_H
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+	uint32	current;
+
+	current = atomic_cas_32(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+	uint64	current;
+
+	current = atomic_cas_64(&ptr->value, *expected, newval);
+	ret = current == *expected;
+	*expected = current;
+	return ret;
+}
+
+#endif /* HAVE_ATOMIC_H */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
new file mode 100644
index 0000000..579554d
--- /dev/null
+++ b/src/include/port/atomics/generic-xlc.h
@@ -0,0 +1,103 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic-xlc.h
+ *	  Atomic operations for IBM's CC
+ *
+ * Portions Copyright (c) 2013-2014, PostgreSQL Global Development Group
+ *
+ * NOTES:
+ *
+ * Documentation:
+ * * Synchronization and atomic built-in functions
+ *   http://publib.boulder.ibm.com/infocenter/lnxpcomp/v8v101/topic/com.ibm.xlcpp8l.doc/compiler/ref/bif_sync.htm
+ *
+ * src/include/port/atomics/generic-xlc.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include <atomic.h>
+
+#define PG_HAVE_ATOMIC_U32_SUPPORT
+typedef struct pg_atomic_uint32
+{
+	volatile uint32 value;
+} pg_atomic_uint32;
+
+
+/* 64bit atomics are only supported in 64bit mode */
+#ifdef __64BIT__
+#define PG_HAVE_ATOMIC_U64_SUPPORT
+typedef struct pg_atomic_uint64
+{
+	volatile uint64 value;
+} pg_atomic_uint64;
+
+#endif /* __64BIT__ */
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
+static inline bool
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+									uint32 *expected, uint32 newval)
+{
+	bool	ret;
+
+	/*
+	 * xlc's documentation tells us:
+	 * "If __compare_and_swap is used as a locking primitive, insert a call to
+	 * the __isync built-in function at the start of any critical sections."
+	 */
+	__isync();
+
+	/*
+	 * XXX: __compare_and_swap is defined to take signed parameters, but that
+	 * shouldn't matter since we don't perform any arithmetic operations. It
+	 * stores the current value of *ptr into *expected and returns true iff
+	 * the swap was performed.
+	 */
+	ret = __compare_and_swap((volatile int*)&ptr->value,
+							 (int *)expected, (int)newval);
+	return ret;
+}
+
+#define PG_HAVE_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return __fetch_and_add(&ptr->value, add_);
+}
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+#define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64
+static inline bool
+pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
+									uint64 *expected, uint64 newval)
+{
+	bool	ret;
+
+	__isync();
+
+	ret = __compare_and_swaplp((volatile long*)&ptr->value,
+							   (long *)expected, (long)newval);
+	return ret;
+}
+
+#define PG_HAVE_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return __fetch_and_addlp(&ptr->value, add_);
+}
+
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/port/atomics/generic.h b/src/include/port/atomics/generic.h
new file mode 100644
index 0000000..fa75123
--- /dev/null
+++ b/src/include/port/atomics/generic.h
@@ -0,0 +1,387 @@
+/*-------------------------------------------------------------------------
+ *
+ * generic.h
+ *	  Implement higher level operations based on some lower level atomic
+ *	  operations.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/atomics/generic.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* intentionally no include guards, should only be included by atomics.h */
+#ifndef INSIDE_ATOMICS_H
+#	error "should be included via atomics.h"
+#endif
+
+/*
+ * If read or write barriers are undefined, we upgrade them to full memory
+ * barriers.
+ */
+#if !defined(pg_read_barrier_impl)
+#	define pg_read_barrier_impl pg_memory_barrier_impl
+#endif
+#if !defined(pg_write_barrier_impl)
+#	define pg_write_barrier_impl pg_memory_barrier_impl
+#endif
+
+#ifndef PG_HAVE_SPIN_DELAY
+#define PG_HAVE_SPIN_DELAY
+#define pg_spin_delay_impl()	((void)0)
+#endif
+
+
+/* provide fallback */
+#if !defined(PG_HAVE_ATOMIC_FLAG_SUPPORT) && defined(PG_HAVE_ATOMIC_U32_SUPPORT)
+#define PG_HAVE_ATOMIC_FLAG_SUPPORT
+typedef pg_atomic_uint32 pg_atomic_flag;
+#endif
+
+#if defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS)
+
+#ifndef PG_HAVE_ATOMIC_READ_U32
+#define PG_HAVE_ATOMIC_READ_U32
+static inline uint32
+pg_atomic_read_u32_impl(volatile pg_atomic_uint32 *ptr)
+{
+	return *(&ptr->value);
+}
+#endif
+
+#ifndef PG_HAVE_ATOMIC_WRITE_U32
+#define PG_HAVE_ATOMIC_WRITE_U32
+static inline void
+pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
+{
+	ptr->value = val;
+}
+#endif
+
+/*
+ * provide fallback for test_and_set using atomic_exchange if available
+ */
+#if !defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
+
+#define PG_HAVE_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAVE_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_exchange_u32_impl(ptr, 1) == 0;
+}
+
+#define PG_HAVE_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_read_u32_impl(ptr) == 0;
+}
+
+
+#define PG_HAVE_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier_impl();
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+/*
+ * provide fallback for test_and_set using atomic_compare_exchange if
+ * available.
+ */
+#elif !defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+
+#define PG_HAVE_ATOMIC_INIT_FLAG
+static inline void
+pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	pg_atomic_write_u32_impl(ptr, 0);
+}
+
+#define PG_HAVE_ATOMIC_TEST_SET_FLAG
+static inline bool
+pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	uint32 value = 0;
+	return pg_atomic_compare_exchange_u32_impl(ptr, &value, 1);
+}
+
+#define PG_HAVE_ATOMIC_UNLOCKED_TEST_FLAG
+static inline bool
+pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	return pg_atomic_read_u32_impl(ptr) == 0;
+}
+
+#define PG_HAVE_ATOMIC_CLEAR_FLAG
+static inline void
+pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
+{
+	/*
+	 * Use a memory barrier + plain write if we have a native memory
+	 * barrier. But don't do so if memory barriers use spinlocks - that'd lead
+	 * to circularity if flags are used to implement spinlocks.
+	 */
+#ifndef PG_HAVE_MEMORY_BARRIER_EMULATION
+	/* XXX: release semantics suffice? */
+	pg_memory_barrier_impl();
+	pg_atomic_write_u32_impl(ptr, 0);
+#else
+	uint32 value = 1;
+	pg_atomic_compare_exchange_u32_impl(ptr, &value, 0);
+#endif
+}
+
+#elif !defined(PG_HAVE_ATOMIC_TEST_SET_FLAG)
+#	error "No pg_atomic_test_and_set provided"
+#endif /* !defined(PG_HAVE_ATOMIC_TEST_SET_FLAG) */
+
+
+#ifndef PG_HAVE_ATOMIC_INIT_U32
+#define PG_HAVE_ATOMIC_INIT_U32
+static inline void
+pg_atomic_init_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val_)
+{
+	pg_atomic_write_u32_impl(ptr, val_);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_EXCHANGE_U32) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAVE_ATOMIC_EXCHANGE_U32
+static inline uint32
+pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_ADD_U32) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAVE_ATOMIC_FETCH_ADD_U32
+static inline uint32
+pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_SUB_U32) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAVE_ATOMIC_FETCH_SUB_U32
+static inline uint32
+pg_atomic_fetch_sub_u32_impl(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	return pg_atomic_fetch_add_u32_impl(ptr, -sub_);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U32) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAVE_ATOMIC_FETCH_AND_U32
+static inline uint32
+pg_atomic_fetch_and_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 and_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U32) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
+#define PG_HAVE_ATOMIC_FETCH_OR_U32
+static inline uint32
+pg_atomic_fetch_or_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 or_)
+{
+	uint32 old;
+	while (true)
+	{
+		old = pg_atomic_read_u32_impl(ptr);
+		if (pg_atomic_compare_exchange_u32_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_ADD_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_ADD_U32)
+#define PG_HAVE_ATOMIC_ADD_FETCH_U32
+static inline uint32
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+	return pg_atomic_fetch_add_u32_impl(ptr, add_) + add_;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U32) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U32)
+#define PG_HAVE_ATOMIC_SUB_FETCH_U32
+static inline uint32
+pg_atomic_sub_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 sub_)
+{
+	return pg_atomic_fetch_sub_u32_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+
+#if !defined(PG_HAVE_ATOMIC_EXCHANGE_U64) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAVE_ATOMIC_EXCHANGE_U64
+static inline uint64
+pg_atomic_exchange_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 xchg_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = ptr->value;
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, xchg_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#ifndef PG_HAVE_ATOMIC_WRITE_U64
+#define PG_HAVE_ATOMIC_WRITE_U64
+static inline void
+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
+{
+	/*
+	 * 64 bit writes aren't safe on all platforms. The generic implementation
+	 * therefore implements them as an atomic exchange.
+	 */
+	pg_atomic_exchange_u64_impl(ptr, val);
+}
+#endif
+
+#ifndef PG_HAVE_ATOMIC_READ_U64
+#define PG_HAVE_ATOMIC_READ_U64
+static inline uint64
+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
+	uint64 old = 0;
+
+	/*
+	 * 64 bit reads aren't safe on all platforms. The generic implementation
+	 * therefore implements them as a compare/exchange with 0. That'll fail or
+	 * succeed, but always return the old value. It might possibly store a 0,
+	 * but only if the previous value also was a 0 - i.e. harmless.
+	 */
+	pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);
+
+	return old;
+}
+#endif
+
+#ifndef PG_HAVE_ATOMIC_INIT_U64
+#define PG_HAVE_ATOMIC_INIT_U64
+static inline void
+pg_atomic_init_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val_)
+{
+	pg_atomic_write_u64_impl(ptr, val_);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_ADD_U64) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAVE_ATOMIC_FETCH_ADD_U64
+static inline uint64
+pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old + add_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_SUB_U64) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAVE_ATOMIC_FETCH_SUB_U64
+static inline uint64
+pg_atomic_fetch_sub_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	return pg_atomic_fetch_add_u64_impl(ptr, -sub_);
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_AND_U64) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAVE_ATOMIC_FETCH_AND_U64
+static inline uint64
+pg_atomic_fetch_and_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 and_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old & and_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_FETCH_OR_U64) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64)
+#define PG_HAVE_ATOMIC_FETCH_OR_U64
+static inline uint64
+pg_atomic_fetch_or_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 or_)
+{
+	uint64 old;
+	while (true)
+	{
+		old = pg_atomic_read_u64_impl(ptr);
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &old, old | or_))
+			break;
+	}
+	return old;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_ADD_FETCH_U64) && defined(PG_HAVE_ATOMIC_FETCH_ADD_U64)
+#define PG_HAVE_ATOMIC_ADD_FETCH_U64
+static inline uint64
+pg_atomic_add_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
+{
+	return pg_atomic_fetch_add_u64_impl(ptr, add_) + add_;
+}
+#endif
+
+#if !defined(PG_HAVE_ATOMIC_SUB_FETCH_U64) && defined(PG_HAVE_ATOMIC_FETCH_SUB_U64)
+#define PG_HAVE_ATOMIC_SUB_FETCH_U64
+static inline uint64
+pg_atomic_sub_fetch_u64_impl(volatile pg_atomic_uint64 *ptr, int64 sub_)
+{
+	return pg_atomic_fetch_sub_u64_impl(ptr, sub_) - sub_;
+}
+#endif
+
+#endif /* PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U64 */
+
+#endif /* defined(PG_USE_INLINE) || defined(ATOMICS_INCLUDE_DEFINITIONS) */
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index abb82c2..b36705b 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -13,158 +13,11 @@
 #ifndef BARRIER_H
 #define BARRIER_H
 
-#include "storage/s_lock.h"
-
-extern slock_t dummy_spinlock;
-
-/*
- * A compiler barrier need not (and preferably should not) emit any actual
- * machine code, but must act as an optimization fence: the compiler must not
- * reorder loads or stores to main memory around the barrier.  However, the
- * CPU may still reorder loads or stores at runtime, if the architecture's
- * memory model permits this.
- *
- * A memory barrier must act as a compiler barrier, and in addition must
- * guarantee that all loads and stores issued prior to the barrier are
- * completed before any loads or stores issued after the barrier.  Unless
- * loads and stores are totally ordered (which is not the case on most
- * architectures) this requires issuing some sort of memory fencing
- * instruction.
- *
- * A read barrier must act as a compiler barrier, and in addition must
- * guarantee that any loads issued prior to the barrier are completed before
- * any loads issued after the barrier.  Similarly, a write barrier acts
- * as a compiler barrier, and also orders stores.  Read and write barriers
- * are thus weaker than a full memory barrier, but stronger than a compiler
- * barrier.  In practice, on machines with strong memory ordering, read and
- * write barriers may require nothing more than a compiler barrier.
- *
- * For an introduction to using memory barriers within the PostgreSQL backend,
- * see src/backend/storage/lmgr/README.barrier
- */
-
-#if defined(DISABLE_BARRIERS)
-
-/*
- * Fall through to the spinlock-based implementation.
- */
-#elif defined(__INTEL_COMPILER)
-
-/*
- * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
- */
-#if defined(__ia64__) || defined(__ia64)
-#define pg_memory_barrier()		__mf()
-#elif defined(__i386__) || defined(__x86_64__)
-#define pg_memory_barrier()		_mm_mfence()
-#endif
-
-#define pg_compiler_barrier()	__memory_barrier()
-#elif defined(__GNUC__)
-
-/* This works on any architecture, since it's only talking to GCC itself. */
-#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
-
-#if defined(__i386__)
-
-/*
- * i386 does not allow loads to be reordered with other loads, or stores to be
- * reordered with other stores, but a load can be performed before a subsequent
- * store.
- *
- * "lock; addl" has worked for longer than "mfence".
- */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory", "cc")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__x86_64__)		/* 64 bit x86 */
-
 /*
- * x86_64 has similar ordering characteristics to i386.
- *
- * Technically, some x86-ish chips support uncached memory access and/or
- * special instructions that are weakly ordered.  In those cases we'd need
- * the read and write barriers to be lfence and sfence.  But since we don't
- * do those things, a compiler barrier should be enough.
- */
-#define pg_memory_barrier()		\
-	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory", "cc")
-#define pg_read_barrier()		pg_compiler_barrier()
-#define pg_write_barrier()		pg_compiler_barrier()
-#elif defined(__ia64__) || defined(__ia64)
-
-/*
- * Itanium is weakly ordered, so read and write barriers require a full
- * fence.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")
-#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
-
-/*
- * lwsync orders loads with respect to each other, and similarly with stores.
- * But a load can be performed before a subsequent store, so sync must be used
- * for a full memory barrier.
- */
-#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
-#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
-
-#elif defined(__hppa) || defined(__hppa__)		/* HP PA-RISC */
-
-/* HPPA doesn't do either read or write reordering */
-#define pg_memory_barrier()		pg_compiler_barrier()
-#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
-
-/*
- * If we're on GCC 4.1.0 or higher, we should be able to get a memory
- * barrier out of this compiler built-in.  But we prefer to rely on our
- * own definitions where possible, and use this only as a fallback.
- */
-#define pg_memory_barrier()		__sync_synchronize()
-#endif
-#elif defined(__ia64__) || defined(__ia64)
-
-#define pg_compiler_barrier()	_Asm_sched_fence()
-#define pg_memory_barrier()		_Asm_mf()
-#elif defined(WIN32_ONLY_COMPILER)
-
-/* Should work on both MSVC and Borland. */
-#include <intrin.h>
-#pragma intrinsic(_ReadWriteBarrier)
-#define pg_compiler_barrier()	_ReadWriteBarrier()
-#define pg_memory_barrier()		MemoryBarrier()
-#endif
-
-/*
- * If we have no memory barrier implementation for this architecture, we
- * fall back to acquiring and releasing a spinlock.  This might, in turn,
- * fall back to the semaphore-based spinlock implementation, which will be
- * amazingly slow.
- *
- * It's not self-evident that every possible legal implementation of a
- * spinlock acquire-and-release would be equivalent to a full memory barrier.
- * For example, I'm not sure that Itanium's acq and rel add up to a full
- * fence.  But all of our actual implementations seem OK in this regard.
- */
-#if !defined(pg_memory_barrier)
-#define pg_memory_barrier() \
-	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
-#endif
-
-/*
- * If read or write barriers are undefined, we upgrade them to full memory
- * barriers.
- *
- * If a compiler barrier is unavailable, you probably don't want a full
- * memory barrier instead, so if you have a use case for a compiler barrier,
- * you'd better use #ifdef.
+ * This used to be a separate file, full of compiler/architecture
+ * dependent defines, but its contents are now provided by the atomics.h
+ * infrastructure and the header is just kept for backward compatibility.
  */
-#if !defined(pg_read_barrier)
-#define pg_read_barrier()			pg_memory_barrier()
-#endif
-#if !defined(pg_write_barrier)
-#define pg_write_barrier()			pg_memory_barrier()
-#endif
+#include "port/atomics.h"
 
 #endif   /* BARRIER_H */
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 5f0ad7a..79bc125 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -309,7 +309,7 @@ tas(volatile slock_t *lock)
  * than other widths.
  */
 #if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC_INT_ATOMICS
+#ifdef HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
 
 #define TAS(lock) tas(lock)
@@ -324,7 +324,7 @@ tas(volatile slock_t *lock)
 
 #define S_UNLOCK(lock) __sync_lock_release(lock)
 
-#endif	 /* HAVE_GCC_INT_ATOMICS */
+#endif	 /* HAVE_GCC__SYNC_INT32_TAS */
 #endif	 /* __arm__ || __arm || __aarch64__ || __aarch64 */
 
 
@@ -889,12 +889,12 @@ typedef int slock_t;
 
 extern bool s_lock_free_sema(volatile slock_t *lock);
 extern void s_unlock_sema(volatile slock_t *lock);
-extern void s_init_lock_sema(volatile slock_t *lock);
+extern void s_init_lock_sema(volatile slock_t *lock, bool nested);
 extern int	tas_sema(volatile slock_t *lock);
 
 #define S_LOCK_FREE(lock)	s_lock_free_sema(lock)
 #define S_UNLOCK(lock)	 s_unlock_sema(lock)
-#define S_INIT_LOCK(lock)	s_init_lock_sema(lock)
+#define S_INIT_LOCK(lock)	s_init_lock_sema(lock, false)
 #define TAS(lock)	tas_sema(lock)
 
 
@@ -955,6 +955,7 @@ extern int	tas(volatile slock_t *lock);		/* in port/.../tas.s, or
 #define TAS_SPIN(lock)	TAS(lock)
 #endif	 /* TAS_SPIN */
 
+extern slock_t dummy_spinlock;
 
 /*
  * Platform-independent out-of-line support routines
diff --git a/src/test/regress/expected/lock.out b/src/test/regress/expected/lock.out
index 0d7c3ba..fd27344 100644
--- a/src/test/regress/expected/lock.out
+++ b/src/test/regress/expected/lock.out
@@ -60,3 +60,11 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
+ test_atomic_ops 
+-----------------
+ t
+(1 row)
+
diff --git a/src/test/regress/input/create_function_1.source b/src/test/regress/input/create_function_1.source
index aef1518..1fded84 100644
--- a/src/test/regress/input/create_function_1.source
+++ b/src/test/regress/input/create_function_1.source
@@ -57,6 +57,11 @@ CREATE FUNCTION make_tuple_indirect (record)
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
 
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
+
 -- Things that shouldn't work:
 
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
diff --git a/src/test/regress/output/create_function_1.source b/src/test/regress/output/create_function_1.source
index 9761d12..9926c90 100644
--- a/src/test/regress/output/create_function_1.source
+++ b/src/test/regress/output/create_function_1.source
@@ -51,6 +51,10 @@ CREATE FUNCTION make_tuple_indirect (record)
         RETURNS record
         AS '@libdir@/regress@DLSUFFIX@'
         LANGUAGE C STRICT;
+CREATE FUNCTION test_atomic_ops()
+    RETURNS bool
+    AS '@libdir@/regress@DLSUFFIX@'
+    LANGUAGE C;
 -- Things that shouldn't work:
 CREATE FUNCTION test1 (int) RETURNS int LANGUAGE SQL
     AS 'SELECT ''not an integer'';';
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index 09e027c..1487171 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -18,6 +18,7 @@
 #include "executor/executor.h"
 #include "executor/spi.h"
 #include "miscadmin.h"
+#include "port/atomics.h"
 #include "utils/builtins.h"
 #include "utils/geo_decls.h"
 #include "utils/rel.h"
@@ -865,3 +866,241 @@ wait_pid(PG_FUNCTION_ARGS)
 
 	PG_RETURN_VOID();
 }
+
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+static void
+test_atomic_flag(void)
+{
+	pg_atomic_flag flag;
+
+	pg_atomic_init_flag(&flag);
+
+	if (!pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unexpectedly set");
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+
+	if (pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unexpectedly unset");
+
+	if (pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: set spuriously #2");
+
+	pg_atomic_clear_flag(&flag);
+
+	if (!pg_atomic_unlocked_test_flag(&flag))
+		elog(ERROR, "flag: unexpectedly set #2");
+
+	if (!pg_atomic_test_set_flag(&flag))
+		elog(ERROR, "flag: couldn't set");
+
+	pg_atomic_clear_flag(&flag);
+}
+#endif /* PG_HAVE_ATOMIC_FLAG_SIMULATION */
+
+static void
+test_atomic_uint32(void)
+{
+	pg_atomic_uint32 var;
+	uint32 expected;
+	int i;
+
+	pg_atomic_init_u32(&var, 0);
+
+	if (pg_atomic_read_u32(&var) != 0)
+		elog(ERROR, "atomic_read_u32() #1 wrong");
+
+	pg_atomic_write_u32(&var, 3);
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "atomic_read_u32() #2 wrong");
+
+	if (pg_atomic_fetch_add_u32(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u32() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u32(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u32() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u32(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u32() #1 wrong");
+
+	if (pg_atomic_add_fetch_u32(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u32() #1 wrong");
+
+	if (pg_atomic_exchange_u32(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u32() #0 wrong");
+
+	/* test around numerical limits */
+	if (pg_atomic_fetch_add_u32(&var, INT_MAX) != 0)
+		elog(ERROR, "pg_atomic_fetch_add_u32() #2 wrong");
+
+	if (pg_atomic_fetch_add_u32(&var, INT_MAX) != INT_MAX)
+		elog(ERROR, "pg_atomic_add_fetch_u32() #3 wrong");
+
+	pg_atomic_fetch_add_u32(&var, 1); /* top up to UINT_MAX */
+
+	if (pg_atomic_read_u32(&var) != UINT_MAX)
+		elog(ERROR, "atomic_read_u32() #2 wrong");
+
+	if (pg_atomic_fetch_sub_u32(&var, INT_MAX) != UINT_MAX)
+		elog(ERROR, "pg_atomic_fetch_sub_u32() #2 wrong");
+
+	if (pg_atomic_read_u32(&var) != (uint32)INT_MAX + 1)
+		elog(ERROR, "atomic_read_u32() #3 wrong: %u", pg_atomic_read_u32(&var));
+
+	expected = pg_atomic_sub_fetch_u32(&var, INT_MAX);
+	if (expected != 1)
+		elog(ERROR, "pg_atomic_sub_fetch_u32() #3 wrong: %u", expected);
+
+	pg_atomic_sub_fetch_u32(&var, 1);
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u32(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u32() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 1000; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u32(&var, &expected, 1))
+			break;
+	}
+	if (i == 1000)
+		elog(ERROR, "atomic_compare_exchange_u32() never succeeded");
+	if (pg_atomic_read_u32(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u32() didn't set value properly");
+
+	pg_atomic_write_u32(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u32(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u32() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u32(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u32() #2 wrong");
+
+	if (pg_atomic_read_u32(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u32()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u32(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #1 wrong");
+
+	if (pg_atomic_fetch_and_u32(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #2 wrong: is %u",
+			 pg_atomic_read_u32(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u32(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u32() #3 wrong");
+}
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+static void
+test_atomic_uint64(void)
+{
+	pg_atomic_uint64 var;
+	uint64 expected;
+	int i;
+
+	pg_atomic_init_u64(&var, 0);
+
+	if (pg_atomic_read_u64(&var) != 0)
+		elog(ERROR, "atomic_read_u64() #1 wrong");
+
+	pg_atomic_write_u64(&var, 3);
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "atomic_read_u64() #2 wrong");
+
+	if (pg_atomic_fetch_add_u64(&var, 1) != 3)
+		elog(ERROR, "atomic_fetch_add_u64() #1 wrong");
+
+	if (pg_atomic_fetch_sub_u64(&var, 1) != 4)
+		elog(ERROR, "atomic_fetch_sub_u64() #1 wrong");
+
+	if (pg_atomic_sub_fetch_u64(&var, 3) != 0)
+		elog(ERROR, "atomic_sub_fetch_u64() #1 wrong");
+
+	if (pg_atomic_add_fetch_u64(&var, 10) != 10)
+		elog(ERROR, "atomic_add_fetch_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 5) != 10)
+		elog(ERROR, "pg_atomic_exchange_u64() #1 wrong");
+
+	if (pg_atomic_exchange_u64(&var, 0) != 5)
+		elog(ERROR, "pg_atomic_exchange_u64() #0 wrong");
+
+	/* fail exchange because of old expected */
+	expected = 10;
+	if (pg_atomic_compare_exchange_u64(&var, &expected, 1))
+		elog(ERROR, "atomic_compare_exchange_u64() changed value spuriously");
+
+	/* CAS is allowed to fail due to interrupts, try a couple of times */
+	for (i = 0; i < 100; i++)
+	{
+		expected = 0;
+		if (!pg_atomic_compare_exchange_u64(&var, &expected, 1))
+			break;
+	}
+	if (i == 100)
+		elog(ERROR, "atomic_compare_exchange_u64() never succeeded");
+	if (pg_atomic_read_u64(&var) != 1)
+		elog(ERROR, "atomic_compare_exchange_u64() didn't set value properly");
+
+	pg_atomic_write_u64(&var, 0);
+
+	/* try setting flagbits */
+	if (pg_atomic_fetch_or_u64(&var, 1) & 1)
+		elog(ERROR, "pg_atomic_fetch_or_u64() #1 wrong");
+
+	if (!(pg_atomic_fetch_or_u64(&var, 2) & 1))
+		elog(ERROR, "pg_atomic_fetch_or_u64() #2 wrong");
+
+	if (pg_atomic_read_u64(&var) != 3)
+		elog(ERROR, "invalid result after pg_atomic_fetch_or_u64()");
+
+	/* try clearing flagbits */
+	if ((pg_atomic_fetch_and_u64(&var, ~2) & 3) != 3)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #1 wrong");
+
+	if (pg_atomic_fetch_and_u64(&var, ~1) != 1)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #2 wrong: is "UINT64_FORMAT,
+			 pg_atomic_read_u64(&var));
+	/* no bits set anymore */
+	if (pg_atomic_fetch_and_u64(&var, ~0) != 0)
+		elog(ERROR, "pg_atomic_fetch_and_u64() #3 wrong");
+}
+#endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
+
+
+PG_FUNCTION_INFO_V1(test_atomic_ops);
+Datum
+test_atomic_ops(PG_FUNCTION_ARGS)
+{
+	/* ---
+	 * Can't run the test under the semaphore emulation, it doesn't handle
+	 * checking two edge cases well:
+	 * - pg_atomic_unlocked_test_flag() always returns true
+	 * - locking an already locked flag blocks
+	 * It seems better not to test the semaphore fallback here than to weaken
+	 * the checks for the other cases. The semaphore code will be the same
+	 * everywhere, whereas the efficient implementations won't.
+	 * ---
+	 */
+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
+	test_atomic_flag();
+#endif
+
+	test_atomic_uint32();
+
+#ifdef PG_HAVE_ATOMIC_U64_SUPPORT
+	test_atomic_uint64();
+#endif
+
+	PG_RETURN_BOOL(true);
+}
diff --git a/src/test/regress/sql/lock.sql b/src/test/regress/sql/lock.sql
index dda212f..567e8bc 100644
--- a/src/test/regress/sql/lock.sql
+++ b/src/test/regress/sql/lock.sql
@@ -64,3 +64,8 @@ DROP TABLE lock_tbl2;
 DROP TABLE lock_tbl1;
 DROP SCHEMA lock_schema1 CASCADE;
 DROP ROLE regress_rol_lock1;
+
+
+-- atomic ops tests
+RESET search_path;
+SELECT test_atomic_ops();
-- 
2.0.0.rc2.4.g1dc51c6.dirty

#111Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#110)
Re: better atomics - v0.6

Andres,

* Andres Freund (andres@anarazel.de) wrote:

Heikki has marked the patch as 'ready for committer' in the commitfest
(conditional on his remarks being addressed) and I agree. There seems to
be little benefit in waiting further. There *definitely* will be some
platform-dependent issues, but that won't change by waiting longer.

Agreed. I won't claim to have done as good a review as others, but I've
been following along as best I've been able to.

I plan to commit this quite soon unless somebody protests really
quickly.

+1 from me for at least moving forward to see what happens.

Sorting out the issues on platforms I don't have access to based on
buildfarm feedback will take a while given how infrequently some of the
respective animals run... But that's just life.

The buildfarm (like Coverity) is another tool that provides a lot of
insight into issues we may not realize ahead of time, but using those
tools (currently) means we eventually have to go beyond looking at the
code individually and actually commit it.

I'm very excited to see this progress and trust you, Heikki, Robert, and
the others involved in this work to have done an excellent job and to be
ready and prepared to address any issues which come from it, or to pull
it out if it turns out to be necessary (which I seriously doubt, but
still).

Thanks!

Stephen

#112Peter Geoghegan
pg@heroku.com
In reply to: Stephen Frost (#111)
Re: better atomics - v0.6

On Thu, Sep 25, 2014 at 3:23 PM, Stephen Frost <sfrost@snowman.net> wrote:

* Andres Freund (andres@anarazel.de) wrote:

Heikki has marked the patch as 'ready for committer' in the commitfest
(conditional on his remarks being addressed) and I agree. There seems to
be little benefit in waiting further. There *definitely* will be some
platform-dependent issues, but that won't change by waiting longer.

Agreed. I won't claim to have done as good a review as others, but I've
been following along as best I've been able to.

FWIW, I agree that you've already done plenty to ensure that this
works well on common platforms. We cannot reasonably expect you to be
just as sure that there are not problems on obscure platforms ahead of
commit. It isn't unheard of for certain committers (who don't like
Windows, say) to commit things despite having a suspicion that their
patch will turn some fraction of the buildfarm red. To a certain
extent, that's what we have it for.

--
Peter Geoghegan


#113Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#110)
Re: better atomics - v0.6

On Thu, Sep 25, 2014 at 6:03 PM, Andres Freund <andres@anarazel.de> wrote:

On 2014-09-24 20:27:39 +0200, Andres Freund wrote:

On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote:
I won't repost a version with it removed, as removing a function as the
only change doesn't seem to warrant reposting it.

I've fixed (thanks Alvaro!) some minor additional issues besides the
removal and addressing earlier comments from Heikki:
* removal of two remaining non-ascii copyright signs. The only non-ascii
character that remains is in Alvaro's name in the commit message.
* additional comments for STATIC_IF_INLINE/STATIC_IF_INLINE_DECLARE
* there was a leftover HAVE_GCC_INT_ATOMICS reference - it's now split
into different configure tests and thus has a different name. A later
patch entirely removes that reference, which is why I'd missed that...

Heikki has marked the patch as 'ready for committer' in the commitfest
(conditional on his remarks being addressed) and I agree. There seems to
be little benefit in waiting further. There *definitely* will be some
platform-dependent issues, but that won't change by waiting longer.

I plan to commit this quite soon unless somebody protests really
quickly.

Sorting out the issues on platforms I don't have access to based on
buildfarm feedback will take a while given how infrequently some of the
respective animals run... But that's just life.

I feel like this could really use a developer README: what primitives
do we have, which ones are likely to be efficient or inefficient on
which platforms, how do atomics interact with barriers, etc.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#114Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#113)
Re: better atomics - v0.6

On 2014-09-25 21:02:28 -0400, Robert Haas wrote:

I feel like this could really use a developer README: what primitives
do we have, which ones are likely to be efficient or inefficient on
which platforms, how do atomics interact with barriers, etc.

atomics.h has most of that, including documentation of which barrier
semantics the individual operations have. One change I can see as being
a good idea is to move/transform the definition of acquire/release
barriers from s_lock.h to README.barrier.

It doesn't document which operations are efficient on which platform,
but I wouldn't know what to say about that topic? The current set of
operations should be fast on pretty much all platforms.

What information would you like to see? An example of how to use
atomics?
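
Something along these lines, maybe? This is a minimal sketch only - the
counter and function names are made up - showing the API the same way
the new regression test exercises it:

#include "postgres.h"
#include "port/atomics.h"

/* hypothetical counter; in real use this would live in shared memory */
static pg_atomic_uint32 nrequests;

void
counter_init(void)
{
	pg_atomic_init_u32(&nrequests, 0);
}

/* unconditional increment, returns the pre-increment value */
uint32
counter_bump(void)
{
	return pg_atomic_fetch_add_u32(&nrequests, 1);
}

/* increment only while below a cap, implemented as a CAS loop */
bool
counter_bump_capped(uint32 cap)
{
	uint32		old = pg_atomic_read_u32(&nrequests);

	while (old < cap)
	{
		/* on failure the CAS updates 'old' to the current value; retry */
		if (pg_atomic_compare_exchange_u32(&nrequests, &old, old + 1))
			return true;
	}
	return false;
}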

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#115Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#36)
1 attachment(s)
xlc atomics

On Wed, Jun 25, 2014 at 07:14:34PM +0200, Andres Freund wrote:

* gcc, msvc work. acc, xlc, sunpro have blindly written support which
should be relatively easy to fix up.

I tried this on three xlc configurations.

(1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working
required the attached patch. None of my xlc configurations have an <atomic.h>
header, and a web search turned up no evidence of one in connection with xlc
platforms. Did you learn of a configuration needing atomic.h under xlc? The
rest of the changes are hopefully self-explanatory in light of the
documentation cited in generic-xlc.h. (Building on AIX has regressed in other
ways unrelated to atomics; I will write more about that in due course.)

(2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1. This
compiler has a Clang-derived C frontend. It defines __GNUC__ and offers
GCC-style __sync_* atomics. Therefore, PostgreSQL selects generic-gcc.h.
test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types
segfaults at runtime. I have reported this to the vendor. Adding
"pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good
workaround. I could add a comment about that to src/test/regress/sql/lock.sql
for affected folks to see in regression.diffs. To do better, we could make
PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible. Yet
another option is to force use of generic-xlc.h on this compiler.
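
To illustrate the runtime-test idea, the probe would need to actually
execute something like the following rather than merely compile and link
it (a sketch only, ignoring the cross-compilation case):

int
main(void)
{
	static volatile char lock = 0;

	/* a freshly initialized lock must not appear set */
	if (__sync_lock_test_and_set(&lock, 1) != 0)
		return 1;
	__sync_lock_release(&lock);
	return 0;				/* exit status 0 means char TAS works here */
}

On the compiler above, the __sync_lock_test_and_set() call crashes, so a
runtime test would let configure conclude "no" (i.e. set
pgac_cv_gcc_sync_char_tas=no) without any manual options.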

(3) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
modifying atomics.h to force use of generic-xlc.h. While not a supported
PostgreSQL configuration, I felt this would make an interesting data point.
It worked fine after applying the patch developed for the AIX configuration.

Thanks,
nm

Attachments:

generic-xlc-v1.patchtext/plain; charset=us-asciiDownload
diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
index 1c743f2..0ad9168 100644
--- a/src/include/port/atomics/generic-xlc.h
+++ b/src/include/port/atomics/generic-xlc.h
@@ -18,8 +18,6 @@
 
 #if defined(HAVE_ATOMICS)
 
-#include <atomic.h>
-
 #define PG_HAVE_ATOMIC_U32_SUPPORT
 typedef struct pg_atomic_uint32
 {
@@ -48,9 +46,6 @@ static inline bool
 pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 									uint32 *expected, uint32 newval)
 {
-	bool	ret;
-	uint64	current;
-
 	/*
 	 * xlc's documentation tells us:
 	 * "If __compare_and_swap is used as a locking primitive, insert a call to
@@ -62,18 +57,15 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 	 * XXX: __compare_and_swap is defined to take signed parameters, but that
 	 * shouldn't matter since we don't perform any arithmetic operations.
 	 */
-	current = (uint32)__compare_and_swap((volatile int*)ptr->value,
-										 (int)*expected, (int)newval);
-	ret = current == *expected;
-	*expected = current;
-	return ret;
+	return __compare_and_swap((volatile int*)&ptr->value,
+							  (int *)expected, (int)newval);
 }
 
 #define PG_HAVE_ATOMIC_FETCH_ADD_U32
 static inline uint32
 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 {
-	return __fetch_and_add(&ptr->value, add_);
+	return __fetch_and_add((volatile int *)&ptr->value, add_);
 }
 
 #ifdef PG_HAVE_ATOMIC_U64_SUPPORT
@@ -83,23 +75,17 @@ static inline bool
 pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 									uint64 *expected, uint64 newval)
 {
-	bool	ret;
-	uint64	current;
-
 	__isync();
 
-	current = (uint64)__compare_and_swaplp((volatile long*)ptr->value,
-										   (long)*expected, (long)newval);
-	ret = current == *expected;
-	*expected = current;
-	return ret;
+	return __compare_and_swaplp((volatile long*)&ptr->value,
+								(long *)expected, (long)newval);;
 }
 
 #define PG_HAVE_ATOMIC_FETCH_ADD_U64
 static inline uint64
 pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
 {
-	return __fetch_and_addlp(&ptr->value, add_);
+	return __fetch_and_addlp((volatile long *)&ptr->value, add_);
 }
 
 #endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
#116Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#115)
Re: xlc atomics

On 2015-07-04 18:40:41 -0400, Noah Misch wrote:

(1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working
required the attached patch. None of my xlc configurations have an <atomic.h>
header, and a web search turned up no evidence of one in connection with xlc
platforms. Did you learn of a configuration needing atomic.h under xlc? The
rest of the changes are hopefully self-explanatory in light of the
documentation cited in generic-xlc.h. (Building on AIX has regressed in other
ways unrelated to atomics; I will write more about that in due course.)

I wrote this entirely blindly, as evidenced here by the changes you
needed. At the time somebody had promised to soon put up an aix animal,
but that apparently didn't work out.

Will you apply? Having the ability to test changes seems to put you in a
much better spot than me.

(2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1. This
compiler has a Clang-derived C frontend. It defines __GNUC__ and offers
GCC-style __sync_* atomics.

Phew. I don't see much reason to try to support this. Why would that be
interesting?

Therefore, PostgreSQL selects generic-gcc.h.
test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types
segfaults at runtime. I have reported this to the vendor. Adding
"pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good
workaround. I could add a comment about that to src/test/regress/sql/lock.sql
for affected folks to see in regression.diffs. To do better, we could make
PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible. Yet
another option is to force use of generic-xlc.h on this compiler.

It seems fair enough to simply add another test and include
generic-xlc.h in that case. If it's indeed xlc, why not?

(3) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
modifying atomics.h to force use of generic-xlc.h. While not a supported
PostgreSQL configuration, I felt this would make an interesting data point.
It worked fine after applying the patch developed for the AIX configuration.

Cool.

Greetings,

Andres Freund


#117Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#116)
1 attachment(s)
Re: xlc atomics

On Sun, Jul 05, 2015 at 12:54:43AM +0200, Andres Freund wrote:

On 2015-07-04 18:40:41 -0400, Noah Misch wrote:

(1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working
required the attached patch.

Will you apply? Having the ability to test change seems to put you in a
much better spot then me.

I will.

(2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1. This
compiler has a Clang-derived C frontend. It defines __GNUC__ and offers
GCC-style __sync_* atomics.

Phew. I don't see much reason to try to support this. Why would that be
interesting?

Therefore, PostgreSQL selects generic-gcc.h.
test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types
segfaults at runtime. I have reported this to the vendor. Adding
"pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good
workaround. I could add a comment about that to src/test/regress/sql/lock.sql
for affected folks to see in regression.diffs. To do better, we could make
PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible. Yet
another option is to force use of generic-xlc.h on this compiler.

It seems fair enough to simply add another test and include
generic-xlc.h in that case. If it's indeed xlc, why not?

Works for me. I'll do that as attached.

Attachments:

prefer-xlc-v1.patchtext/plain; charset=us-asciiDownload
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 1a4c748..97a0064 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -81,8 +81,15 @@
  * * pg_atomic_compare_exchange_u32(), pg_atomic_fetch_add_u32()
  * using compiler intrinsics are a good idea.
  */
+/*
+ * Given a gcc-compatible xlc compiler, prefer the xlc implementation.  The
+ * ppc64le "IBM XL C/C++ for Linux, V13.1.2" implements both interfaces, but
+ * __sync_lock_test_and_set() of one-byte types elicits SIGSEGV.
+ */
+#if defined(__IBMC__) || defined(__IBMCPP__)
+#include "port/atomics/generic-xlc.h"
 /* gcc or compatible, including clang and icc */
-#if defined(__GNUC__) || defined(__INTEL_COMPILER)
+#elif defined(__GNUC__) || defined(__INTEL_COMPILER)
 #include "port/atomics/generic-gcc.h"
 #elif defined(WIN32_ONLY_COMPILER)
 #include "port/atomics/generic-msvc.h"
@@ -90,8 +97,6 @@
 #include "port/atomics/generic-acc.h"
 #elif defined(__SUNPRO_C) && !defined(__GNUC__)
 #include "port/atomics/generic-sunpro.h"
-#elif (defined(__IBMC__) || defined(__IBMCPP__)) && !defined(__GNUC__)
-#include "port/atomics/generic-xlc.h"
 #else
 /*
  * Unsupported compiler, we'll likely use slower fallbacks... At least
#118Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#117)
1 attachment(s)
Re: xlc atomics

On Sat, Jul 04, 2015 at 08:07:49PM -0400, Noah Misch wrote:

On Sun, Jul 05, 2015 at 12:54:43AM +0200, Andres Freund wrote:

On 2015-07-04 18:40:41 -0400, Noah Misch wrote:

(1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)". Getting it working
required the attached patch.

Will you apply? Having the ability to test change seems to put you in a
much better spot then me.

I will.

That commit (0d32d2e) permitted things to compile and usually pass tests, but
I missed the synchronization bug. Since 2015-10-01, the buildfarm has seen
sixteen duplicate-catalog-OID failures. The compiler has always been xlC, and
the branch has been HEAD or 9.5. Examples:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-15%2004%3A12%3A49
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-08%2001%3A20%3A16
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2015-12-11%2005%3A25%3A54

These suggested OidGenLock wasn't doing its job. I've seen similar symptoms
around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
5765-J08)" for ppc64le. The problem is generic-xlc.h
pg_atomic_compare_exchange_u32_impl() issuing __isync() before
__compare_and_swap(). __isync() shall follow __compare_and_swap(); see our
own s_lock.h, its references, and other projects' usage:

https://github.com/boostorg/smart_ptr/blob/boost-1.60.0/include/boost/smart_ptr/detail/sp_counted_base_vacpp_ppc.hpp#L55
http://redmine.gromacs.org/projects/gromacs/repository/entry/src/external/thread_mpi/include/thread_mpi/atomic/xlc_ppc.h?rev=v5.1.2#L112
https://www-01.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/language_ref/asm_example.html

This patch's test case would have failed about half the time under today's
generic-xlc.h. Fast machines run it in about 1s. A test that detects the bug
99% of the time would run far longer, hence this compromise.

Archaeology enthusiasts may enjoy the bug's changing presentation over time.
LWLockAcquire() exposed it after commit ab5194e redesigned lwlock.c in terms
of atomics. Assert builds were unaffected for commits in [ab5194e, bbfd7ed);
atomics.h asserts fattened code enough to mask the bug. However, removing the
AssertPointerAlignment() from pg_atomic_fetch_or_u32() would have sufficed to
surface it in assert builds. All builds demonstrated the bug once bbfd7ed
introduced xlC pg_attribute_noreturn(), which elicits a simpler instruction
sequence around the ExceptionalCondition() call embedded in each Assert.

nm

Attachments:

isync-xlc-v1.patchtext/plain; charset=us-asciiDownload
diff --git a/src/bin/pgbench/.gitignore b/src/bin/pgbench/.gitignore
index aae819e..983a3cd 100644
--- a/src/bin/pgbench/.gitignore
+++ b/src/bin/pgbench/.gitignore
@@ -1,3 +1,4 @@
 /exprparse.c
 /exprscan.c
 /pgbench
+/tmp_check/
diff --git a/src/bin/pgbench/Makefile b/src/bin/pgbench/Makefile
index 18fdf58..e9b1b74 100644
--- a/src/bin/pgbench/Makefile
+++ b/src/bin/pgbench/Makefile
@@ -40,3 +40,9 @@ clean distclean:
 
 maintainer-clean: distclean
 	rm -f exprparse.c exprscan.c
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
diff --git a/src/bin/pgbench/t/001_pgbench.pl b/src/bin/pgbench/t/001_pgbench.pl
new file mode 100644
index 0000000..88e83ab
--- /dev/null
+++ b/src/bin/pgbench/t/001_pgbench.pl
@@ -0,0 +1,25 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+# Test concurrent insertion into table with UNIQUE oid column.  DDL expects
+# GetNewOidWithIndex() to successfully avoid violating uniqueness for indexes
+# like pg_class_oid_index and pg_proc_oid_index.  This indirectly exercises
+# LWLock and spinlock concurrency.  This test makes a 5-MiB table.
+my $node = get_new_node('main');
+$node->init;
+$node->start;
+$node->psql('postgres',
+	    'CREATE UNLOGGED TABLE oid_tbl () WITH OIDS; '
+	  . 'ALTER TABLE oid_tbl ADD UNIQUE (oid);');
+my $script = $node->basedir . '/pgbench_script';
+append_to_file($script,
+	'INSERT INTO oid_tbl SELECT FROM generate_series(1,1000);');
+$node->command_like(
+	[   qw(pgbench --no-vacuum --client=5 --protocol=prepared
+		  --transactions=25 --file), $script ],
+	qr{processed: 125/125},
+	'concurrent OID generation');
diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
index 1e8a11e..f24e3af 100644
--- a/src/include/port/atomics/generic-xlc.h
+++ b/src/include/port/atomics/generic-xlc.h
@@ -41,18 +41,22 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 									uint32 *expected, uint32 newval)
 {
 	/*
+	 * XXX: __compare_and_swap is defined to take signed parameters, but that
+	 * shouldn't matter since we don't perform any arithmetic operations.
+	 */
+	bool		ret = __compare_and_swap((volatile int*)&ptr->value,
+										 (int *)expected, (int)newval);
+
+	/*
 	 * xlc's documentation tells us:
 	 * "If __compare_and_swap is used as a locking primitive, insert a call to
 	 * the __isync built-in function at the start of any critical sections."
+	 *
+	 * The critical section begins immediately after __compare_and_swap().
 	 */
 	__isync();
 
-	/*
-	 * XXX: __compare_and_swap is defined to take signed parameters, but that
-	 * shouldn't matter since we don't perform any arithmetic operations.
-	 */
-	return __compare_and_swap((volatile int*)&ptr->value,
-							  (int *)expected, (int)newval);
+	return ret;
 }
 
 #define PG_HAVE_ATOMIC_FETCH_ADD_U32
@@ -69,10 +73,12 @@ static inline bool
 pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 									uint64 *expected, uint64 newval)
 {
+	bool		ret = __compare_and_swaplp((volatile long*)&ptr->value,
+										   (long *)expected, (long)newval);
+
 	__isync();
 
-	return __compare_and_swaplp((volatile long*)&ptr->value,
-								(long *)expected, (long)newval);;
+	return ret;
 }
 
 #define PG_HAVE_ATOMIC_FETCH_ADD_U64
diff --git a/src/tools/msvc/clean.bat b/src/tools/msvc/clean.bat
index feb0fe5..d21692f 100755
--- a/src/tools/msvc/clean.bat
+++ b/src/tools/msvc/clean.bat
@@ -94,6 +94,7 @@ if exist src\bin\pg_basebackup\tmp_check rd /s /q src\bin\pg_basebackup\tmp_chec
 if exist src\bin\pg_config\tmp_check rd /s /q src\bin\pg_config\tmp_check
 if exist src\bin\pg_ctl\tmp_check rd /s /q src\bin\pg_ctl\tmp_check
 if exist src\bin\pg_rewind\tmp_check rd /s /q src\bin\pg_rewind\tmp_check
+if exist src\bin\pgbench\tmp_check rd /s /q src\bin\pgbench\tmp_check
 if exist src\bin\scripts\tmp_check rd /s /q src\bin\scripts\tmp_check
 
 REM Clean up datafiles built with contrib
#119Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#118)
Re: xlc atomics

On 2016-02-15 12:11:29 -0500, Noah Misch wrote:

These suggested OidGenLock wasn't doing its job. I've seen similar symptoms
around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
5765-J08)" for ppc64le. The problem is generic-xlc.h
pg_atomic_compare_exchange_u32_impl() issuing __isync() before
__compare_and_swap(). __isync() shall follow __compare_and_swap(); see our
own s_lock.h, its references, and other projects' usage:

Ugh. You're right! It's about not moving code before the stwcx...

Do you want to add the new test (no objection, curious), or is that more
for testing?

Greetings,

Andres


#120Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#119)
Re: xlc atomics

On Mon, Feb 15, 2016 at 06:33:42PM +0100, Andres Freund wrote:

On 2016-02-15 12:11:29 -0500, Noah Misch wrote:

These suggested OidGenLock wasn't doing its job. I've seen similar symptoms
around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
5765-J08)" for ppc64le. The problem is generic-xlc.h
pg_atomic_compare_exchange_u32_impl() issuing __isync() before
__compare_and_swap(). __isync() shall follow __compare_and_swap(); see our
own s_lock.h, its references, and other projects' usage:

Ugh. You're right! It's about not moving code before the stwcx...

Do you want to add the new test (no objection, curious), or is that more
for testing?

The patch's test would join PostgreSQL indefinitely.


#121Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#118)
Re: xlc atomics

Noah Misch <noah@leadboat.com> writes:

That commit (0d32d2e) permitted things to compile and usually pass tests, but
I missed the synchronization bug. Since 2015-10-01, the buildfarm has seen
sixteen duplicate-catalog-OID failures.

I'd been wondering about those ...

These suggested OidGenLock wasn't doing its job. I've seen similar symptoms
around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
5765-J08)" for ppc64le. The problem is generic-xlc.h
pg_atomic_compare_exchange_u32_impl() issuing __isync() before
__compare_and_swap(). __isync() shall follow __compare_and_swap(); see our
own s_lock.h, its references, and other projects' usage:

Nice catch!

This patch's test case would have failed about half the time under today's
generic-xlc.h. Fast machines run it in about 1s. A test that detects the bug
99% of the time would run far longer, hence this compromise.

Sounds like a reasonable compromise to me, although I wonder about the
value of it if we stick it into pgbench's TAP tests. How many of the
slower buildfarm members are running the TAP tests? Certainly mine are
not.

regards, tom lane


#122Craig Ringer
craig@2ndquadrant.com
In reply to: Andres Freund (#116)
Re: xlc atomics

On 5 July 2015 at 06:54, Andres Freund <andres@anarazel.de> wrote:

I wrote this entirely blindly, as evidenced here by the changes you
needed. At the time somebody had promised to soon put up an aix animal,
but that apparently didn't work out.

Similarly, I asked IBM for XL/C for a POWER7 Linux VM to improve test coverage
on the buildfarm but was told to just use gcc. It was with regard to
testing hardware decfloat support, and the folks I spoke to said that gcc's
support was derived from that in xl/c anyway.

They're not desperately keen to get their products out there for testing.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#123Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#121)
1 attachment(s)
Re: xlc atomics

On Mon, Feb 15, 2016 at 02:02:41PM -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

That commit (0d32d2e) permitted things to compile and usually pass tests, but
I missed the synchronization bug. Since 2015-10-01, the buildfarm has seen
sixteen duplicate-catalog-OID failures.

I'd been wondering about those ...

These suggested OidGenLock wasn't doing its job. I've seen similar symptoms
around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
5765-J08)" for ppc64le. The problem is generic-xlc.h
pg_atomic_compare_exchange_u32_impl() issuing __isync() before
__compare_and_swap(). __isync() shall follow __compare_and_swap(); see our
own s_lock.h, its references, and other projects' usage:

Nice catch!

This patch's test case would have failed about half the time under today's
generic-xlc.h. Fast machines run it in about 1s. A test that detects the bug
99% of the time would run far longer, hence this compromise.

Sounds like a reasonable compromise to me, although I wonder about the
value of it if we stick it into pgbench's TAP tests. How many of the
slower buildfarm members are running the TAP tests? Certainly mine are
not.

I missed a second synchronization bug in generic-xlc.h, but the pgbench test
suite caught it promptly after commit 008608b. Buildfarm members mandrill and
hornet failed[1] or deadlocked within two runs. The bug is that the
pg_atomic_compare_exchange_*() specifications grant "full barrier semantics",
but generic-xlc.h provided only the semantics of an acquire barrier. Commit
008608b exposed that older problem; its LWLockWaitListUnlock() relies on
pg_atomic_compare_exchange_u32() (via pg_atomic_fetch_and_u32()) for release
barrier semantics.

The fix is to issue "sync" before each compare-and-swap, in addition to the
"isync" after it. This is consistent with the corresponding GCC atomics. The
pgbench test suite survived dozens of runs so patched.

[1]: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2016-04-11%2003%3A14%3A13

Attachments:

full-barrier-xlc-v1.patchtext/plain; charset=us-asciiDownload
diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
index f24e3af..f4fd2f3 100644
--- a/src/include/port/atomics/generic-xlc.h
+++ b/src/include/port/atomics/generic-xlc.h
@@ -40,12 +40,22 @@ static inline bool
 pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 									uint32 *expected, uint32 newval)
 {
+	bool		ret;
+
+	/*
+	 * atomics.h specifies sequential consistency ("full barrier semantics")
+	 * for this interface.  Since "lwsync" provides acquire/release
+	 * consistency only, do not use it here.  GCC atomics observe the same
+	 * restriction; see its rs6000_pre_atomic_barrier().
+	 */
+	__asm__ __volatile__ ("	sync \n" ::: "memory");
+
 	/*
 	 * XXX: __compare_and_swap is defined to take signed parameters, but that
 	 * shouldn't matter since we don't perform any arithmetic operations.
 	 */
-	bool		ret = __compare_and_swap((volatile int*)&ptr->value,
-										 (int *)expected, (int)newval);
+	ret = __compare_and_swap((volatile int*)&ptr->value,
+							 (int *)expected, (int)newval);
 
 	/*
 	 * xlc's documentation tells us:
@@ -63,6 +73,10 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 static inline uint32
 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 {
+	/*
+	 * __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
+	 * providing sequential consistency.  This is undocumented.
+	 */
 	return __fetch_and_add((volatile int *)&ptr->value, add_);
 }
 
@@ -73,8 +87,12 @@ static inline bool
 pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 									uint64 *expected, uint64 newval)
 {
-	bool		ret = __compare_and_swaplp((volatile long*)&ptr->value,
-										   (long *)expected, (long)newval);
+	bool		ret;
+
+	__asm__ __volatile__ ("	sync \n" ::: "memory");
+
+	ret = __compare_and_swaplp((volatile long*)&ptr->value,
+							   (long *)expected, (long)newval);
 
 	__isync();
 
#124Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#123)
Re: xlc atomics

On 2016-04-23 21:54:07 -0400, Noah Misch wrote:

I missed a second synchronization bug in generic-xlc.h, but the pgbench test
suite caught it promptly after commit 008608b.

Nice catch.

The bug is that the pg_atomic_compare_exchange_*() specifications
grant "full barrier semantics", but generic-xlc.h provided only the
semantics of an acquire barrier.

I find the docs at
http://www-01.ibm.com/support/knowledgecenter/SSGH3R_13.1.2/com.ibm.xlcpp131.aix.doc/compiler_ref/bifs_sync_atomic.html
to be awfully silent about that matter. I guess you just looked at
the assembler code? It's nice that one can figure out stuff like that
from an architecture manual, but it's sad that the docs for the
intrinsics are silent about it.

I do wonder if I shouldn't have let xlc fall by the wayside and made
it use the fallback implementation. Given its popularity (or rather
lack thereof) that'd probably have been a better choice. But that ship
has sailed.

Apart from not having verified what the rs6000_pre_atomic_barrier() and
__fetch_and_add() internals emit around sync/isync, the patch looks
good. We've so far not referred to "sequential consistency", but given
its rise in popularity, I don't think it hurts.

Strictly speaking generic-xlc.h shouldn't contain inline asm, but given
xlc is only there for POWER...

Greetings,

Andres Freund


#125Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#124)
Re: xlc atomics

On Mon, Apr 25, 2016 at 11:52:04AM -0700, Andres Freund wrote:

On 2016-04-23 21:54:07 -0400, Noah Misch wrote:

The bug is that the pg_atomic_compare_exchange_*() specifications
grant "full barrier semantics", but generic-xlc.h provided only the
semantics of an acquire barrier.

I find the docs at
http://www-01.ibm.com/support/knowledgecenter/SSGH3R_13.1.2/com.ibm.xlcpp131.aix.doc/compiler_ref/bifs_sync_atomic.html
to be be awfully silent about that matter. I guess you just looked at
the assembler code? It's nice that one can figure out stuff like that
from an architecture manual, but it's sad that the docs for the
intrinsics is silent about that matter.

Right. The intrinsics provide little value as abstractions if one has to check the
generated code to deduce how to use them. I was tempted to replace the
intrinsics calls with inline asm. At least these functions are unlikely to
change over time.

Except that I didn't verify the rs6000_pre_atomic_barrier() and
__fetch_and_add() internals about emitted sync/isync, the patch looks
good. We've so far not referred to "sequential consistency", but given
it's rise in popularity, I don't think it hurts.

Thanks; committed.
