memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

Started by Robert Haas · over 14 years ago · 26 messages
#1 Robert Haas
robertmhaas@gmail.com
1 attachment(s)

On Mon, Aug 8, 2011 at 7:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases.  The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

And, here's an updated version, with some of the more obviously broken
things fixed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

barrier-v2.patch (application/octet-stream)
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
new file mode 100644
index 0000000..c7042d4
--- /dev/null
+++ b/src/include/storage/barrier.h
@@ -0,0 +1,170 @@
+/*-------------------------------------------------------------------------
+ *
+ * barrier.h
+ *	  Memory barrier operations.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/atomic.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ATOMIC_H
+#define ATOMIC_H
+
+#include "storage/s_lock.h"
+
+extern slock_t	dummy_spinlock;
+
+/*
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ * 
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier.  Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores.  Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier.  In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ */
+
+#if defined(DISABLE_ATOMICS)
+
+/*
+ * Fall through to the spinlock-based implementation.
+ */
+
+#elif defined(__INTEL_COMPILER)
+
+/*
+ * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
+ */
+#define pg_memory_barrier()		_mm_mfence()
+#define pg_compiler_barrier()	__memory_barrier()
+
+#elif defined(__GNUC__)
+
+/* This works on any architecture, since it's only talking to GCC itself. */
+#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
+
+#if defined(__i386__) || defined(__x86_64__)		/* 32 or 64 bit x86 */
+
+/*
+ * x86 and x86_64 do not allow loads to be reordered with other loads, or
+ * stores to be reordered with other stores, but a load can be performed
+ * before a subsequent store.
+ *
+ * "lock; addl" has worked for longer than "mfence".
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered.  In those cases we'd need
+ * the read and write barriers to be lfence and sfence.  But since we don't
+ * do those things, a compiler barrier should be enough.
+ */
+#define pg_memory_barrier()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#define pg_read_barrier()		pg_compiler_barrier()
+#define pg_write_barrier()		pg_compiler_barrier()
+
+#elif defined(__ia64__) || defined(__ia64)
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")
+
+#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#elif defined(__alpha) || defined(__alpha__)      /* Alpha */
+
+/*
+ * Unlike all other known architectures, Alpha allows dependent reads to be
+ * reordered, but we don't currently find it necessary to provide a conditional
+ * read barrier to cover that case.  We might need to add that later.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
+
+#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory
+ * barrier out of this compiler built-in.  But we prefer to rely on our
+ * own definitions where possible, and use this only as a fallback.
+ */
+#define pg_memory_barrier()		__sync_synchronize()
+
+#endif
+
+#elif defined(__ia64__) || defined(__ia64)
+
+#define pg_compiler_barrier()	_Asm_sched_fence()
+#define pg_memory_barrier()		_Asm_mf()
+
+#elif defined(WIN32_ONLY_COMPILER)
+
+/*
+ * Should work on MSVC and Borland.  Won't work on Visual Studio 2003, but
+ * we don't support that anyway.
+ */
+#include <intrin.h>
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier()	_ReadWriteBarrier()
+#define pg_memory_barrier()		MemoryBarrier()
+
+#endif
+
+/*
+ * If we have no memory barrier implementation for this architecture, we
+ * fall back to acquiring and releasing a spinlock.  This might, in turn,
+ * fall back to the semaphore-based spinlock implementation, which will be
+ * amazingly slow.
+ *
+ * It's not self-evident that every possible legal implementation of a
+ * spinlock acquire-and-release would be equivalent to a full memory barrier.
+ * For example, I'm not sure that Itanium's acq and rel add up to a full
+ * fence.  But all of our actual implementations seem OK in this regard.
+ */
+#if !defined(pg_memory_barrier)
+#define pg_memory_barrier() \
+	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
+#endif
+
+/*
+ * If any of the weaker barrier operations are undefined, we upgrade them
+ * to full memory barriers.
+ */
+#if !defined(pg_compiler_barrier)
+#define pg_compiler_barrier()		pg_memory_barrier()
+#endif
+#if !defined(pg_read_barrier)
+#define pg_read_barrier()			pg_memory_barrier()
+#endif
+#if !defined(pg_write_barrier)
+#define pg_write_barrier()			pg_memory_barrier()
+#endif
+
+#endif   /* ATOMIC_H */
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#1)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On 14.09.2011 23:29, Robert Haas wrote:

On Mon, Aug 8, 2011 at 7:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases. The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

And, here's an updated version, with some of the more obviously broken
things fixed.

s/atomic/barrier/

+/*
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.

This seems like a strange way to explain the problem. I would suggest
structuring those paragraphs along the lines of:

"
A PostgreSQL memory barrier guarantees that any loads/stores before the
barrier are completely finished and visible to other CPUs, before the
loads/stores after the barrier are performed.

That involves two things: 1. We must stop the compiler from rearranging
loads/stores across the barrier. 2. We must stop the CPU from reordering
the loads/stores across the barrier.
"

Do we have any use for compiler barriers that are not also memory
barriers? If not, I would suggest not exposing the pg_compiler_barrier()
macro, but keep that as an implementation detail in the implementations
of pg_memory_barrier().

Some examples on the correct usage of these barriers would be nice, too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, Sep 15, 2011 at 11:57 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

s/atomic/barrier/

Oops.

+/*
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.

This seems like a strange way to explain the problem. I would suggest
structuring those paragraphs along the lines of:

"
A PostgreSQL memory barrier guarantees that any loads/stores before the
barrier are completely finished and visible to other CPUs, before the
loads/stores after the barrier are performed.

That involves two things: 1. We must stop the compiler from rearranging
loads/stores across the barrier. 2. We must stop the CPU from reordering the
loads/stores across the barrier.
"

That doesn't seem much different from what I wrote?

One thing to keep in mind about whatever language we use here is that
memory barrier instructions need not (and often do not) compel the CPU
to "completely finish" anything before doing the next thing.
Execution is heavily pipelined, and on a sequence like STORE/WRITE
BARRIER/STORE the CPU is perfectly well entitled to begin the second
store before the first one is done. It just can't let the second
STORE become visible to other CPUs before the first one.
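
In code terms (a sketch with the patch's macros; "shared" is a
hypothetical pointer into shared memory):

    shared->a = 1;          /* first store */
    pg_write_barrier();
    shared->b = 1;          /* may begin executing before the first store
                               completes, but no other CPU may observe it
                               before observing shared->a = 1 */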

Do we have any use for compiler barriers that are not also memory barriers?
If not, I would suggest not exposing the pg_compiler_barrier() macro, but
keep that as an implementation detail in the implementations of
pg_memory_barrier().

I think there might be some use for a pure compiler barrier, but I'm
not sure yet. It's probably not worth having a "fallback"
implementation involving a spinlock, though, because odds are good
that any code that is performance-critical enough to be attempting to
safely use a compiler barrier can't survive having a spinlock
acquire-and-release shoved in there, so any such code should probably
be #ifdef'd.
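
A plausible shape for such a call site, as a hypothetical sketch (the
variable names are invented; the point is the #ifdef guard around the
cheap path):

    #ifdef pg_compiler_barrier
        queue_head = new_head;
        pg_compiler_barrier();  /* compiler must not reorder these stores */
        work_pending = true;    /* e.g., examined by a signal handler */
    #else
        /* no cheap compiler barrier available; guard both stores with a
           spinlock, or skip the optimization entirely */
    #endif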

Some examples on the correct usage of these barriers would be nice, too.

That's a big topic. A trivial example is that if you write:

a[*x] = i;
++*x;

...where a and x are pointers into shared memory, another backend
might loop over a from 0 to *x - 1 and accidentally read off the end of
the array.
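
To make that concrete, here is a minimal sketch using the primitives
from the attached patch (hypothetical names; same layout as the
fragment above):

    /* writer: publish the array entry before bumping the counter */
    a[*x] = i;
    pg_write_barrier();
    ++*x;

    /* reader: fetch the counter before examining the array */
    n = *x;
    pg_read_barrier();
    for (j = 0; j < n; ++j)
        /* do something with a[j] */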

But even a full explanation of that case seems like almost too much
for the comment of a header file, and there are certainly far more
complex cases. I think anyone who is using these primitives is going
to have to do some independent reading...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#3)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

Robert Haas <robertmhaas@gmail.com> wrote:

But even a full explanation of that case seems like almost too
much for the comment of a header file, and there are certainly far
more complex cases. I think anyone who is using these primitives
is going to have to do some independent reading...

Maybe a URL or two in the header comments, pointing to relevant
papers for the techniques used? After all, years from now someone
might be familiar with other techniques from newer papers and wonder
what the techniques in the code are based on.

-Kevin

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#4)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Wed, Sep 21, 2011 at 2:48 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

But even a full explanation of that case seems like almost too
much for the comment of a header file, and there are certainly far
more complex cases.  I think anyone who is using these primitives
is going to have to do some independent reading...

Maybe a URL or two in the header comments, pointing to relevant
papers for the techniques used?  After all, years from now someone
might be familiar with other techniques from newer papers and wonder
what the techniques in the code are based on.

If there are any academic papers on this topic, I haven't found them.
Mostly, I've found lots of articles written by people who were coding
for the Linux kernel, and a lot of those articles are extremely
Linux-specific (you could use the smp_rmb() macro here, but it's better
to instead use this RCU-related macro, because it'll do it for you,
blah blah). I'm happy to link to any sources anyone thinks we ought
to link to, but I've mostly had to piece this together bit by bit from
blog posts and (sometimes buggy) examples. It hasn't been a
particularly thrilling exercise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#5)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 21, 2011 at 2:48 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

But even a full explanation of that case seems like almost too
much for the comment of a header file, and there are certainly
far more complex cases. I think anyone who is using these
primitives is going to have to do some independent reading...

Maybe a URL or two in the header comments, pointing to relevant
papers for the techniques used? After all, years from now
someone might be familiar with other techniques from newer papers
and wonder what the techniques in the code are based on.

If there are any academic papers on this topic, I haven't found
them. Mostly, I've found lots of articles written by people who
were coding for the Linux kernel, and a lot of those articles are
extremely Linux-specific (you could use the smp_rmb() macro here,
but it's better to instead use this RCU-related macro, because
it'll do it for you, blah blah). I'm happy to link to any sources
anyone thinks we ought to link to, but I've mostly had to piece
this together bit by bit from blog posts and (sometimes buggy)
examples. It hasn't been a particularly thrilling exercise.

Well, if it really is that hard to piece together the relevant
techniques, it seems cruel not to check in the results of your
efforts to work it out somewhere. Perhaps a README file?

On the other hand, a search turned up these two papers (which I
haven't yet read, but will):

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.2397&rep=rep1&type=pdf

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153.6657&rep=rep1&type=pdf

On a quick scan, they both look promising in themselves, and
reference a lot of other promising-sounding papers.

-Kevin

#7 Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Robert Haas (#1)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Wed, Sep 14, 2011 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 8, 2011 at 7:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases. The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

And, here's an updated version, with some of the more obviously broken
things fixed.

You declare the dummy_spinlock variable as extern and use it, but it is not
defined anywhere. Wouldn't that be a linker error?

--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Gurjeet Singh (#7)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Wed, Sep 21, 2011 at 4:19 PM, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:

On Wed, Sep 14, 2011 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 8, 2011 at 7:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases.  The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

And, here's an updated version, with some of the more obviously broken
things fixed.

You declare the dummy_spinlock variable as extern and use it, but it is not
defined anywhere. Wouldn't that be a linker error?

Yeah, we need to add that somewhere, maybe s_lock.c

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#6)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Wed, Sep 21, 2011 at 3:39 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 21, 2011 at 2:48 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

But even a full explanation of that case seems like almost too
much for the comment of a header file, and there are certainly
far more complex cases.  I think anyone who is using these
primitives is going to have to do some independent reading...

Maybe a URL or two in the header comments, pointing to relevant
papers for the techniques used?  After all, years from now
someone might be familiar with other techniques from newer papers
and wonder what the techniques in the code are based on.

If there are any academic papers on this topic, I haven't found
them.  Mostly, I've found lots of articles written by people who
were coding for the Linux kernel, and a lot of those articles are
extremely Linux-specific (you could use the smp_rmb() macro here,
but it's better to instead use this RCU-related macro, because
it'll do it for you, blah blah).  I'm happy to link to any sources
anyone thinks we ought to link to, but I've mostly had to piece
this together bit by bit from blog posts and (sometimes buggy)
examples.  It hasn't been a particularly thrilling exercise.

Well, if it really is that hard to piece together the relevant
techniques, it seems cruel not to check in the results of your
efforts to work it out somewhere.  Perhaps a README file?

I don't know if techniques is the right word. I mean, there are three
questions that you might want to answer here:

1. I have a new architecture and I want barrier.h to support it. What
do I need to do?
2. What is the behavior of the various constructs provided by barrier.h?
3. Why would I want that behavior and how can I use it to do cool stuff?

I intended the comment in that file to be enough to answer questions
#1 and #2. What you and Heikki are asking about is really #3, and
that seems to me to be setting the bar awfully high. I mean, lwlock.c
explains what a lightweight lock does, but it doesn't explain all of
the ways that a lightweight lock can be used to solve performance
problems, nor should it. That would be recapitulating the
documentation that is hopefully present everywhere that LWLocks are
used as well as speculating about future applications. It just
doesn't seem sensible to me to try to enumerate all the ways that a
fundamental primitive can potentially be used down the line.

What I found hard about memory barriers is basically this: I didn't
understand that the CPU is allowed to perform operations out of order.
And I couldn't understand what the consequences of that fact were. I
sort of understood - but hadn't really internalized - the idea that
execution is highly pipelined, so the single moment at which an
execution is performed is not well defined. Before I really got my
head around it, I had to read the explanations of what a memory
barrier actually does over and over again. I probably read ten
different explanations saying the same thing in different words about
twenty times a piece, and slowly the light dawned. I did my best to
explain that in the existing comment; I'm happy to expand the comment
if people have suggestions for what to put in there; but basically I
think this is a hard concept and if you haven't done this stuff before
it's going to take a while to get up to speed.

On the other hand, a search turned up these two papers (which I
haven't yet read, but will):

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.2397&rep=rep1&type=pdf

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153.6657&rep=rep1&type=pdf

On a quick scan, they both look promising in themselves, and
reference a lot of other promising-sounding papers.

I'm not sure these are much help for learning how to program with
memory barriers, but if somebody really wants them (or something else)
included, I'm not going to fight too hard. I don't expect this to be
perfect the first time through; I am just trying to get a basic
infrastructure in place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#9)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

Robert Haas <robertmhaas@gmail.com> wrote:

there are three questions that you might want to answer here:

1. I have a new architecture and I want barrier.h to support it.
What do I need to do?
2. What is the behavior of the various constructs provided by
barrier.h?
3. Why would I want that behavior and how can I use it to do cool
stuff?

I intended the comment in that file to be enough to answer
questions #1 and #2. What you and Heikki are asking about is
really #3, and that seems to me to be setting the bar awfully
high.

OK, put that way, I mostly agree. A general overview of the known
usage patterns in a header or README file still doesn't seem out of
line to me. With LW locks, for example, I've only seen two patterns
used in PostgreSQL:

(1) Get a shared lock for reads and an exclusive lock for writes, or
(2) get a shared lock to read or write your own data, but an
exclusive lock to read anyone else's data.

In both cases, there must be a defined order of lock acquisition to
avoid deadlock, with a strong recommendation that locks be released
in the reverse order. Mentioning that much doesn't preclude other
uses of LW locks, but might help people new to the code. That's the
level of summary *I* would like to see included.
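
For pattern (1), the sketch might be no more than this (using the
existing LWLock API; "MyLock" stands in for whichever lock protects
the structure):

    /* reader */
    LWLockAcquire(MyLock, LW_SHARED);
    /* ... read the shared structure ... */
    LWLockRelease(MyLock);

    /* writer */
    LWLockAcquire(MyLock, LW_EXCLUSIVE);
    /* ... modify the shared structure ... */
    LWLockRelease(MyLock);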

What I found hard about memory barriers is basically this: I
didn't understand that the CPU is allowed to perform operations
out of order. And I couldn't understand what the consequences of
that fact were. I sort of understood - but hadn't really
internalized - the idea that execution is highly pipelined, so the
single moment at which an instruction is executed is not well
defined. Before I really got my head around it, I had to read the
explanations of what a memory barrier actually does over and over
again. I probably read ten different explanations saying the same
thing in different words about twenty times a piece, and slowly
the light dawned. I did my best to explain that in the existing
comment; I'm happy to expand the comment if people have
suggestions for what to put in there; but basically I think this
is a hard concept and if you haven't done this stuff before it's
going to take a while to get up to speed.

That's the sort of thing where it would be helpful to provide one or
two URLs for cogent explanations of this. Even if it takes repeated
readings and meditations on the explanations for it to sink in, this
is worth it. (For SSI I had to read the paper many times, and then
go read several referenced papers, before I really had my head
around it, and I've had others say the same thing. But having a
link to the material gives someone a chance to *do* that.)

-Kevin

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#10)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Wed, Sep 21, 2011 at 5:07 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

That's the sort of thing where it would be helpful to provide one or
two URLs for cogent explanations of this.  Even if it takes repeated
readings and meditations on the explanations for it to sink in, this
is worth it.  (For SSI I had to read the paper many times, and then
go read several referenced papers, before I really had my head
around it, and I've had others say the same thing.  But having a
link to the material gives someone a chance to *do* that.)

Hmm....

<looks around the Internet some more>

These might be a good place to start, although the first one is
somewhat Linux-kernel specific:

http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

There's also a reasonably cogent explanation in the Linux kernel
itself, in Documentation/memory-barriers.txt

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Peter Geoghegan
peter@2ndquadrant.com
In reply to: Robert Haas (#1)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On 14 September 2011 21:29, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 8, 2011 at 7:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases.  The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

And, here's an updated version, with some of the more obviously broken
things fixed.

As I've already pointed out, the comment "Won't work on Visual Studio
2003" is not accurate:

http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.71).aspx

Besides, if it's not supported, why bother mentioning it?

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#12)
1 attachment(s)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, Sep 22, 2011 at 10:53 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

As I've already pointed out, the comment "Won't work on Visual Studio
2003" is not accurate:

http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.71).aspx

Besides, if it's not supported, why bother mentioning it?

I mentioned it because it took me a long time to figure out whether it
was supported or not, and I finally came to the conclusion that it
wasn't. I stand corrected, though; I've now removed that reference.
Sorry for not jumping on it sooner; it was still vaguely on my list of
things to fix at some point, but it hadn't percolated to the top yet.

The attached version (hopefully) fixes various other things people
have complained about as well, including:

- Heikki's complaint about sometimes writing atomic instead of barrier
(which was leftovers), and
- Gurjeet's complaint that I hadn't defined the variable anywhere

I've also added a lengthy README file to the patch that attempts to
explain how barriers should be used in PostgreSQL coding. It's
certainly not a comprehensive treatment of the topic, but hopefully
it's enough to get people oriented. I've attempted to tailor it a bit
to PostgreSQL conventions, like talking about shared memory vs.
backend-private memory instead of assuming (as a number of other
discussions of this topic do) a thread model. It also includes some
advice about when memory barriers shouldn't be used or won't work, and
some references for further reading.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

barrier-v3.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/README.barrier b/src/backend/storage/lmgr/README.barrier
new file mode 100644
index 0000000..a4981d6
--- /dev/null
+++ b/src/backend/storage/lmgr/README.barrier
@@ -0,0 +1,199 @@
+Memory Barriers
+===============
+
+Modern CPUs make extensive use of pipe-lining and out-of-order execution,
+meaning that the CPU is often executing more than one instruction at a
+time, and not necessarily in the order that the source code would suggest.
+Furthermore, even before the CPU gets a chance to reorder operations, the
+compiler may (and often does) reorganize the code for greater efficiency,
+particularly at higher optimization levels.  Optimizing compilers and
+out-of-order execution are both critical for good performance, but they
+can lead to surprising results when multiple processes access the same
+memory space.
+
+Example
+=======
+
+Suppose x is a pointer to a structure stored in shared memory, and that the
+entire structure has been initialized to zero bytes.  One backend executes
+the following code fragment:
+
+    x->foo = 1;
+    x->bar = 1;
+
+Meanwhile, at approximately the same time, another backend executes this
+code fragment:
+
+    bar = x->bar;
+    foo = x->foo;
+
+The second backend might end up with foo = 1 and bar = 1 (if it executes
+both statements after the first backend), or with foo = 0 and bar = 0 (if
+it executes both statements before the first backend), or with foo = 1 and
+bar = 0 (if the first backend executes the first statement, the second
+backend executes both statements, and then the first backend executes the
+second statement).
+
+Surprisingly, however, the second backend could also end up with foo = 0
+and bar = 1.  The compiler might swap the order of the two stores performed
+by the first backend, or the two loads performed by the second backend.
+Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
+or Itanium) the CPU might choose to execute either the loads or the stores
+out of order.  This surprising result can lead to bugs.
+
+A common pattern where this actually does result in a bug is when adding items
+onto a queue.  The writer does this:
+
+    q->items[q->num_items] = new_item;
+    ++q->num_items;
+
+The reader does this:
+
+    num_items = q->num_items;
+    for (i = 0; i < num_items; ++i)
+        /* do something with q->items[i] */
+
+This code turns out to be unsafe, because the writer might increment
+q->num_items before it finishes storing the new item into the appropriate slot.
+More subtly, the reader might prefetch the contents of the q->items array
+before reading q->num_items.  Thus, there's still a bug here *even if the
+writer does everything in the order we expect*.  We need the writer to update
+the array before bumping the item counter, and the reader to examine the item
+counter before examining the array.
+
+Note that these types of highly counterintuitive bugs can *only* occur when
+multiple processes are interacting with the same memory segment.  A given
+process always perceives its *own* writes to memory in program order.
+
+Avoiding Memory Ordering Bugs
+=============================
+
+The simplest (and often best) way to avoid memory ordering bugs is to
+protect the data structures involved with an lwlock.  For more details, see
+src/backend/storage/lmgr/README.  For instance, in the above example, the
+writer could acquire an lwlock in exclusive mode before appending to the
+queue, and each reader could acquire the same lock in shared mode before
+reading it.  If the data structure is not heavily trafficked, this solution is
+generally entirely adequate.
+
+However, in some cases, it is desirable to avoid the overhead of acquiring
+and releasing locks.  In this case, memory barriers may be used to ensure
+that the apparent order of execution is as the programmer desires.   In
+PostgreSQL backend code, the pg_memory_barrier() macro may be used to achieve
+this result.  In the example above, we can prevent the reader from seeing a
+garbage value by having the writer do this:
+
+    q->items[q->num_items] = new_item;
+    pg_memory_barrier();
+    ++q->num_items;
+
+And by having the reader do this:
+
+    num_items = q->num_items;
+    pg_memory_barrier();
+    for (i = 0; i < num_items; ++i)
+        /* do something with q->items[i] */
+
+The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
+the code in such a way as to allow the memory accesses to occur out of order
+and (2) generate any code (often, inline assembly) that is needed to prevent
+the CPU from executing the memory accesses out of order.  Specifically, the
+barrier prevents loads and stores written after the barrier from being
+performed before the barrier, and visca-versa.
+
+Although this code will work, it is needlessly inefficient.  On systems with
+strong memory ordering (such as x86), the CPU never reorders loads with other
+laods, nor stores with other stores.  It can, however, allow a load to be
+performed before a subsequent store.  To avoid emitting unnecessary memory
+instructions, we provide two additional primitives: pg_read_barrier(), and
+pg_write_barrier().  When a memory barrier is being used to separate two
+loads, use pg_read_barrier(); when it is separating two stores, use
+pg_write_barrier(); when it is separating a load and a store (in either
+order), use pg_memory_barrier().  pg_memory_barrier() can always substitute
+for either a read or a write barrier, but is typically more expensive, and
+therefore should be used only when needed.
+
+With these guidelines in mind, the writer can do this:
+
+    q->items[q->num_items] = new_item;
+    pg_write_barrier();
+    ++q->num_items;
+
+And the reader can do this:
+
+    num_items = q->num_items;
+    pg_read_barrier();
+    for (i = 0; i < num_items; ++i)
+        /* do something with q->items[i] */
+
+On machines with strong memory ordering, these weaker barriers will simply
+prevent compiler rearrangement, without emitting any actual machine code.
+On machines with weak memory ordering, they will prevent compiler
+reordering and also emit whatever hardware barrier may be required.  Even
+on machines with weak memory ordering, a read or write barrier may be able
+to use a less expensive instruction than a full barrier.
+
+Weaknesses of Memory Barriers
+=============================
+
+While memory barriers are a powerful tool, and much cheaper than locks, they
+are also much less capable than locks.  Here are some of the problems.
+
+1. Concurrent writers are unsafe.  In the above example of a queue, using
+memory barriers doesn't make it safe for two processes to add items to the
+same queue at the same time.  If more than one process can write to the queue,
+a spinlock or lwlock must be used to synchronize access. The readers can
+perhaps proceed without any lock, but the writers may not.
+
+Even very simple write operations often require additional synchronization.
+For example, it's not safe for multiple writers to simultaneously execute
+this code (supposing x is a pointer into shared memory):
+
+    x->foo++;
+
+Although this may compile down to a single machine-language instruction,
+the CPU will execute that instruction by reading the current value of foo,
+adding one to it, and then storing the result back to the original address.
+If two CPUs try to do this simultaneously, both may do their reads before
+either one does their writes.  Eventually we might be able to use an atomic
+fetch-and-add instruction for this specific case on architectures that support
+it, but we can't rely on that being available everywhere, and we currently
+have no support for it at all.  Use a lock.
+
+2. Eight-byte loads and stores aren't necessarily atomic.  We assume in
+various places in the source code that an aligned four-byte load or store is
+atomic, and that other processes therefore won't see a half-set value.
+Sadly, the same can't be said for eight-byte values: on some platforms, an
+aligned eight-byte load or store will generate two four-byte operations.  If
+you need an atomic eight-byte read or write, you must make it atomic with a
+lock.
+
+3. No ordering guarantees.  While memory barriers ensure that any given
+process performs loads and stores to shared memory in order, they don't
+guarantee synchronization.  In the queue example above, we can use memory
+barriers to be sure that readers won't see garbage, but there's nothing to
+say whether a given reader will run before or after a given writer.  If this
+matters in a given situation, some other mechanism must be used instead of
+or in addition to memory barriers.
+
+4. Barrier proliferation.  Many algorithms that at first seem appealing
+require multiple barriers.  If the number of barriers required is more than
+one or two, you may be better off just using a lock.  Keep in mind that, on
+some platforms, a barrier may be implemented by acquiring and releasing a
+backend-private spinlock.  This may be better than a centralized lock under
+contention, but it may also be slower in the uncontended case.
+
+Further Reading
+===============
+
+Much of the documentation about memory barriers appears to be quite
+Linux-specific.  The following papers may be helpful:
+
+Memory Ordering in Modern Microprocessors, by Paul E. McKenney
+* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
+
+Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney
+* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
+
+The Linux kernel also has some useful documentation on this topic.  Start
+with Documentation/memory-barriers.txt
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 1aa9912..cd1306c 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -20,6 +20,7 @@
 
 #include "storage/s_lock.h"
 
+slock_t  dummy_spinlock;
 
 static int	spins_per_delay = DEFAULT_SPINS_PER_DELAY;
 
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
new file mode 100644
index 0000000..0286817
--- /dev/null
+++ b/src/include/storage/barrier.h
@@ -0,0 +1,171 @@
+/*-------------------------------------------------------------------------
+ *
+ * barrier.h
+ *	  Memory barrier operations.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/barrier.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BARRIER_H
+#define BARRIER_H
+
+#include "storage/s_lock.h"
+
+extern slock_t	dummy_spinlock;
+
+/*
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ * 
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier.  Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores.  Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier.  In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ *
+ * For an introduction to using memory barriers within the PostgreSQL backend,
+ * see src/backend/storage/lmgr/README.barrier
+ */
+
+#if defined(DISABLE_BARRIERS)
+
+/*
+ * Fall through to the spinlock-based implementation.
+ */
+
+#elif defined(__INTEL_COMPILER)
+
+/*
+ * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
+ */
+#define pg_memory_barrier()		_mm_mfence()
+#define pg_compiler_barrier()	__memory_barrier()
+
+#elif defined(__GNUC__)
+
+/* This works on any architecture, since it's only talking to GCC itself. */
+#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
+
+#if defined(__i386__) || defined(__x86_64__)		/* 32 or 64 bit x86 */
+
+/*
+ * x86 and x86_64 do not allow loads to be reordered with other loads, or
+ * stores to be reordered with other stores, but a load can be performed
+ * before a subsequent store.
+ *
+ * "lock; addl" has worked for longer than "mfence".
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered.  In those cases we'd need
+ * the read and write barriers to be lfence and sfence.  But since we don't
+ * do those things, a compiler barrier should be enough.
+ */
+#define pg_memory_barrier()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#define pg_read_barrier()		pg_compiler_barrier()
+#define pg_write_barrier()		pg_compiler_barrier()
+
+#elif defined(__ia64__) || defined(__ia64)
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")
+
+#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#elif defined(__alpha) || defined(__alpha__)      /* Alpha */
+
+/*
+ * Unlike all other known architectures, Alpha allows dependent reads to be
+ * reordered, but we don't currently find it necessary to provide a conditional
+ * read barrier to cover that case.  We might need to add that later.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
+
+#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory
+ * barrier out of this compiler built-in.  But we prefer to rely on our
+ * own definitions where possible, and use this only as a fallback.
+ */
+#define pg_memory_barrier()		__sync_synchronize()
+
+#endif
+
+#elif defined(__ia64__) || defined(__ia64)
+
+#define pg_compiler_barrier()	_Asm_sched_fence()
+#define pg_memory_barrier()		_Asm_mf()
+
+#elif defined(WIN32_ONLY_COMPILER)
+
+/* Should work on both MSVC and Borland. */
+#include <intrin.h>
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier()	_ReadWriteBarrier()
+#define pg_memory_barrier()		MemoryBarrier()
+
+#endif
+
+/*
+ * If we have no memory barrier implementation for this architecture, we
+ * fall back to acquiring and releasing a spinlock.  This might, in turn,
+ * fall back to the semaphore-based spinlock implementation, which will be
+ * amazingly slow.
+ *
+ * It's not self-evident that every possible legal implementation of a
+ * spinlock acquire-and-release would be equivalent to a full memory barrier.
+ * For example, I'm not sure that Itanium's acq and rel add up to a full
+ * fence.  But all of our actual implementations seem OK in this regard.
+ */
+#if !defined(pg_memory_barrier)
+#define pg_memory_barrier() \
+	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
+#endif
+
+/*
+ * If read or write barriers are undefined, we upgrade them to full memory
+ * barriers.
+ *
+ * If a compiler barrier is unavailable, you probably don't want a full
+ * memory barrier instead, so if you have a use case for a compiler barrier,
+ * you'd better use #ifdef.
+ */
+#if !defined(pg_read_barrier)
+#define pg_read_barrier()			pg_memory_barrier()
+#endif
+#if !defined(pg_write_barrier)
+#define pg_write_barrier()			pg_memory_barrier()
+#endif
+
+#endif   /* BARRIER_H */
#14 Thom Brown
thom@linux.com
In reply to: Robert Haas (#13)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On 22 September 2011 16:18, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 22, 2011 at 10:53 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

As I've already pointed out, the comment "Won't work on Visual Studio
2003" is not accurate:

http://msdn.microsoft.com/en-us/library/f20w0x5e(v=vs.71).aspx

Besides, if it's not supported, why bother mentioning it?

I mentioned it because it took me a long time to figure out whether it
was supported or not, and I finally came to the conclusion that it
wasn't.  I stand corrected, though; I've now removed that reference.
Sorry for not jumping on it sooner; it was still vaguely on my list of
things to fix at some point, but it hadn't percolated to the top yet.

The attached version (hopefully) fixes various other things people
have complained about as well, including:

- Heikki's complaint about sometimes writing atomic instead of barrier
(which was leftovers), and
- Gurjeet's complaint that I hadn't defined the variable anywhere

I've also added a lengthy README file to the patch that attempts to
explain how barriers should be used in PostgreSQL coding.  It's
certainly not a comprehensive treatment of the topic, but hopefully
it's enough to get people oriented.  I've attempted to tailor it a bit
to PostgreSQL conventions, like talking about shared memory vs.
backend-private memory instead of assuming (as a number of other
discussions of this topic do) a thread model.  It also includes some
advice about when memory barriers shouldn't be used or won't work, and
some references for further reading.

s/visca-versa/vice-versa/
s/laods/loads/

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Thom Brown (#14)
1 attachment(s)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, Sep 22, 2011 at 11:25 AM, Thom Brown <thom@linux.com> wrote:

s/visca-versa/vice-versa/
s/laods/loads/

Fixed. v4 attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

barrier-v4.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/README.barrier b/src/backend/storage/lmgr/README.barrier
new file mode 100644
index 0000000..f9f3593
--- /dev/null
+++ b/src/backend/storage/lmgr/README.barrier
@@ -0,0 +1,199 @@
+Memory Barriers
+===============
+
+Modern CPUs make extensive use of pipe-lining and out-of-order execution,
+meaning that the CPU is often executing more than one instruction at a
+time, and not necessarily in the order that the source code would suggest.
+Furthermore, even before the CPU gets a chance to reorder operations, the
+compiler may (and often does) reorganize the code for greater efficiency,
+particularly at higher optimization levels.  Optimizing compilers and
+out-of-order execution are both critical for good performance, but they
+can lead to surprising results when multiple processes access the same
+memory space.
+
+Example
+=======
+
+Suppose x is a pointer to a structure stored in shared memory, and that the
+entire structure has been initialized to zero bytes.  One backend executes
+the following code fragment:
+
+    x->foo = 1;
+    x->bar = 1;
+
+Meanwhile, at approximately the same time, another backend executes this
+code fragment:
+
+    bar = x->bar;
+    foo = x->foo;
+
+The second backend might end up with foo = 1 and bar = 1 (if it executes
+both statements after the first backend), or with foo = 0 and bar = 0 (if
+it executes both statements before the first backend), or with foo = 1 and
+bar = 0 (if the first backend executes the first statement, the second
+backend executes both statements, and then the first backend executes the
+second statement).
+
+Surprisingly, however, the second backend could also end up with foo = 0
+and bar = 1.  The compiler might swap the order of the two stores performed
+by the first backend, or the two loads performed by the second backend.
+Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
+or Itanium) the CPU might choose to execute either the loads or the stores
+out of order.  This surprising result can lead to bugs.
+
+A common pattern where this actually does result in a bug is when adding items
+onto a queue.  The writer does this:
+
+    q->items[q->num_items] = new_item;
+    ++q->num_items;
+
+The reader does this:
+
+    num_items = q->num_items;
+    for (i = 0; i < num_items; ++i)
+        /* do something with q->items[i] */
+
+This code turns out to be unsafe, because the writer might increment
+q->num_items before it finishes storing the new item into the appropriate slot.
+More subtly, the reader might prefetch the contents of the q->items array
+before reading q->num_items.  Thus, there's still a bug here *even if the
+writer does everything in the order we expect*.  We need the writer to update
+the array before bumping the item counter, and the reader to examine the item
+counter before examining the array.
+
+Note that these types of highly counterintuitive bugs can *only* occur when
+multiple processes are interacting with the same memory segment.  A given
+process always perceives its *own* writes to memory in program order.
+
+Avoiding Memory Ordering Bugs
+=============================
+
+The simplest (and often best) way to avoid memory ordering bugs is to
+protect the data structures involved with an lwlock.  For more details, see
+src/backend/storage/lmgr/README.  For instance, in the above example, the
+writer could acquire an lwlock in exclusive mode before appending to the
+queue, and each reader could acquire the same lock in shared mode before
+reading it.  If the data structure is not heavily trafficked, this solution is
+generally entirely adequate.
+
+However, in some cases, it is desirable to avoid the overhead of acquiring
+and releasing locks.  In this case, memory barriers may be used to ensure
+that the apparent order of execution is as the programmer desires.   In
+PostgreSQL backend code, the pg_memory_barrier() macro may be used to achieve
+this result.  In the example above, we can prevent the reader from seeing a
+garbage value by having the writer do this:
+
+    q->items[q->num_items] = new_item;
+    pg_memory_barrier();
+    ++q->num_items;
+
+And by having the reader do this:
+
+    num_items = q->num_items;
+    pg_memory_barrier();
+    for (i = 0; i < num_items; ++i)
+        /* do something with q->items[i] */
+
+The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
+the code in such a way as to allow the memory accesses to occur out of order
+and (2) generate any code (often, inline assembly) that is needed to prevent
+the CPU from executing the memory accesses out of order.  Specifically, the
+barrier prevents loads and stores written after the barrier from being
+performed before the barrier, and vice-versa.
+
+Although this code will work, it is needlessly inefficient.  On systems with
+strong memory ordering (such as x86), the CPU never reorders loads with other
+loads, nor stores with other stores.  It can, however, allow a load to be
+performed before a subsequent store.  To avoid emitting unnecessary memory
+instructions, we provide two additional primitives: pg_read_barrier(), and
+pg_write_barrier().  When a memory barrier is being used to separate two
+loads, use pg_read_barrier(); when it is separating two stores, use
+pg_write_barrier(); when it is separating a load and a store (in either
+order), use pg_memory_barrier().  pg_memory_barrier() can always substitute
+for either a read or a write barrier, but is typically more expensive, and
+therefore should be used only when needed.
+
+With these guidelines in mind, the writer can do this:
+
+    q->items[q->num_items] = new_item;
+    pg_write_barrier();
+    ++q->num_items;
+
+And the reader can do this:
+
+    num_items = q->num_items;
+    pg_read_barrier();
+    for (i = 0; i < num_items; ++i)
+        /* do something with q->items[i] */
+
+On machines with strong memory ordering, these weaker barriers will simply
+prevent compiler rearrangement, without emitting any actual machine code.
+On machines with weak memory ordering, they will prevent compiler
+reordering and also emit whatever hardware barrier may be required.  Even
+on machines with weak memory ordering, a read or write barrier may be able
+to use a less expensive instruction than a full barrier.
+
+Weaknesses of Memory Barriers
+=============================
+
+While memory barriers are a powerful tool, and much cheaper than locks, they
+are also much less capable than locks.  Here are some of the problems.
+
+1. Concurrent writers are unsafe.  In the above example of a queue, using
+memory barriers doesn't make it safe for two processes to add items to the
+same queue at the same time.  If more than one process can write to the queue,
+a spinlock or lwlock must be used to synchronize access. The readers can
+perhaps proceed without any lock, but the writers may not.
+
+Even very simple write operations often require additional synchronization.
+For example, it's not safe for multiple writers to simultaneously execute
+this code (supposing x is a pointer into shared memory):
+
+    x->foo++;
+
+Although this may compile down to a single machine-language instruction,
+the CPU will execute that instruction by reading the current value of foo,
+adding one to it, and then storing the result back to the original address.
+If two CPUs try to do this simultaneously, both may do their reads before
+either one does their writes.  Eventually we might be able to use an atomic
+fetch-and-add instruction for this specific case on architectures that support
+it, but we can't rely on that being available everywhere, and we currently
+have no support for it at all.  Use a lock.
+
+2. Eight-byte loads and stores aren't necessarily atomic.  We assume in
+various places in the source code that an aligned four-byte load or store is
+atomic, and that other processes therefore won't see a half-set value.
+Sadly, the same can't be said for eight-byte values: on some platforms, an
+aligned eight-byte load or store will generate two four-byte operations.  If
+you need an atomic eight-byte read or write, you must protect it with a
+lock.
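+
+For example, to read an eight-byte value consistently, a sketch (again
+assuming a spinlock named "mutex" guards the field):
+
+    SpinLockAcquire(&shared->mutex);
+    value = shared->eight_byte_counter;	/* writers must take the same lock */
+    SpinLockRelease(&shared->mutex);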
+
+3. No ordering guarantees.  While memory barriers ensure that any given
+process performs loads and stores to shared memory in order, they don't
+guarantee synchronization.  In the queue example above, we can use memory
+barriers to be sure that readers won't see garbage, but there's nothing to
+say whether a given reader will run before or after a given writer.  If this
+matters in a given situation, some other mechanism must be used instead of
+or in addition to memory barriers.
+
+4. Barrier proliferation.  Many algorithms that at first seem appealing
+require multiple barriers.  If the number of barriers required is more than
+one or two, you may be better off just using a lock.  Keep in mind that, on
+some platforms, a barrier may be implemented by acquiring and releasing a
+backend-private spinlock.  This may be better than a centralized lock under
+contention, but it may also be slower in the uncontended case.
+
+Further Reading
+===============
+
+Much of the documentation about memory barriers appears to be quite
+Linux-specific.  The following papers may be helpful:
+
+Memory Ordering in Modern Microprocessors, by Paul E. McKenney
+* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
+
+Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney
+* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
+
+The Linux kernel also has some useful documentation on this topic.  Start
+with Documentation/memory-barriers.txt
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 1aa9912..cd1306c 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -20,6 +20,7 @@
 
 #include "storage/s_lock.h"
 
+slock_t  dummy_spinlock;
 
 static int	spins_per_delay = DEFAULT_SPINS_PER_DELAY;
 
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
new file mode 100644
index 0000000..0286817
--- /dev/null
+++ b/src/include/storage/barrier.h
@@ -0,0 +1,177 @@
+/*-------------------------------------------------------------------------
+ *
+ * barrier.h
+ *	  Memory barrier operations.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/barrier.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef BARRIER_H
+#define BARRIER_H
+
+#include "storage/s_lock.h"
+
+extern slock_t	dummy_spinlock;
+
+/*
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier.  However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier.  Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ * 
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier.  Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores.  Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier.  In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ *
+ * For an introduction to using memory barriers within the PostgreSQL backend,
+ * see src/backend/storage/lmgr/README.barrier
+ */
+
+#if defined(DISABLE_BARRIERS)
+
+/*
+ * Fall through to the spinlock-based implementation.
+ */
+
+#elif defined(__INTEL_COMPILER)
+
+/*
+ * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
+ */
+#define pg_memory_barrier()		_mm_mfence()
+#define pg_compiler_barrier()	__memory_barrier()
+
+#elif defined(__GNUC__)
+
+/* This works on any architecture, since it's only talking to GCC itself. */
+#define pg_compiler_barrier()	__asm__ __volatile__("" : : : "memory")
+
+#if defined(__i386__) || defined(__x86_64__)		/* 32 or 64 bit x86 */
+
+/*
+ * x86 and x86_64 do not allow loads to be reordered with other loads, or
+ * stores to be reordered with other stores, but a load can be performed
+ * before a subsequent store.
+ *
+ * "lock; addl" has worked for longer than "mfence".
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered.  In those cases we'd need
+ * the read and write barriers to be lfence and sfence.  But since we don't
+ * do those things, a compiler barrier should be enough.
+ */
+#ifdef __x86_64__
+/* On x86_64, 0(%%esp) would truncate the stack pointer to 32 bits. */
+#define pg_memory_barrier()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")
+#else
+#define pg_memory_barrier()		\
+	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory")
+#endif
+#define pg_read_barrier()		pg_compiler_barrier()
+#define pg_write_barrier()		pg_compiler_barrier()
+
+#elif defined(__ia64__) || defined(__ia64)
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mf" : : : "memory")
+
+#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("sync" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#elif defined(__alpha) || defined(__alpha__)      /* Alpha */
+
+/*
+ * Unlike all other known architectures, Alpha allows dependent reads to be
+ * reordered, but we don't currently find it necessary to provide a conditional
+ * read barrier to cover that case.  We might need to add that later.
+ */
+#define pg_memory_barrier()		__asm__ __volatile__ ("mb" : : : "memory")
+#define pg_read_barrier()		__asm__ __volatile__ ("rmb" : : : "memory")
+#define pg_write_barrier()		__asm__ __volatile__ ("wmb" : : : "memory")
+
+#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory
+ * barrier out of this compiler built-in.  But we prefer to rely on our
+ * own definitions where possible, and use this only as a fallback.
+ */
+#define pg_memory_barrier()		__sync_synchronize()
+
+#endif
+
+#elif defined(__ia64__) || defined(__ia64)
+
+#define pg_compiler_barrier()	_Asm_sched_fence()
+#define pg_memory_barrier()		_Asm_mf()
+
+#elif defined(WIN32_ONLY_COMPILER)
+
+/* Should work on both MSVC and Borland. */
+#include <intrin.h>
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier()	_ReadWriteBarrier()
+#define pg_memory_barrier()		MemoryBarrier()
+
+#endif
+
+/*
+ * If we have no memory barrier implementation for this architecture, we
+ * fall back to acquiring and releasing a spinlock.  This might, in turn,
+ * fall back to the semaphore-based spinlock implementation, which will be
+ * amazingly slow.
+ *
+ * It's not self-evident that every possible legal implementation of a
+ * spinlock acquire-and-release would be equivalent to a full memory barrier.
+ * For example, I'm not sure that Itanium's acq and rel add up to a full
+ * fence.  But all of our actual implementations seem OK in this regard.
+ */
+#if !defined(pg_memory_barrier)
+#define pg_memory_barrier() \
+	do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
+#endif
+
+/*
+ * If read or write barriers are undefined, we upgrade them to full memory
+ * barriers.
+ *
+ * If a compiler barrier is unavailable, you probably don't want a full
+ * memory barrier instead, so if you have a use case for a compiler barrier,
+ * you'd better use #ifdef.
+ */
+#if !defined(pg_read_barrier)
+#define pg_read_barrier()			pg_memory_barrier()
+#endif
+#if !defined(pg_write_barrier)
+#define pg_write_barrier()			pg_memory_barrier()
+#endif
+
+#endif   /* BARRIER_H */
#16Alvaro Herrera
alvherre@commandprompt.com
In reply to: Robert Haas (#13)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

Excerpts from Robert Haas's message of jue sep 22 12:18:47 -0300 2011:

I've also added a lengthy README file to the patch that attempts to
explain how barriers should be used in PostgreSQL coding.

Very enlightening, thanks. Note a typo "laods".

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#17Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#13)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

Robert Haas <robertmhaas@gmail.com> wrote:

I've also added a lengthy README file to the patch that attempts
to explain how barriers should be used in PostgreSQL coding. It's
certainly not a comprehensive treatment of the topic, but
hopefully it's enough to get people oriented. I've attempted to
tailor it a bit to PostgreSQL conventions, like talking about
shared memory vs. backend-private memory instead of assuming (as a
number of other discussions of this topic do) a thread model. It
also includes some advice about when memory barriers shouldn't be
used or won't work, and some references for further reading.

Thanks, that seems like it's at the right level of detail to me.

-Kevin

#18Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#15)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, 2011-09-22 at 11:31 -0400, Robert Haas wrote:

On Thu, Sep 22, 2011 at 11:25 AM, Thom Brown <thom@linux.com> wrote:

s/visca-versa/vice-versa/
s/laods/loads/

Fixed. v4 attached.

Can you please explain the "more subtly" part below?

+A common pattern where this actually does result in a bug is when adding
+items onto a queue.  The writer does this:
+
+    q->items[q->num_items] = new_item;
+    ++q->num_items;
+
+The reader does this:
+
+    num_items = q->num_items;
+    for (i = 0; i < num_items; ++i)
+        /* do something with q->items[i] */
+
+This code turns out to be unsafe, because the writer might increment
+q->num_items before it finishes storing the new item into the appropriate
+slot.  More subtly, the reader might prefetch the contents of the q->items
+array before reading q->num_items.

How would the reader prefetch the contents of the items array, without
knowing how big it is?

Regards,
Jeff Davis

#19Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#18)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, Sep 22, 2011 at 5:45 PM, Jeff Davis <pgsql@j-davis.com> wrote:

+This code turns out to be unsafe, because the writer might increment
+q->num_items before it finishes storing the new item into the appropriate
+slot.  More subtly, the reader might prefetch the contents of the q->items
+array before reading q->num_items.

How would the reader prefetch the contents of the items array, without
knowing how big it is?

By guessing or (I think) just by having a stale value left over in
some CPU cache. It's pretty mind-bending, but it's for real.

I didn't, in either the implementation or the documentation, go much
into the difference between dependency barriers and general read
barriers. We might need to do that at some point, but for a first
version I don't think it's necessary. But since you asked... as I
understand it, unless you're running on Alpha, you actually don't need
a barrier here at all, because all currently-used CPUs other than
alpha "respect data dependencies", which means that if q->num_items is
used to compute an address to be read from memory, the CPU will ensure
that the read of that address is performed after the read of the value
used to compute the address. At least that's my understanding. But
Alpha won't.
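
To make that concrete, a data-dependent read looks something like this
(hypothetical names, just to show the pattern):

    node *p = shared->head;    /* load a pointer from shared memory */
    int   v = p->value;        /* address derived from p: ordered
                                * everywhere except Alpha */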

So we could try to further distinguish between read barriers where a
data dependency is present and read barriers where no data dependency
is present, and the latter type could be a no-op on all CPUs other
than Alpha. Or we could even jettison support for Alpha entirely, if
we think it's hopelessly obsolete, and omit
read-barriers-with-dependency altogether. I think that would be a bad
idea, though. First, it's not impossible that some future CPU could
have behavior similar to Alpha, and that seems substantially more
likely given that the Linux kernel, which seems to be the gold
standard in this area, still supports Alpha. If
we don't record places where a dependency barrier would be needed and
then need to go find them later, that will be a lot more work, and a
lot more error-prone. Second, there's a natural pairing between read
barriers and write barriers. Generally, for every algorithm, each
write barrier on the write side should be matched by a read barrier on
the read side. So putting them all in will make it easier to verify
code correctness. Now, if we find down the line that some of those
read barriers are hurting our performance on, say, Itanium, or
PowerPC, then we can certainly consider distinguishing further. But
for round one I'm voting for not worrying about it. I think it's
going to be a lot more important to put our energy into (1) adding
barrier implementations for any platforms that aren't included in this
initial patch that we want to support, (2) making sure that all of our
implementations actually work, and (3) making sure that the algorithms
that use them are correct.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#19)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, 2011-09-22 at 19:12 -0400, Robert Haas wrote:

But since you asked... as I
understand it, unless you're running on Alpha, you actually don't need
a barrier here at all, because all currently-used CPUs other than
alpha "respect data dependencies", which means that if q->num_items is
used to compute an address to be read from memory, the CPU will ensure
that the read of that address is performed after the read of the value
used to compute the address. At least that's my understanding. But
Alpha won't.

I'm still trying to figure out how it's even possible to read an address
that's not computed yet. Something sounds strange about that...

I think it might have more to do with branch prediction or something
else. In your example, the address is not computed from q->num_items
directly, it's computed using "i". But that branch being followed is
dependent on a comparison with q->num_items. Maybe that's the dependency
that's not respected?

Regards,
Jeff Davis

#21Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#20)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, Sep 22, 2011 at 7:46 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2011-09-22 at 19:12 -0400, Robert Haas wrote:

But since you asked... as I
understand it, unless you're running on Alpha, you actually don't need
a barrier here at all, because all currently-used CPUs other than
alpha "respect data dependencies", which means that if q->num_items is
used to compute an address to be read from memory, the CPU will ensure
that the read of that address is performed after the read of the value
used to compute the address.  At least that's my understanding.  But
Alpha won't.

I'm still trying to figure out how it's even possible to read an address
that's not computed yet. Something sounds strange about that...

That's because it's strange. You might have a look at
http://www.linuxjournal.com/article/8212

Basically, it seems like on Alpha, the CPU is allowed to do pretty
much anything short of entirely fabricating the value that gets
returned.

I think it might have more to do with branch prediction or something
else. In your example, the address is not computed from q->num_items
directly, it's computed using "i". But that branch being followed is
dependent on a comparison with q->num_items. Maybe that's the dependency
that's not respected?

You might be right. I can't swear I understand exactly what goes
wrong there; in fact I'm not 100% sure that you don't need a
read-barrier on things less crazy than Alpha. I speculate that the
problem is something like this: q->num_items is in one cache line and
the elements of q->items are in some other cache line, and the CPU sees
that you're about to use both of those, so it pulls both cache lines in
from memory. But because one cache bank is busier than the other, you
get q->items first. And between the time you get the cache line
containing q->items and the time you get the cache line containing
q->num_items, someone inserts an item into the queue, and now you're
hosed, because you have the old array contents with the new array
length.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Greg Stark
stark@mit.edu
In reply to: Jeff Davis (#18)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, Sep 22, 2011 at 10:45 PM, Jeff Davis <pgsql@j-davis.com> wrote:

+    for (i = 0; i < num_items; ++i)
+        /* do something with q->items[i] */
+
+This code turns out to be unsafe, because the writer might increment
+q->num_items before it finishes storing the new item into the appropriate
+slot.  More subtly, the reader might prefetch the contents of the q->items
+array before reading q->num_items.

How would the reader prefetch the contents of the items array, without
knowing how big it is?

I don't think this is as mysterious as it sounds. Imagine the compiler
unrolled the loop so that it does something like

fetch num_items into register
if register > 0 fetch q[0], process it
if register > 1 fetch q[1], process it
...

Then the CPU could easily do branch prediction on the ifs and fetch
q[0] while the fetch of num_items was still in progress. If it turns
out that num_items is 0, it would invalidate the operations initiated
after the branch prediction; but if it sees 1 (i.e. after the increment
in the other process), the prediction was correct, so it doesn't.

So you have two memory fetches which I guess I still imagine have to
be initiated in the right order but they're both in flight at the same
time. I have no idea how the memory controller works and I could
easily imagine either one grabbing the value before the other.

I'm not even clear how other processors can reasonably avoid this. It
must be fantastically expensive to keep track of which branch
predictions depended on which registers and which memory fetches the
value of those registers depended on. And then it would have to
somehow inform the memory controller which older memory fetches this
new memory fetch depends on, and have it ensure that the fetches
complete in the right order?

--
greg

#23Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#15)
Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Thu, Sep 22, 2011 at 11:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 22, 2011 at 11:25 AM, Thom Brown <thom@linux.com> wrote:

s/visca-versa/vice-versa/
s/laods/loads/

Fixed.  v4 attached.

Since it seems like people are fairly happy with this now, I've gone
ahead and committed this version. I suspect there are bugs, but I
don't think we're going to get much further until we actually try to
use this for something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#24Martijn van Oosterhout
kleptog@svana.org
In reply to: Greg Stark (#22)
Re: Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Fri, Sep 23, 2011 at 04:22:09PM +0100, Greg Stark wrote:

So you have two memory fetches which I guess I still imagine have to
be initiated in the right order but they're both in flight at the same
time. I have no idea how the memory controller works and I could
easily imagine either one grabbing the value before the other.

That's easy. If one is in cache and the other isn't then the results
will come back out of order.

I'm not even clear how other processors can reasonably avoid this. It
must be fantastically expensive to keep track of which branch
predictions depended on which registers and which memory fetches the
value of those registers depended on. And then it would have to
somehow inform the memory controller of those old memory fetches that
this new memory fetch is dependent on and have it ensure that the
fetches are read the right order?

I think memory accesses are also fantastically expensive, so it's worth
some effort to optimise that.

I found the Linux kernel document on this topic quite readable. I think
the main lesson here is that processors track data dependencies (other
than the Alpha, apparently), but not control dependencies. So in the
example, the value of i is dependent on num_items, but not via any
calculation. That control dependencies are not tracked makes some
sense, since branches depend on the flag bits, and just about any
calculation changes the flag bits, but most of the time these changes
are not used.

It's also not a question of knowing the address either, since the
first load, if any, will be *q->items, irrespective of the precise
value of num_items. This address may be calculated long in advance.
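
To illustrate the difference with a hypothetical snippet:

    /* control dependency: the load of q->items[0] can be speculated
     * past the branch, so a read barrier is needed in between */
    if (q->num_items > 0)
        x = q->items[0];

    /* data dependency: the address of the second load comes from the
     * result of the first, so every CPU except Alpha orders them */
    p = q->head;
    x = p->value;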

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#25Robert Haas
robertmhaas@gmail.com
In reply to: Martijn van Oosterhout (#24)
Re: Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Sat, Sep 24, 2011 at 9:45 AM, Martijn van Oosterhout
<kleptog@svana.org> wrote:

I think memory accesses are also fantastically expensive, so it's worth
some effort to optimise that.

This is definitely true.

I found the Linux kernel document on this topic quite readable. I think
the main lesson here is that processors track data dependencies (other
than the Alpha, apparently), but not control dependencies.  So in the
example, the value of i is dependent on num_items, but not via any
calculation.  That control dependencies are not tracked makes some
sense, since branches depend on the flag bits, and just about any
calculation changes the flag bits, but most of the time these changes
are not used.

Oh, that's interesting. So that implies that a read-barrier would be
needed here even on non-Alpha.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#26Martijn van Oosterhout
kleptog@svana.org
In reply to: Robert Haas (#25)
Re: Re: memory barriers (was: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs)

On Sat, Sep 24, 2011 at 12:46:48PM -0400, Robert Haas wrote:

I found the Linux kernel document on this topic quite readable. I think
the main lesson here is that processors track data dependencies (other
than the Alpha, apparently), but not control dependencies.  So in the
example, the value of i is dependent on num_items, but not via any
calculation.  That control dependencies are not tracked makes some
sense, since branches depend on the flag bits, and just about any
calculation changes the flag bits, but most of the time these changes
are not used.

Oh, that's interesting. So that implies that a read-barrier would be
needed here even on non-Alpha.

That is my understanding. At source code level the address being
referenced is dependent on i, but at assembly level it's possible i has
been optimised away altogether.

I think the relevant example is here:
http://www.mjmwired.net/kernel/Documentation/memory-barriers.txt (line 725)

Where A = q->items[0] and B = q->num_items.

There is no data dependency here, so inserting such a barrier won't
help. You need a normal read barrier.

OTOH, if the list already has an entry in it, the problem (probably)
goes away, although with loop unrolling you can't really be sure.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer