Yes, WaitLatch is vulnerable to weak-memory-ordering bugs
I suspected $SUBJECT from the beginning, and I've now put in enough work
to be able to prove it. The attached test program reliably fails within
a few minutes of being started, when run with 8 worker processes on an
8-core PPC machine. It's a pretty simple "token passing ring" protocol,
and at some point one of the processes sees its latch set without seeing
its flag set, so it goes back to sleep and the token stops getting passed.
The original worry about the latch code was that there was some internal
race condition of this sort, but I think we were failing to see the forest
for the trees. The specification of how to use a latch reads like this:
* The correct pattern to wait for an event is:
*
* for (;;)
* {
* ResetLatch();
* if (work to do)
* Do Stuff();
*
* WaitLatch();
* }
*
* It's important to reset the latch *before* checking if there's work to
* do. Otherwise, if someone sets the latch between the check and the
* ResetLatch call, you will miss it and Wait will block.
*
* To wake up the waiter, you must first set a global flag or something
* else that the main loop tests in the "if (work to do)" part, and call
* SetLatch *after* that.
Any protocol of that sort has *obviously* got a race condition between
changes of the latch state and changes of the separate flag state, if run
in a weak-memory-ordering machine.
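To spell that out, here is a stripped-down illustration (not the attached test
program: the latch is reduced to a plain flag, WaitLatch to a busy-wait, and
the names are made up):

#include <stdbool.h>

extern void do_stuff(void);

static volatile bool work_flag = false;     /* the separate "work to do" flag */
static volatile bool latch_is_set = false;  /* stands in for the latch's is_set */

static void
waker(void)
{
    work_flag = true;           /* store A: the flag */

    /*
     * On a weak-memory-ordering machine, store B below can become visible
     * to other CPUs before store A does, unless a write barrier sits here.
     */
    latch_is_set = true;        /* store B: what SetLatch() effectively does */
}

static void
waiter(void)
{
    for (;;)
    {
        latch_is_set = false;   /* ResetLatch() */
        if (work_flag)          /* can still read the stale "false" ...    */
            do_stuff();
        while (!latch_is_set)   /* ... even though the latch looked set,   */
            ;                   /* so we go back to sleep and never wake   */
    }
}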
At least on the hardware I'm testing, it seems that the key requirement
for a failure to be seen is that there's a lot of contention for the cache
line containing the flag, but not for the cache line containing the latch.
This allows the write to the flag to take longer to propagate to main
memory than the write to the latch. I could not get it to fail until
I added the extra shared-memory traffic in the delay loop around line 175.
This probably also explains why it doesn't fail with small numbers of
workers --- you need enough contending CPUs to saturate the memory
hardware. (On Red Hat's test machines, 7 or 8 workers would reliably
fail within a few minutes, but 6 or fewer didn't always.)
I'm not sure whether we need to do anything about this for 9.1; it may be
that none of the existing uses of latches are subject to the kind of
contention that would create a risk. But I'm entirely convinced that we
need to insert some memory-barrier instructions into the latch code before
we expand our uses of latches much more. Since we were already finding
that memory barriers will be needed for performance reasons elsewhere,
I think it's time to bite the bullet and develop that infrastructure.
regards, tom lane
On Sun, Aug 7, 2011 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Any protocol of that sort has *obviously* got a race condition between
changes of the latch state and changes of the separate flag state, if run
in a weak-memory-ordering machine.

At least on the hardware I'm testing, it seems that the key requirement
for a failure to be seen is that there's a lot of contention for the cache
line containing the flag, but not for the cache line containing the latch.
This allows the write to the flag to take longer to propagate to main
memory than the write to the latch.
This is interesting research, especially because it provides sobering
evidence of how easily a real problem could be overlooked in testing.
If it took several minutes for this to fail on an 8-CPU system, it's
easy to imagine a less egregious example sitting in our code-base for
years undetected. (Indeed, I think we probably have a few of those
already...)
On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way. Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch. Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.
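Concretely, the sort of thing I have in mind looks about like this. It's a
hypothetical sketch only, with made-up names and an illustrative WaitLatch
timeout, not code from the tree:

#include "storage/latch.h"
#include "storage/lwlock.h"

/* Hypothetical declarations, just to make the sketch hang together. */
typedef struct
{
    Latch       latch;              /* owned by the worker */
    bool        work_available;     /* protected by FlagLock */
} WorkerShared;

static volatile WorkerShared *shared;   /* lives in shared memory */
static LWLockId FlagLock;               /* whatever lock protects the flag */
extern void do_the_work(void);

/* Waker: update the flag under the lock, release it, then set the latch. */
static void
notify_worker(void)
{
    LWLockAcquire(FlagLock, LW_EXCLUSIVE);
    shared->work_available = true;
    LWLockRelease(FlagLock);            /* acts as the full memory barrier */
    SetLatch(&shared->latch);
}

/* Waiter: reset the latch first, then examine the flag under the lock. */
static void
worker_main_loop(void)
{
    for (;;)
    {
        bool        have_work;

        ResetLatch(&shared->latch);

        LWLockAcquire(FlagLock, LW_EXCLUSIVE);
        have_work = shared->work_available;
        shared->work_available = false;
        LWLockRelease(FlagLock);

        if (have_work)
            do_the_work();

        WaitLatch(&shared->latch, -1L); /* timeout argument illustrative */
    }
}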
I'm not sure whether we need to do anything about this for 9.1; it may be
that none of the existing uses of latches are subject to the kind of
contention that would create a risk. But I'm entirely convinced that we
need to insert some memory-barrier instructions into the latch code before
we expand our uses of latches much more. Since we were already finding
that memory barriers will be needed for performance reasons elsewhere,
I think it's time to bite the bullet and develop that infrastructure.
I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases. The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.
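To show how I'd expect these macros to get used, here is an equally untested
sketch of the flag-then-latch handoff, borrowing the shape of your
token-passing test; the struct and the names are made up:

#include "storage/barrier.h"    /* the attached header */
#include "storage/latch.h"

/* Hypothetical shared slot, one per worker in the ring. */
typedef struct
{
    Latch       latch;
    bool        have_token;
} WorkerSlot;

/*
 * Waker: make the flag store visible before we touch the latch.  A full
 * barrier rather than just pg_write_barrier(), because SetLatch() also
 * *reads* is_set for its quick-exit path.
 */
static void
pass_token(volatile WorkerSlot *slot)
{
    slot->have_token = true;
    pg_memory_barrier();
    SetLatch(&slot->latch);
}

/*
 * Waiter: make the latch reset visible before we load the flag, so a
 * concurrent SetLatch() can't quick-exit against a stale is_set and
 * leave us sleeping with the token in hand.
 */
static void
wait_for_token(volatile WorkerSlot *slot)
{
    for (;;)
    {
        ResetLatch(&slot->latch);
        pg_memory_barrier();
        if (slot->have_token)
            break;
        WaitLatch(&slot->latch, -1L);   /* timeout argument illustrative */
    }
}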
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
barrier-v1.patch (application/octet-stream)
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
new file mode 100644
index 0000000..a373740
--- /dev/null
+++ b/src/include/storage/barrier.h
@@ -0,0 +1,170 @@
+/*-------------------------------------------------------------------------
+ *
+ * barrier.h
+ * Memory barrier operations.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/barrier.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ATOMIC_H
+#define ATOMIC_H
+
+#include "storage/s_lock.h"
+
+extern slock_t dummy_spinlock;
+
+/*
+ * A compiler barrier need not (and preferably should not) emit any actual
+ * machine code, but must act as an optimization fence: the compiler must not
+ * reorder loads or stores to main memory around the barrier. However, the
+ * CPU may still reorder loads or stores at runtime, if the architecture's
+ * memory model permits this.
+ *
+ * A memory barrier must act as a compiler barrier, and in addition must
+ * guarantee that all loads and stores issued prior to the barrier are
+ * completed before any loads or stores issued after the barrier. Unless
+ * loads and stores are totally ordered (which is not the case on most
+ * architectures) this requires issuing some sort of memory fencing
+ * instruction.
+ *
+ * A read barrier must act as a compiler barrier, and in addition must
+ * guarantee that any loads issued prior to the barrier are completed before
+ * any loads issued after the barrier. Similarly, a write barrier acts
+ * as a compiler barrier, and also orders stores. Read and write barriers
+ * are thus weaker than a full memory barrier, but stronger than a compiler
+ * barrier. In practice, on machines with strong memory ordering, read and
+ * write barriers may require nothing more than a compiler barrier.
+ */
+
+#if defined(DISABLE_ATOMICS)
+
+/*
+ * Fall through to the spinlock-based implementation.
+ */
+
+#elif defined(__INTEL_COMPILER)
+
+/*
+ * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
+ */
+#define pg_memory_barrier() _mm_mfence()
+#define pg_compiler_barrier() __memory_barrier()
+
+#elif defined(__GNUC__)
+
+/* This works on any architecture, since it's only talking to GCC itself. */
+#define pg_compiler_barrier() __asm__ __volatile__("" : : "memory")
+
+#if defined(__i386__) || defined(__x86_64__) /* 32 or 64 bit x86 */
+
+/*
+ * x86 and x86_64 do not allow loads to be reordered with other loads, or
+ * stores to be reordered with other stores, but a load can be performed
+ * before a subsequent store.
+ *
+ * "lock; addl" has worked for longer than "mfence".
+ *
+ * Technically, some x86-ish chips support uncached memory access and/or
+ * special instructions that are weakly ordered. In those cases we'd need
+ * the read and write barriers to be lfence and sfence. But since we don't
+ * do those things, a compiler barrier should be enough.
+ */
+#define pg_memory_barrier() \
+ __asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : "memory")
+#define pg_read_barrier() pg_compiler_barrier()
+#define pg_write_barrier() pg_compiler_barrier()
+
+#elif defined(__ia64__) || defined(__ia64)
+
+/*
+ * Itanium is weakly ordered, so read and write barriers require a full
+ * fence.
+ */
+#define pg_memory_barrier() __asm__ __volatile__ ("mf" : : "memory")
+
+#elif defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
+
+/*
+ * lwsync orders loads with respect to each other, and similarly with stores.
+ * But a load can be performed before a subsequent store, so sync must be used
+ * for a full memory barrier.
+ */
+#define pg_memory_barrier() __asm__ __volatile__ ("sync" : : "memory")
+#define pg_read_barrier() __asm__ __volatile__ ("lwsync" : : "memory")
+#define pg_write_barrier() __asm__ __volatile__ ("lwsync" : : "memory")
+
+#elif defined(__alpha) || defined(__alpha__) /* Alpha */
+
+/*
+ * Unlike all other known architectures, Alpha allows dependent reads to be
+ * reordered, but we don't currently find it necessary to provide a conditional
+ * read barrier to cover that case. We might need to add that later.
+ */
+#define pg_memory_barrier() __asm__ __volatile__ ("mb" : : "memory")
+#define pg_read_barrier() __asm__ __volatile__ ("rmb" : : "memory")
+#define pg_write_barrier() __asm__ __volatile__ ("wmb" : : "memory")
+
+#elif __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
+
+/*
+ * If we're on GCC 4.1.0 or higher, we should be able to get a memory
+ * barrier out of this compiler built-in. But we prefer to rely on our
+ * own definitions where possible, and use this only as a fallback.
+ */
+#define pg_memory_barrier() __sync_synchronize()
+
+#endif
+
+#elif defined(__ia64__) || defined(__ia64)
+
+#define pg_compiler_barrier() _Asm_sched_fence()
+#define pg_memory_barrier() _Asm_mf()
+
+#elif defined(WIN32_ONLY_COMPILER)
+
+/*
+ * Should work on MSVC and Borland. Won't work on Visual Studio 2003, but
+ * we don't support that anyway.
+ */
+#include <intrin.h>
+#pragma intrinsic(_ReadWriteBarrier)
+#define pg_compiler_barrier() _ReadWriteBarrier()
+#define pg_memory_barrier() MemoryBarrier()
+
+#endif
+
+/*
+ * If we have no memory barrier implementation for this architecture, we
+ * fall back to acquiring and releasing a spinlock. This might, in turn,
+ * fall back to the semaphore-based spinlock implementation, which will be
+ * amazingly slow.
+ *
+ * It's not self-evident that every possible legal implementation of a
+ * spinlock acquire-and-release would be equivalent to a full memory barrier.
+ * For example, I'm not sure that Itanium's acq and rel add up to a full
+ * fence. But all of our actual implementations seem OK in this regard.
+ */
+#if !defined(pg_memory_barrier)
+#define pg_memory_barrier() \
+ do { S_LOCK(&dummy_spinlock); S_UNLOCK(&dummy_spinlock); } while (0)
+#endif
+
+/*
+ * If any of the weaker barrier operations are undefined, we upgrade them
+ * to full memory barriers.
+ */
+#if !defined(pg_compiler_barrier)
+#define pg_compiler_barrier() pg_memory_barrier()
+#endif
+#if !defined(pg_read_barrier)
+#define pg_read_barrier() pg_memory_barrier()
+#endif
+#if !defined(pg_write_barrier)
+#define pg_write_barrier() pg_memory_barrier()
+#endif
+
+#endif /* ATOMIC_H */
On 8 August 2011 13:47, Robert Haas <robertmhaas@gmail.com> wrote:
On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way. Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch.
I'm inclined to agree. FWIW I'll point out that in addition to your
point, it's also worth remembering that a shared latch isn't always
necessary, as in the case of the archiver, so this shouldn't
fundamentally shake our confidence in the latch implementation.
Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.
Or, maybe we should think about pushing that into the latch
implementation, while providing an interface that is easy to use
correctly and hard to use incorrectly. The latch code already has a
pretty complicated contract, and I had to bend over backwards using it
in the AVLauncher patch that I submitted to the upcoming commitfest.
Weakening the contract any further, or leaving it equivocal, should be
considered a last resort.
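To be concrete about what I mean, pushing the fence into the latch code itself
might look something like this (a sketch only, assuming the pg_*_barrier()
macros from your patch, and eliding the existing signalling code):

void
SetLatch(volatile Latch *latch)
{
    /*
     * Full barrier, so that any flag variables the caller changed are
     * flushed before we inspect or modify is_set.
     */
    pg_memory_barrier();

    /* Quick exit if already set */
    if (latch->is_set)
        return;

    latch->is_set = true;

    /* ... existing code to wake the owning process goes here ... */
}

void
ResetLatch(volatile Latch *latch)
{
    latch->is_set = false;

    /*
     * Full barrier, so that the reset is visible before the caller goes on
     * to re-check its flag variable; otherwise a concurrent SetLatch()
     * could take its quick-exit path and the wakeup would be lost.
     */
    pg_memory_barrier();
}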
I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases. The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.
That's interesting. Nice work.
It seems likely that Windows will support ARM in some way in the next
couple of years, so it's good that you're handling this using a win32
abstraction where available. Of course, Itanic is currently supported
on Windows, though not by us, and it is obviously never going to be
worth pursuing such support. The point is that it generally isn't safe
to assume that we'll only ever support Windows on x86 and x86_64.
I'd like to see a #warning where you fall back on acquiring and
releasing a spinlock, or at least a #warning if that in turn falls
back on the semaphore-based spinlock implementation. Granted, that
directive isn't in the C standard, but it's pretty portable in
practice. If it breaks the build on some very esoteric platform, that
may not be a bad thing - in fact, maybe you should use the portable
#error directive to make sure it breaks. I'd rather have someone
complain about that than lots more people complain about Postgres
being inexplicably slow, or worse not complain and dismiss Postgres
out of hand for their use-case.
By the way, I don't think that the comment "Won't work on Visual
Studio 2003" is accurate. Everything you do for the
defined(WIN32_ONLY_COMPILER) case is documented as working with 2003.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Aug 8, 2011 at 11:28 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.

Or, maybe we should think about pushing that into the latch
implementation, while providing an interface that is easy to use
correctly and hard to use incorrectly.
That would probably result in unnecessary double-synchronization in
many cases, though. Memory barriers are expensive, and I think we'd
better adopt a policy of trying to figure out where they are truly
needed and using them only in those places. Or to put that another
way, having spent the last couple of months trying to reduce our
synchronization overhead, I'm not real keen on adding more unless we
really need it.
I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases. The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

That's interesting. Nice work.
Thanks!
It seems likely that Windows will support ARM in some way in the next
couple of years, so it's good that you're handling this using a win32
abstraction where available. Of course, Itanic is currently supported
on Windows, though not by us, and it is obviously never going to be
worth pursuing such support. The point is that it generally isn't safe
to assume that we'll only ever support Windows on x86 and x86_64.
Agreed. I saw an MSDN article that referred to "architectures on
which Windows is supported". It didn't say which those were, but I
figured I'd better not assume they were all platforms with strong
memory ordering.
I'd like to see a #warning where you fall back on acquiring and
releasing a spinlock, or at least a #warning if that in turn falls
back on the semaphore-based spinlock implementation. Granted, that
directive isn't in the C standard, but it's pretty portable in
practice. If it breaks the build on some very esoteric platform, that
may not be a bad thing - in fact, maybe you should use the portable
#error directive to make sure it breaks. I'd rather have someone
complain about that than lots more people complain about Postgres
being inexplicably slow, or worse not complain and dismiss Postgres
out of hand for their use-case.
I was thinking about whether and how we could expose that information.
I was almost tempted to add a read-only GUC that would output the
details of what kind of synchronization we're using. So the bad case
would look something like this:
spinlock=semaphore,memory_barrier=spinlock,read_barrier=spinlock,write_barrier=spinlock,compiler_barrier=spinlock
And the good case would look something like this:
spinlock=gcc-inline-asm,memory_barrier=gcc-inline-asm,read_barrier=gcc-inline-asm,write_barrier=gcc-inline-asm,compiler_barrier=gcc-inline-asm
And you can imagine other values, like compiler-intrinsic. Of course,
jamming a whole CSV file into a GUC value is a bit unappealing. Maybe
it would be better to have a system view with two columns,
synchronization_primitive and implementation.
Either way, I'd prefer this to a #warning, because that's only going
to be visible at compile-time, which means that when $PGCOMPANY gets
called in to figure out why the system is dog slow, it's going to take
a bit of forensics to figure out where the build came from and what
options are in use. Is that a huge problem? Maybe not, but it seems
like it would be a nice thing to make it easy for people to log into a
running system and immediately be able to determine which forms of
wizardry we're using there.
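Just to sketch the shape of it, barrier.h could record its choice in a string
at compile time (names entirely made up here), and the GUC or view machinery
would then only need to report those strings:

/*
 * Entirely hypothetical: record which implementation barrier.h settled on,
 * so it can later be surfaced through a read-only GUC or a system view.
 */
#if defined(DISABLE_ATOMICS)
#define PG_MEMORY_BARRIER_IMPL      "spinlock"
#elif defined(__INTEL_COMPILER)
#define PG_MEMORY_BARRIER_IMPL      "icc-intrinsic"
#elif defined(__GNUC__)
#define PG_MEMORY_BARRIER_IMPL      "gcc-inline-asm"
#elif defined(WIN32_ONLY_COMPILER)
#define PG_MEMORY_BARRIER_IMPL      "msvc-intrinsic"
#else
#define PG_MEMORY_BARRIER_IMPL      "spinlock"
#endif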
By the way, I don't think that the comment "Won't work on Visual
Studio 2003" is accurate. Everything you do for the
defined(WIN32_ONLY_COMPILER) case is documented as working with 2003.
Oh, really? I thought I ran across something that said otherwise, but
it might have been wrong, or maybe I got confused somewhere along the
line. I haven't tested anything more than that this compiles without
erroring out on MacOS X, so the probability of errors and omissions is
not small.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Sun, Aug 7, 2011 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Any protocol of that sort has *obviously* got a race condition between
changes of the latch state and changes of the separate flag state, if run
in a weak-memory-ordering machine.
On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way. Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch. Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.
Heikki argued the same thing in
http://archives.postgresql.org/message-id/4CEA5A0F.1030602@enterprisedb.com
but I don't think anyone has actually done the legwork to verify that
current uses are safe. So [ ... warms up grep ... ]
Current callers of WaitLatch(OrSocket):
XLogPageRead waits on XLogCtl->recoveryWakeupLatch
    latch is set by WakeupRecovery, which is called by process'
    own signal handlers (OK) and XLogWalRcvFlush. The latter is OK
    because there's a spinlock protecting the transmitted data.

pgarch_MainLoop waits on mainloop_latch
    non-issue because latch is process-local

WalSndLoop waits on MyWalSnd->latch
    latch is set by signal handlers and WalSndWakeup. The latter is
    OK because it's called right after XLogFlush (which certainly
    locks whatever it touches) or a spinlock release.

SyncRepWaitForLSN waits on MyProc->waitLatch
    latch is set by signal handlers and SyncRepWakeQueue. The
    latter appears unsafe at first glance, because it sets the
    shared variable (thisproc->syncRepState) and immediately
    sets the latch. However, the receiver does this curious dance:

        syncRepState = MyProc->syncRepState;
        if (syncRepState == SYNC_REP_WAITING)
        {
            LWLockAcquire(SyncRepLock, LW_SHARED);
            syncRepState = MyProc->syncRepState;
            LWLockRelease(SyncRepLock);
        }

    which I believe makes it safe, though rather underdocumented;
    if a race does occur, the receiver will acquire the lock and
    recheck, and the lock acquisition will block until the caller of
    SyncRepWakeQueue has released SyncRepLock, and that release
    will flush any pending writes to shared memory.
Conclusions:
(1) 9.1 does not have a problem of this sort.
(2) Breathing hard on any of this code could break it.
(3) I'm not going to hold still for not inserting a memory barrier
into SetLatch, once we have the code. Unsubstantiated performance
worries are not a sufficient argument not to --- as a wise man once
said, you can make the code arbitrarily fast if it doesn't have to
give the right answer.
(4) In the meantime, we'd better document that it's caller's
responsibility to ensure that the flag variable is adequately
protected by locking. I think SyncRepWaitForLSN could use a warning
comment, too. I will go do that.
regards, tom lane
On 08.08.2011 19:39, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way. Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch. Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.

Heikki argued the same thing in
http://archives.postgresql.org/message-id/4CEA5A0F.1030602@enterprisedb.com
but I don't think anyone has actually done the legwork to verify that
current uses are safe. So [ ... warms up grep ... ]
Thanks for double-checking.
(3) I'm not going to hold still for not inserting a memory barrier
into SetLatch, once we have the code. Unsubstantiated performance
worries are not a sufficient argument not to --- as a wise man once
said, you can make the code arbitrarily fast if it doesn't have to
give the right answer.
I agree we definitely should add the memory barrier instruction there
once we have them.
(4) In the meantime, we'd better document that it's caller's
responsibility to ensure that the flag variable is adequately
protected by locking. I think SyncRepWaitForLSN could use a warning
comment, too. I will go do that.
Ok, thanks.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com