sinval synchronization considered harmful
For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things. Here's the
data.
On unpatched master, performance scales pretty much linearly out to 32
clients. As you add more clients, it drops off:
[1 client]
tps = 4459.203159 (including connections establishing)
tps = 4488.456559 (including connections establishing)
tps = 4449.570530 (including connections establishing)
[8 clients]
tps = 33524.668762 (including connections establishing)
tps = 33501.833907 (including connections establishing)
tps = 33382.380908 (including connections establishing)
[32 clients]
tps = 178394.863906 (including connections establishing)
tps = 178457.706972 (including connections establishing)
tps = 179365.310179 (including connections establishing)
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)
With the lazy vxid locks patch, all of those numbers get better by a
few percentage points (the more clients, the more points) except at 80
clients, where the performance is sometimes better and sometimes
worse. These are 5-minute runs:
[80 clients, with lazy vxid locks]
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)
I can't explain in detail why there is so much variation between the
5-minute runs, or why some runs perform worse than without the lazy
vxid locks, but I can say that applying the first of the two attached
patches (sinval-per-backend-mutex.patch) fixes it. The patch gets rid
of SInvalReadLock and instead gives each backend its own spinlock.
SICleanupQueue() must therefore get somewhat fuzzy answers, but it
shouldn't matter. The only real problem is if you need to do the
super-duper-cleanup where you subtract a constant from all the values
in unison. I just got rid of that completely, by widening the
counters to 64 bits. Assuming I haven't botched the implementation,
this is clearly a win. There is still some performance drop-off going
from 32 clients to 80 clients, but it is much less.
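To make the shape of the per-backend-mutex idea concrete, here is a rough sketch. The struct and function names are invented for illustration (this is not the actual patch), and a plain pthread mutex stands in for a PostgreSQL spinlock:

```c
/* Illustrative sketch of the per-backend-mutex idea: each backend's
 * read position gets its own lock instead of one shared
 * SInvalReadLock.  Names are invented; pthreads stand in for PG
 * spinlocks. */
#include <pthread.h>
#include <stdint.h>

typedef struct BackendInvalState
{
    pthread_mutex_t mutex;      /* per-backend lock (slock_t in PG) */
    uint64_t        nextMsgNum; /* 64-bit: never needs the
                                 * subtract-a-constant cleanup */
    int             resetState;
} BackendInvalState;

/* A reader touches only its own lock, so readers no longer contend
 * with one another; SICleanupQueue() can peek at nextMsgNum without
 * the lock and tolerate a slightly stale ("fuzzy") answer. */
static uint64_t
advance_read_pos(BackendInvalState *st, uint64_t maxMsgNum)
{
    uint64_t    consumed;

    pthread_mutex_lock(&st->mutex);
    consumed = maxMsgNum - st->nextMsgNum;
    st->nextMsgNum = maxMsgNum;
    pthread_mutex_unlock(&st->mutex);
    return consumed;
}
```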
[80 clients, with lazy vxid locks and sinval-per-backend-mutex]
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)
Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement and (2) to recheck resetState after
reading the entries from the queue, to see if we got reset while we
were reading those entries. Since x86 guarantees that writes will
become visible in the order they are executed, we only need to make
sure that the compiler doesn't rearrange things. As long as we first
read the maxMsgNum and then read the messages, we can't read garbage.
As long as we read the messages before we check resetState, we will be
sure to notice if we got reset before we read all the messages (which
is the only way that we can have read garbage messages).
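A minimal sketch of that lock-free read path. Structure and field names are invented (not the actual sinval code), a compiler barrier stands in for whatever primitive gets chosen, and it assumes the 64-bit counter load is atomic, which is a separate question on 32-bit hardware:

```c
/* Sketch of the lock-free SIGetDataEntries() logic described above:
 * (1) read maxMsgNum, (2) read the messages, (3) recheck resetState.
 * On x86 only compiler reordering must be prevented. */
#include <stdint.h>

#define QUEUE_SIZE 4096
#define pg_compiler_barrier() __asm__ __volatile__("" ::: "memory")

typedef struct { int msg; } SharedInvalidationMessage;

typedef struct
{
    volatile uint64_t maxMsgNum;    /* 64-bit: only ever increases */
    SharedInvalidationMessage buf[QUEUE_SIZE];
} SISeg;

typedef struct
{
    uint64_t nextMsgNum;
    volatile int resetState;
} BackendState;

/* Returns number of messages copied, or -1 if we were reset while
 * reading (in which case any copied messages may be garbage). */
static int
si_get_data_entries(SISeg *seg, BackendState *me,
                    SharedInvalidationMessage *out)
{
    uint64_t    max, n = 0;

    max = seg->maxMsgNum;          /* (1) read the counter first */
    pg_compiler_barrier();
    while (me->nextMsgNum < max)   /* (2) then read the messages */
        out[n++] = seg->buf[me->nextMsgNum++ % QUEUE_SIZE];
    pg_compiler_barrier();
    if (me->resetState)            /* (3) finally recheck resetState */
        return -1;
    return (int) n;
}
```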
Now this implementation (sinval-unlocked.patch) is obviously not a
serious proposal as it stands, because on processors with weak memory
ordering it will presumably blow up all over the place. But here are
the results on x86:
[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)
Now, that's actually *faster* than the above 32-core numbers. Turns
out, with this approach, there's essentially no degradation between 32
clients and 80 clients. It appears that even one spinlock acquisition
in SIGetDataEntries() is too many. Currently, we have *three*: one to
get SInvalReadLock, one to get msgnumLock, and one to release
SInvalReadLock.
For contrast, on a two-core MacBook Pro, I can't measure any
difference at all between lazy-vxid,
lazy-vxid+sinval-per-backend-mutex, and lazy-vxid+sinval-unlocked.
The last might even be a touch slower, although there's enough noise
that it's hard to tell for sure, and the effect is very small if there
is one. But on the 32-core machine, the differences are dramatic.
Thoughts? Comments? Ideas?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
SIGetDataEntries(). I believe we need to bite the bullet and
rewrite this using a lock-free algorithm, using memory barriers on
processors with weak memory ordering.
[32 processors; 80 clients]
On unpatched master
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)
With the lazy vxid locks patch
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)
gets rid of SInvalReadLock and instead gives each backend its own
spinlock.
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)
SIGetDataEntries() can pretty easily be made lock-free.
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)
Thoughts? Comments? Ideas?
Very impressive! Those numbers definitely justify some #ifdef code
to provide alternatives for weak memory ordering machines versus
others. With the number of CPUs climbing as it is, this is very
important work!
-Kevin
On Thu, Jul 21, 2011 at 12:16 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Very impressive! Those numbers definitely justify some #ifdef code
to provide alternatives for weak memory ordering machines versus
others. With the number of CPUs climbing as it is, this is very
important work!
Thanks. I'm not thinking so much about #ifdef (although that could
work, too) as I am about providing some primitives to allow this sort
of code-writing to be done in a somewhat less ad-hoc fashion. It
seems like there are basically two categories of machines we need to
worry about.
1. Machines with strong memory ordering. On this category of machines
(which include x86), the CPU basically does not perform loads or
stores out of order. On some of these machines, it is apparently
possible for there to be some reordering of stores relative to loads,
but if the program stores two values or loads two values, those
operations will be performed in the same order they appear in the
program. The main thing you need to make your code work reliably on
these machines is a primitive that keeps the compiler from reordering
your code during optimization. On x86, certain categories of exotic
instructions do require explicit memory-barrier instructions, but
ordinary loads and stores do not.
2. Machines with weak memory ordering. On this category of machines
(which includes PowerPC, Dec Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization. You still need to keep the
compiler from moving things around, too. Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself. This is otherwise known as "we
have basically no cache coherency circuits on this chip at all". On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
So you can imagine a primitive that is defined to be a compiler
barrier on machines with strong memory ordering, and as a memory
fencing instruction on machines with weak memory ordering.
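One possible shape for such a primitive, sketched with GCC-style inline asm and builtins. The macro name and the exact architecture tests are illustrative, not a concrete proposal:

```c
/* A barrier primitive that is only a compiler barrier where the
 * hardware already preserves the needed ordering, and a real fence
 * elsewhere.  Name and #if tests are illustrative. */
#if defined(__i386__) || defined(__x86_64__)
/* strong ordering: just stop the compiler from reordering */
#define pg_memory_barrier()  __asm__ __volatile__("" ::: "memory")
#else
/* weak ordering (PowerPC, Alpha, ...): emit a full hardware fence */
#define pg_memory_barrier()  __sync_synchronize()
#endif
```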
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
... On these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.
To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
regards, tom lane
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... On these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.
To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[1]: This was a suggestion from Noah Misch. I wasn't quite convinced
when he initially made it, but having studied the issue a lot more, I
now am. The CPU doesn't know how many processes have the memory
mapped into their address space.
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.
I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 21, 2011 at 3:22 PM, Dave Page <dpage@pgadmin.org> wrote:
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.
I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.
Yes!
More processors is better, of course, but having anything at all to
test on would be an improvement.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 21, 2011 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jul 21, 2011 at 3:22 PM, Dave Page <dpage@pgadmin.org> wrote:
On Thu, Jul 21, 2011 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
Getting it to work on x86 is not the hard part.
I believe there's a PPC box in our storage facility in NJ that we
might be able to dig out for you. There's also a couple in our India
office. Let me know if they'd be of help.
Yes!
More processors is better, of course, but having anything at all to
test on would be an improvement.
OK, will check with India first, as it'll be easier for them to deploy.
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
There are multi-CPU PPCen in the buildfarm, or at least there were last
time I broke the sinval code ;-). Note that testing on a single-core
PPC will prove nothing.
regards, tom lane
On Thu, Jul 21, 2011 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
I think the real challenge is going to be testing. If anyone has a
machine with weak memory ordering they can give me access to, that
would be really helpful for flushing the bugs out of this stuff.
There are multi-CPU PPCen in the buildfarm, or at least there were last
time I broke the sinval code ;-). Note that testing on a single-core
PPC will prove nothing.
Yeah, I was just thinking about that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things. Here's the
data.
On unpatched master, performance scales pretty much linearly out to 32
clients. As you add more clients, it drops off:
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)
[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)
Nice numbers. The sinval-unlocked.patch implementation looks like it's taking a
sound direction.
In
http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
you quoted 210k TPS when you stubbed out AcceptInvalidationMessages(). Is it
correct to conclude that AcceptInvalidationMessages() still reduces the
transaction rate by 5-10% with this stack of patches?
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 21, 2011 at 4:54 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things. Here's the
data.
On unpatched master, performance scales pretty much linearly out to 32
clients. As you add more clients, it drops off:
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)
[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)
Nice numbers. The sinval-unlocked.patch implementation looks like it's taking a
sound direction.
In
http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
you quoted 210k TPS when you stubbed out AcceptInvalidationMessages(). Is it
correct to conclude that AcceptInvalidationMessages() still reduces the
transaction rate by 5-10% with this stack of patches?
Good question - I have not tested.
One idea I just had... if we use a 64-bit counter for maxMsgNum, maybe
we could make AcceptInvalidationMessages() a macro, something like
this:
if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum)
ReallyAcceptInvalidationMessages();
That ought to be extremely cheap and - if we use 64-bit counters for
the message-number counters - safe. You might object that the load of
maxMsgNum might migrate backward, but it can't possibly back up any
further than the preceding lock acquisition, since that's required to
be a full memory barrier on every architecture. And if we haven't
acquired a relevant lock, then a relevant sinval message could show up
an instant after we check, regardless of the implementation.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jul21, 2011, at 21:15 , Robert Haas wrote:
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... On these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.
To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.
As I discovered while playing with various lockless algorithms to
improve our LWLocks, spin locks aren't actually a replacement for
a (full) barrier.
Lock acquisition only really needs to guarantee that loads and stores
which come after the acquisition operation in program order (i.e., in
the instruction stream) aren't globally visible before that operation
completes. This kind of barrier behaviour is often fittingly called
"acquire barrier".
Similarly, a lock release operation only needs to guarantee that loads
and stores which occur before that operation in program order are
globally visible before the release operation completes. This, again,
is fittingly called "release barrier".
Now assume the following code fragment
global1 = 1;
SpinLockAcquire();
SpinLockRelease();
global2 = 1;
If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease()
has "release barrier" semantics, then it's possible for the store to global1
to be delayed until after SpinLockAcquire(), and similarly for the store
to global2 to be executed before SpinLockRelease() completes. In other
words, what happens is
SpinLockAcquire();
global1 = 1;
global2 = 1;
SpinLockRelease();
But once that can happen, there's no reason that it couldn't also be
SpinLockAcquire();
global2 = 1;
global1 = 1;
SpinLockRelease();
I didn't check if any of our spin lock implementations is actually affected
by this, but it doesn't seem wise to rely on them being full barriers, even
if it may be true today.
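The hazard described above can be restated with C11 atomics. This is a self-contained illustration, not PostgreSQL's actual spinlock code:

```c
/* An acquire/release pair does not stop surrounding stores from
 * migrating *into* the critical section, so the two global stores
 * below may become visible in either order. */
#include <stdatomic.h>

static atomic_int lock;
static int global1, global2;

static void
demo(void)
{
    global1 = 1;                 /* may sink below the acquire ... */

    /* acquire: later accesses can't move above this point */
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
        ;
    /* release: earlier accesses can't move below this point */
    atomic_store_explicit(&lock, 0, memory_order_release);

    global2 = 1;                 /* ... and this may hoist above the
                                  * release, giving the reordering
                                  * described in the text */
}
```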
best regards,
Florian Pflug
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement
On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores? You must avoid reading it between those
stores.
--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jul 21, 2011 at 02:31:15PM -0400, Robert Haas wrote:
1. Machines with strong memory ordering. On this category of machines
(which include x86), the CPU basically does not perform loads or
stores out of order. On some of these machines, it is apparently
possible for there to be some reordering of stores relative to loads,
but if the program stores two values or loads two values, those
operations will be performed in the same order they appear in the
program.
This is all correct, but...
The main thing you need to make your code work reliably on
these machines is a primitive that keeps the compiler from reordering
your code during optimization.
If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.
Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
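A minimal sketch of the store-then-load pattern Dekker's algorithm relies on, using C11 atomics. Without the full fence, even x86 may satisfy the load before the store is globally visible, letting both threads read 0:

```c
/* Each side stores its own flag, then loads the other side's.  The
 * seq_cst fence forbids the StoreLoad reordering that x86 otherwise
 * permits. */
#include <stdatomic.h>

static atomic_int flag[2];

/* Returns what side `me` observes of the other side's flag. */
static int
try_enter(int me)
{
    atomic_store_explicit(&flag[me], 1, memory_order_relaxed);
    /* Without this full fence, the load below may be satisfied
     * before the store above is globally visible. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag[1 - me], memory_order_relaxed);
}
```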
2. Machines with weak memory ordering. On this category of machines
(which includes PowerPC, Dec Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization. You still need to keep the
compiler from moving things around, too. Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself. This is otherwise known as "we
have basically no cache coherency circuits on this chip at all". On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends).
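Such an operation might look like the following sketch. The macro name is invented; only the Alpha branch emits a real barrier, and everywhere else dependent loads are already ordered so it compiles to nothing:

```c
/* A data-dependency barrier in the read_barrier_depends style:
 * a real memory barrier on Alpha, a no-op elsewhere. */
#if defined(__alpha__)
#define pg_read_barrier_depends()  __asm__ __volatile__("mb" ::: "memory")
#else
#define pg_read_barrier_depends()  ((void) 0)
#endif

struct node { int field; };

/* Typical usage: load a shared pointer, then dereference it. */
static int
load_through(struct node **pp)
{
    struct node *p = *pp;        /* load the pointer ... */
    pg_read_barrier_depends();   /* ... order the dependent read ... */
    return p->field;             /* ... then dereference it */
}
```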
Alternatively, one might question how important it is to support the
Alpha these days...
Dan
--
Dan R. K. Ports MIT CSAIL http://drkp.net/
On Jul21, 2011, at 03:46 , Robert Haas wrote:
Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement and (2) to recheck resetState after
reading the entries from the queue, to see if we got reset while we
were reading those entries. Since x86 guarantees that writes will
become visible in the order they are executed, we only need to make
sure that the compiler doesn't rearrange things. As long as we first
read the maxMsgNum and then read the messages, we can't read garbage.
As long as we read the messages before we check resetState, we will be
sure to notice if we got reset before we read all the messages (which
is the only way that we can have read garbage messages).
Sounds sensible. There's one additional hazard though - you'll also
need the reads to be atomic. x86 guarantees that for loads of up to
32 bits (i386) or 64 bits (x86-64), but only for reads from properly
aligned addresses (4 bytes for 4-byte reads, 8 bytes for 8-byte reads).
I found that out the hard way a few days ago, again while playing with
different LWLock implementations, when I botched my test setup and
the proc array entries ended up being misaligned. Boy, was it fun
to debug the random crashes caused by non-atomic pointer reads...
If we widen the counter to 64 bits, reading it atomically becomes a
bit of a challenge on i386, but it is doable. From what I remember,
there are two options. You can either use the 8-byte compare-and-exchange
operation, but it might be that only quite recent CPUs support that. The
other options seems to be to use floating-point instructions. I believe
the latter is what Intel's own Thread Building Blocks library does, but
I'd have to re-check to be sure. It might also be that, once you start
using floating-point instructions, you find that you actually do need
fencing instructions even on x86. Dunno if the weaker ordering affects only
SIMD instructions or all floating point stuff...
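For the compare-and-exchange option, one commonly used trick is a CAS with identical old and new values: it reads the 8-byte location atomically (cmpxchg8b on i386) without ever changing it. A sketch with GCC builtins; the function name is illustrative:

```c
/* Atomically read a 64-bit counter.  On 64-bit x86 an aligned plain
 * load is already atomic; on 32-bit, CAS(p, 0, 0) returns the current
 * value read atomically, and only "swaps" 0 for 0 if *p happens to
 * be 0, which is harmless. */
#include <stdint.h>

static uint64_t
atomic_read_u64(volatile uint64_t *p)
{
#if defined(__x86_64__)
    return *p;                      /* aligned 8-byte load is atomic */
#else
    return __sync_val_compare_and_swap(p, 0, 0);
#endif
}
```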
best regards,
Florian Pflug
On Thu, Jul 21, 2011 at 6:22 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jul21, 2011, at 21:15 , Robert Haas wrote:
On Thu, Jul 21, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... On these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
Right, and the reason that a spinlock fixes it is that we have memory
barrier instructions built into the spinlock code sequences on machines
where it matters.
To get to the point where we could do the sort of optimization Robert
is talking about, someone will have to build suitable primitives for
all the platforms we support. In the cases where we use gcc ASM in
s_lock.h, it shouldn't be too hard to pull out the barrier
instruction(s) ... but on platforms where we rely on OS-supplied
functions, some research is going to be needed.
Yeah, although falling back to SpinLockAcquire() and SpinLockRelease()
on a backend-private slock_t should work anywhere that PostgreSQL
works at all[1]. That will probably be slower than a memory fence
instruction and certainly slower than a compiler barrier, but the
point is that - right now - we're doing it the slow way everywhere.
As I discovered while playing with various lockless algorithms to
improve our LWLocks, spin locks aren't actually a replacement for
a (full) barrier.
Lock acquisition only really needs to guarantee that loads and stores
which come after the acquisition operation in program order (i.e., in
the instruction stream) aren't globally visible before that operation
completes. This kind of barrier behaviour is often fittingly called
"acquire barrier".
Similarly, a lock release operation only needs to guarantee that loads
and stores which occur before that operation in program order are
globally visible before the release operation completes. This, again,
is fittingly called "release barrier".
Now assume the following code fragment
global1 = 1;
SpinLockAcquire();
SpinLockRelease();
global2 = 1;
If SpinLockAcquire() has "acquire barrier" semantics, and SpinLockRelease()
has "release barrier" semantics, then it's possible for the store to global1
to be delayed until after SpinLockAcquire(), and similarly for the store
to global2 to be executed before SpinLockRelease() completes. In other
words, what happens is
SpinLockAcquire();
global1 = 1;
global2 = 1;
SpinLockRelease();
But once that can happen, there's no reason that it couldn't also be
SpinLockAcquire();
global2 = 1;
global1 = 1;
SpinLockRelease();
I didn't check if any of our spin lock implementations is actually affected
by this, but it doesn't seem wise to rely on them being full barriers, even
if it may be true today.
Hmm. I'm not worried about that. AFAIK, only IA64 has such an
implementation, and our existing spinlock implementation doesn't use
it. If we were to add something like that in the future, we'd
presumably know that we were doing it, and would add the appropriate
memory barrier primitive at the same time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 21, 2011 at 6:44 PM, Dan Ports <drkp@csail.mit.edu> wrote:
If you're suggesting that hardware memory barriers aren't going to be
needed to implement lock-free code on x86, that isn't true. Because a
read can be reordered with respect to a write to a different memory
location, you can still have problems. So you do still need memory
barriers, just fewer of them.
Dekker's algorithm is the classic example: two threads each set a flag
and then check whether the other thread's flag is set. In any
sequential execution, at least one should see the other's flag set, but
on the x86 that doesn't always happen. One thread's read might be
reordered before its write.
In the case of sinval, what we need to do for SIGetDataEntries() is,
approximately, a bunch of loads, followed by a store to one of the
locations we loaded (which no one else can have written meanwhile).
So I think that's OK.
In SIInsertDataEntries(), what we need to do is, approximately, take a
lwlock, load from a location which can only be written while holding
the lwlock, do a bunch of stores, ending with a store to that first
location, and release the lwlock. I think that's OK, too.
2. Machines with weak memory ordering. On this category of machines
(which includes PowerPC, Dec Alpha, and maybe some others), the CPU
reorders memory accesses arbitrarily unless you explicitly issue
instructions that enforce synchronization. You still need to keep the
compiler from moving things around, too. Alpha is particularly
pernicious, because something like a->b can fetch the pointed-to value
before loading the pointer itself. This is otherwise known as "we
have basically no cache coherency circuits on this chip at all". On
these machines, you need to issue an explicit memory barrier
instruction at each sequence point, or just acquire and release a
spinlock.
The Alpha is pretty much unique (thankfully!) in allowing dependent
reads to be reordered. That makes it even weaker than the typical
weak-ordering machine. Since reading a pointer and then dereferencing
it is a pretty reasonable thing to do regularly in RCU code, you
probably don't want to emit barriers in between on architectures where
it's not actually necessary. That argues for another operation that's
defined to be a barrier (mb) on the Alpha but a no-op elsewhere.
Certainly the Linux kernel found it useful to do so
(read_barrier_depends).
Alternatively, one might question how important it is to support the
Alpha these days...
Well, currently, we do, so we probably don't want to drop support for
that without some careful thought. I searched the archive and found
someone trying to compile 8.3.something on Alpha just a few years ago,
so it's apparently not totally dead yet.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement
On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores? You must avoid reading it between those
stores.
Now that is a potentially big problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah@2ndquadrant.com> wrote:
On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement
On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores? You must avoid reading it between those stores.
Now that is a potentially big problem.
Could we do something similar to the xxid hacks? That is, we have a lot
of counters that should be fairly close to each other, so we store only
the low-order 32 bits of each notional value, and separately maintain a
common high-order word. You probably would need some additional
overhead each time the high-order word bumps, but that's reasonably
infrequent.
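A sketch of the reconstruction step under that scheme. This is illustrative only; real code would need care around the moment the high-order word advances, and the names here are invented:

```c
/* Store only the low 32 bits per counter plus one shared high-order
 * word, so 32-bit platforms only ever load/store 32-bit values.
 * Reconstruct a full 64-bit value by comparing a counter's low word
 * against the current low word: if the stored low word is "ahead",
 * it must predate the last high-word bump. */
#include <stdint.h>

static uint64_t
full_value(uint32_t high, uint32_t low, uint32_t current_low)
{
    if (low > current_low)
        high--;                 /* belongs to the previous epoch */
    return ((uint64_t) high << 32) | low;
}
```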
regards, tom lane