spinlock contention
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jun12, 2011, at 23:39 , Robert Haas wrote:
So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...
I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.
SInvalReadLock is a good example of a lock which seems barely to be
necessary at all. Except on extraordinarily rare occasions, it is
only ever taken in share mode. Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy. I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether. I have a feeling if I
were 25% smarter I could do it, but I'm not.
So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read. We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing. I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written. What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read. When a message is written, the global counter is bumped. When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.
There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.
A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic. In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough. We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces). When a backend compares its counter to the
global counter, it does so while holding this lock. When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1). So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.
If anyone has read this far, you have my undying gratitude....
especially if you respond with comments.
ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 23.06.2011 18:42, Robert Haas wrote:
ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.
ProcArrayLock is currently held for a relatively long period of time
when a snapshot is taken, especially if there's a lot of backends. There
are three operations to the procarray:
1. Advertising a new xid belonging to my backend.
2. Ending a transaction.
3. Getting a snapshot.
Advertising a new xid is currently done without a lock, assuming that
setting an xid in shared memory is atomic. To end a transaction, you
acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire
ProcArrayLock in shared mode, and scan the whole procarray.
I wonder if it would be a better tradeoff to keep a "materialized"
snapshot in shared memory that's updated every time a new xid is
assigned or a transaction ends. Getting a snapshot would become much
cheaper, as you could just memcpy the ready-built snapshot from shared
memory into local memory.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Jun23, 2011, at 17:42 , Robert Haas wrote:
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jun12, 2011, at 23:39 , Robert Haas wrote:
So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.
It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bit, though -
so measuring this certainly isn't easy. Also, I'm absolute no
expert for this, so it might very well be that' I'm missing something
fundamental.
SInvalReadLock is a good example of a lock which seems barely to be
necessary at all. Except on extraordinarily rare occasions, it is
only ever taken in share mode.
This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:
http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
It's an userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torwalds as a response
to the following thread
http://lwn.net/Articles/284741/
It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exlcusive lock (again in
the uncontested case).
The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able to) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.
Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy. I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether. I have a feeling if I
were 25% smarter I could do it, but I'm not.
This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).
The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.
Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.
So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read. We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing. I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written. What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read. When a message is written, the global counter is bumped. When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.
Sounds sensible. I do fear, however, that if we start to micro-optimize
around every single LWLock we're in for quite a maintenance burden in
the future. Even if each single modification is relatively simple, the
complexity still adds ups. So I still believe that it'd be better to
start with optimizing LWLocks in general, and to only touch the LWLocks
which are still hot spots afterwards.
There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.
If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as
#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
#define MemoryBarrierReadWrite \
do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }
But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.
A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic. In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough. We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces). When a backend compares its counter to the
global counter, it does so while holding this lock. When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1). So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.
OTOH, don't all architectures that are interesting targets of
performance tuning provide atomic access to 64-bit values? Even i386
has an 8-byte compare-and-exchange instruction I think...
best regards,
Florian Pflug
On Thu, Jun 23, 2011 at 12:19 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 23.06.2011 18:42, Robert Haas wrote:
ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.ProcArrayLock is currently held for a relatively long period of time when a
snapshot is taken, especially if there's a lot of backends. There are three
operations to the procarray:1. Advertising a new xid belonging to my backend.
2. Ending a transaction.
3. Getting a snapshot.Advertising a new xid is currently done without a lock, assuming that
setting an xid in shared memory is atomic.
The issue isn't just whether it's atomic, but whether some other
backend scanning the ProcArray just afterwards might fail to see the
update due to memory-ordering effects. But I think even if they do,
there's no real harm done. Since we're holding XidGenLock, we know
that the XID we are advertising is higher than any XID previously
advertised, and in particular it must be higher than
latestCompletedXid, so GetSnapshotdata() will ignore it anyway.
I am rather doubtful that this code is strictly safe:
myproc->subxids.xids[nxids] = xid;
myproc->subxids.nxids = nxids + 1;
In practice, the window for a race condition is minimal, because (a)
in many cases, nxids will be on the same cache line as the xid being
set; (b) even if it's not, we almost immediately release XidGenLock,
which acts as a memory barrier; (c) on many common architectures, such
as x86, stores are guaranteed to become visible in execution order;
(d) if, on some architecture, you manged to see the stores out of
order and thus loaded a garbage XID from the end of the array, it
might happen to be zero (which would be ignored, as a non-normal
transaction ID) or the transaction might happen never to examine a
tuple with that XID anyway.
To end a transaction, you acquire
ProcArrayLock in exclusive mode. To get a snapshot, you acquire
ProcArrayLock in shared mode, and scan the whole procarray.I wonder if it would be a better tradeoff to keep a "materialized" snapshot
in shared memory that's updated every time a new xid is assigned or a
transaction ends. Getting a snapshot would become much cheaper, as you could
just memcpy the ready-built snapshot from shared memory into local memory.
At least in the case of the pgbench -S workload, I doubt that would
help at all. The problem isn't that the LWLocks are being held for
too long -- they are all being held in shared mode and don't conflict
anyway. The problem is that everyone has to acquire and release the
spinlock in order to acquire the LWLock, and again to release it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jun23, 2011, at 17:42 , Robert Haas wrote:
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jun12, 2011, at 23:39 , Robert Haas wrote:
So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bit, though -
so measuring this certainly isn't easy. Also, I'm absolute no
expert for this, so it might very well be that' I'm missing something
fundamental.SInvalReadLock is a good example of a lock which seems barely to be
necessary at all. Except on extraordinarily rare occasions, it is
only ever taken in share mode.This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
It's an userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torwalds as a response
to the following threadhttp://lwn.net/Articles/284741/
It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exlcusive lock (again in
the uncontested case).The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able to) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy. I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether. I have a feeling if I
were 25% smarter I could do it, but I'm not.This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read. We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing. I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written. What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read. When a message is written, the global counter is bumped. When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.Sounds sensible. I do fear, however, that if we start to micro-optimize
around every single LWLock we're in for quite a maintenance burden in
the future. Even if each single modification is relatively simple, the
complexity still adds ups. So I still believe that it'd be better to
start with optimizing LWLocks in general, and to only touch the LWLocks
which are still hot spots afterwards.There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
#define MemoryBarrierReadWrite \
do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic. In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough. We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces). When a backend compares its counter to the
global counter, it does so while holding this lock. When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1). So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.OTOH, don't all architectures that are interesting targets of
performance tuning provide atomic access to 64-bit values? Even i386
has an 8-byte compare-and-exchange instruction I think...
yup, 'lock cmpxchg8b'.
see: http://www.niallryan.com/node/137 for a way to do it via fpu
without a loop. the 'lock' is a bus lock.
merlin
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote:
It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bit, though -
so measuring this certainly isn't easy. Also, I'm absolute no
expert for this, so it might very well be that' I'm missing something
fundamental.
Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one. Again, please feel free
to provide contrary evidence. I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x. (The less busy lock may get a larger reduction
but that may not help you much.) If you what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle. That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.
SInvalReadLock is a good example of a lock which seems barely to e
necessary at all. Except on extraordinarily rare occasions, it is
only ever taken in share mode.This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
It's an userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torwalds as a response
to the following threadhttp://lwn.net/Articles/284741/
It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exlcusive lock (again in
the uncontested case).The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able to) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.
I tried rewriting the LWLocks using CAS. It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay. Perhaps fetch-and-add would
be better, but I'm not holding my breath. Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.
Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy. I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether. I have a feeling if I
were 25% smarter I could do it, but I'm not.This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.
Interesting idea.
So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read. We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing. I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written. What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read. When a message is written, the global counter is bumped. When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.Sounds sensible. I do fear, however, that if we start to micro-optimize
around every single LWLock we're in for quite a maintenance burden in
the future. Even if each single modification is relatively simple, the
complexity still adds ups. So I still believe that it'd be better to
start with optimizing LWLocks in general, and to only touch the LWLocks
which are still hot spots afterwards.
I was hoping to do it that way, too, but I have so far been able to
demonstrate a performance benefit from that approach.
There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
#define MemoryBarrierReadWrite \
do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.
Yes, a set of general memory barrier primitives would be handy.
Unfortunately, it would probably be quite a bit of work to develop
this and verify that it works properly on all supported platforms.
A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic. In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough. We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces). When a backend compares its counter to the
global counter, it does so while holding this lock. When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1). So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.OTOH, don't all architectures that are interesting targets of
performance tuning provide atomic access to 64-bit values? Even i386
has an 8-byte compare-and-exchange instruction I think...
Maybe so... but if I can get by without needing that, so much the
better. I don't really want to have to have a separate code path for
architectures that don't support that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jun23, 2011, at 22:15 , Robert Haas wrote:
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote:
It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bit, though -
so measuring this certainly isn't easy. Also, I'm absolute no
expert for this, so it might very well be that' I'm missing something
fundamental.Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one. Again, please feel free
to provide contrary evidence. I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x. (The less busy lock may get a larger reduction
but that may not help you much.) If you what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.
Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contested LWLocks for the others might still be
worthwhile.
That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.I tried rewriting the LWLocks using CAS. It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay. Perhaps fetch-and-add would
be better, but I'm not holding my breath. Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.
Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
while (atomic_inc_post(lock) > 0) {
atomic_dec(lock);
block(lock);
}
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).
There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
#define MemoryBarrierReadWrite \
do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.Yes, a set of general memory barrier primitives would be handy.
Unfortunately, it would probably be quite a bit of work to develop
this and verify that it works properly on all supported platforms.
The idea would be to start out with something trivial like the above.
Maybe with an #if for compilers which have something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully done by people who actually
own the hardware and can test.
best regards,
Florian Pflug
On Thu, Jun 23, 2011 at 5:35 PM, Florian Pflug <fgp@phlo.org> wrote:
Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one. Again, please feel free
to provide contrary evidence. I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x. (The less busy lock may get a larger reduction
but that may not help you much.) If you what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contested LWLocks for the others might still be
worthwhile.
Hey, if we can show that it works, sign me up.
That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.I tried rewriting the LWLocks using CAS. It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay. Perhaps fetch-and-add would
be better, but I'm not holding my breath. Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
while (atomic_inc_post(lock) > 0) {
atomic_dec(lock);
block(lock);
}
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).
Attached...
The idea would be to start out with something trivial like the above.
Maybe with an #if for compilers which have something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully done by people who actually
own the hardware and can test.
Yes. But if we go that route, then we have to also support a code
path for architectures for which we don't have that support. That's
going to be more work, so I don't want to do it until we have a case
where there is a good, clear benefit.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
lwlock-v1.patchapplication/octet-stream; name=lwlock-v1.patchDownload+351-33
On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.
Well as Tom observed earlier the kernel of a snapshot is actually a
LSN. A snapshot contains a set of xids which all committed before some
LSN and none which committed after it.
So if we had a record of what log sequence number the commit record
for any given transaction is we could build the snapshot at our
leisure without any exclusive lock. In fact we could even build it
lazily as a kind of cache only when we actually are interested in a
given xid.
--
greg
On Sat, Jun 25, 2011 at 8:26 PM, Greg Stark <stark@mit.edu> wrote:
On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:
ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.Well as Tom observed earlier the kernel of a snapshot is actually a
LSN. A snapshot contains a set of xids which all committed before some
LSN and none which committed after it.So if we had a record of what log sequence number the commit record
for any given transaction is we could build the snapshot at our
leisure without any exclusive lock. In fact we could even build it
lazily as a kind of cache only when we actually are interested in a
given xid.
Yeah, I've been thinking about that. I think what we might do is set
up a new SLRU that works like CLOG, but each entry is an LSN rather
than just two bits. When a transaction commits, we save the commit
LSN under the entry for that XID. We truncate away SLRU pages that
contain no still-running XIDs. When we need to check whether an XID
is visible to our snapshot, we just look up the commit LSN and compare
it with our snapshot LSN. If it's before and non-zero, we can see it.
If it's after or all-zeroes, we can't.
But I'm not sure how much this would really help. It might (subject
to working out the details) make the actual process of taking a
snapshot faster. But it's not clear that taking snapshots more
quickly will actually help anything, because the problem is not the
amount of time spending taking the snapshot. The problem is rather
that doing so requires acquiring and releasing an LWLock, and each of
those operations requires taking and releasing a spinlock. And it is
the spinlock contention that is killing us. That makes me think we
need a way to take a snapshot without taking a spinlock. Or if we
must take spinlocks, we at least have to avoid every backend that
needs a snapshot lusting after the *same* spinlock.
What I've been thinking about this weekend is whether it might be
possible to create a sort of lock-free queue infrastructure. When a
backend starts up, it would add an entry to the queue saying "I'm
running". When it commits, it would add an entry to the queue saying
"I'm committed". All entries would be added at the *end* of the
queue, so a backend scanning the queue to build up a snapshot wouldn't
ever be able to see commits out of order. We would need some memory
barrier operations on weak-memory-ordering machines to make sure that
the queue writes became visible before the end-of-queue pointer bump.
The trick is figuring out how to clean up the queue. Since "commit"
entries exist only to guard against "running" entries earlier in the
queue, the start-of-queue pointer can be advanced whenever it points
to a "commit" entry. Also, if it points to a "running" entry for
which there is a later "commit" entry, then the start-of-queue pointer
can be advanced over that as well. However, just because we can
advance the point at which backends start reading doesn't mean that we
can actually recycle space, because while we know that new scans
needn't worry about those entries, we *don't* know that there isn't
already a scan in flight that still needs them. Furthermore, if a
transaction runs for a long time, we can never advance the
start-of-queue pointer past the "running" entry for its XID, which is
obviously problematic since the queue would get very long.
To work around that problem, I think we could use Florian's idea
upthread of an RCU system. We keep two copies of the queue around, an
A copy and a B copy. When the currently active copy fills up, we
rewrite it into the other queue, omitting all "committed" entries and
any "running" entries that have matching "committed" entries, and then
tell everyone to start looking at that copy instead. We would need
some kind of gymnastics to make sure that we don't flip from the A
copy to the B copy and back to the A copy while some laggardly backend
is still hoping to scan the old A copy. A simple algorithm (there's
probably a smarter one) would be to have each backend take a spinlock
while it's scanning either copy, and to have the backend that is doing
the rewrite take and release all of those spinlocks one at a time
before beginning the rewrite, thus guaranteeing that any scans still
in progress when the rewrite is requested have completed before it's
actually performed. Any new scans started in the meanwhile will
certainly be looking at the current copy rather than the old copy
we're about to overwrite.
We would still need a lock around the operation of adding new items to
the queue; if two backends try to do that at the same time, chaos will
ensue. But if that lock gets busy, it would be possible to have
backends use some sort of system of elimination buffers to combine
their requests. If ten backends each want to advertise a new XID, it
will be far faster to have one backend acquire the lock, write in all
ten entries, and wake everyone up than to have each backend insert its
entry and wake up the next one. So what we might do is doing a
condition-acquire on the lock to insert data into the buffer and, if
we can't get it immediately, go find another backend with the same
problem and elect a leader to do all the insertions. If we're the
leader, sleep on the queue-insertion lock. If not, sleep until the
leader wakes us up.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jun23, 2011, at 23:40 , Robert Haas wrote:
I tried rewriting the LWLocks using CAS. It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay. Perhaps fetch-and-add would
be better, but I'm not holding my breath. Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.
I got curious and put a tool together to benchmark this. The code can
be found at https://github.com/fgp/lockbench.
The tool starts N workers who each acquire a lock in SHARE mode, increment
a per-worker counter and release the lock again, rinse, repeat. That is
done for T seconds, after which it reports the sum of the individual
counters, the elapsed (wall) time and the sums of the user and system times
consumed by the workers. The lock implementations currently supported are
atomicinc: RW-Lock using atomic increment/decrement instructions
(locked xaddl) to acquire and release the lock in SHARE
mode in the non-contested case. The contested case is
unimplemented.
cmpxchng: Like atomicinc, but uses "locked cmpxchng" loop instead of
"locked xaddl".
spin: RW-Lock built around a simple cmpxchng-based spinlock (i.e.,
to lock/unlock spinlock is taken, shared-lockers count is
incremented/ decremented, spinlock is released). Similar to
pg_lwlock, but without the sleep() in the contested path of
the spinlock. The contested case of the RW-Lock is again
unimplemented.
none: No locking.
pg_lwlock: Postgres LWLock implementation. The contested case doesn't
work because the workers currently don't have a valid MyProc.
pw_lwlock_cas: Like pg_lwlock but with your CAS patch applied.
On an 4-core Intel Xeon system (Intel Xeon X3430, 2.4 GHz, 8MB cache) I get
the following results:
| 1 worker | 2 workers | 3 workers | 4 workers |
|wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
| / / | / / | / / | / / |
|cycles cycles|cycles cycles|cycles cycles|cycles cycles|
--------------------------------------------------------------------------
none |2.5e-09 2.5e-09|1.3e-09 2.5e-09|8.5e-10 2.5e-09|6.8e-10 2.5e-09|
atomicinc|1.8e-08 1.8e-08|2.8e-08 5.6e-08|2.7e-08 8.1e-08|2.9e-08 1.1e-07|
cmpxchng |3.0e-08 3.0e-08|7.1e-08 1.4e-07|6.9e-08 2.0e-07|7.2e-08 2.9e-07|
spin |5.0e-08 5.0e-08|3.0e-07 5.9e-07|3.8e-07 1.1e-06|4.0e-07 1.5e-06|
pg_lwlock|2.8e-08 2.8e-08|2.9e-08 3.0e-08|3.0e-08 3.2e-08|2.9e-08 3.1e-08|
pg_lwlock|2.6e-08 2.6e-08|6.2e-08 1.2e-07|7.7e-08 2.3e-07|7.8e-08 3.1e-07|
_cas| | | | |
--------------------------------------------------------------------------
These results allow the following conclusions to be drawn
First, our current pg_lwlock is quite good at preserving resources, i.e.
at not spinning excessively - at least for up to four workers. It's the only
lock implementation that consistently uses about 30ns of processor time for
one acquire/release cycle. Only atomicinc with one worker (i.e. no cachline
fighting) manages to beat that. It does, however, effectively serializate
execution with this workload - the consumed user time is approximately equal
to the elapsed wall clock time, even in the case of 4 workers, meaning that
most of the time 3 of those 4 workers are sleeping.
Secondly, pg_lwlock_cas and cmpxchng perform simiarly, which comes at no
surprise, since use effectively the same algorithm. One could expect spin
and pg_lwlock to also perform similarly, but these two don't. The reason
is probably that spin uses a rather brain-dead spinning method, while
pg_lwlock uses "rep; nop".
Now to atomicinc, which is (the fast-path of) the most optimized RW-lock
possible, at least on x86 and x86_64. One interesting result here is that
wall time seconds / cycle are suspiciously for atomicinc and pg_lwlock.
This proves, I think, that the limiting factor in both of these tests is
the speed at which cache line invalidation can occur. It doesn't seem to
matter whether the other cores are idle (as in the pg_lwlock case), or
trying to obtain ownership of the cache line themselves (as in the
atomicinc case).
Finally, the no-locking test (none) shows that, even though the counter
is written to shared memory after every increment, memory bandwith isn't
an issue here, since performance scaled nearly linearly with the number
of workers here.
Here's the result of another run, this time with the per-worker counter
being increment 100 in each acquire/release cycle.
------------------- 100 Increments per Cycle -----------------------------
| 1 worker | 2 workers | 3 workers | 4 workers |
|wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
| / / | / / | / / | / / |
|cycles cycles|cycles cycles|cycles cycles|cycles cycles|
--------------------------------------------------------------------------
none |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|6.5e-08 2.6e-07|
atomicinc|2.5e-07 2.5e-07|1.3e-07 2.6e-07|8.6e-08 2.5e-07|6.5e-08 2.6e-07|
cmpxchng |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.6e-08 2.6e-07|7.8e-08 3.1e-07|
spin |3.0e-07 3.0e-07|1.5e-07 3.0e-07|1.1e-07 3.2e-07|3.4e-07 1.4e-06|
pg_lwlock|2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|1.2e-07 4.8e-07|
pg_lwlock|2.6e-07 2.6e-07|1.6e-07 3.1e-07|1.1e-07 3.2e-07|8.7e-08 3.4e-07|
_cas| | | | |
--------------------------------------------------------------------------
This makes the overhead of the locking disappear in the noise in the
single-worker case. With 4 workers, however, differences materialize.
The no-locking case still shows approximately linear speedup though,
which I think again rules out memory bandwith effects.
Now, atomicinc is the only implementation which doesn't perform
significantly worse than the no-lock case. The closest competitors are
the CAS-based implementation cmpxchng and pg_lwlock_cas, with 20% and
33% slowdown respectively (of both wall seconds and user seconds per cycle).
The current LWLock implementation falls behind now, taking about twice as
many seconds per cycle compared to atomicinc.
In other words, with this workload an atomicinc-based RW-lock, taken in
SHARE mode should add virtually no overhead, while every other
implementation adds at least 20%.
It'd be interesting to get benchmark results also on machines with more
than four cores, but I don't have access to any such machine ATM. So
if someone could run that on a 8 or 16-core beast, that'd be great. The
code compiles on Mac OS X and Ubuntu 10.04. It's fairly portable, to
make the postgres parts compile on other platforms, it might be necessary
to adjust pg_stub/pg_config_manual.h.
best regards,
Florian Pflug
On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <fgp@phlo.org> wrote:
[ testing of various spinlock implementations ]
I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core
loaner from Nate Boley:
100 counter increments per cycle
worker 1 2 4 8 16 32
time wall user wall user wall user wall user wall user wall user
none 2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc 3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng 3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin 4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock 3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06
I wrote a little script to show to reorganize this data in a
possibly-easier-to-understand format - ordering each column from
lowest to highest, and showing each algorithm as a multiple of the
cheapest value for that column:
wall-1: none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
user-1: none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
wall-2: none(1.0),atomicinc(1.7),pg_lwlock(1.8),spin(1.9),pg_lwlock_cas(1.9),cmpxchng(2.3)
user-2: none(1.0),atomicinc(1.7),pg_lwlock(1.8),pg_lwlock_cas(1.9),spin(1.9),cmpxchng(2.3)
wall-4: none(1.0),atomicinc(1.7),pg_lwlock_cas(1.7),pg_lwlock(1.9),spin(2.0),cmpxchng(4.0)
user-4: none(1.0),atomicinc(1.7),pg_lwlock_cas(1.8),pg_lwlock(1.9),spin(2.0),cmpxchng(4.1)
wall-8: none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(6.9),pg_lwlock(9.3),spin(28.6)
user-8: none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(7.0),pg_lwlock(9.4),spin(28.5)
wall-16: none(1.0),atomicinc(7.1),pg_lwlock_cas(9.0),cmpxchng(20.0),pg_lwlock(76.2),spin(181.0)
user-16: none(1.0),atomicinc(7.0),pg_lwlock_cas(9.1),cmpxchng(20.0),pg_lwlock(75.8),spin(184.8)
wall-32: none(1.0),atomicinc(13.6),pg_lwlock_cas(21.8),cmpxchng(40.9),pg_lwlock(581.8),spin(1272.7)
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
Here's the result of the same calculation applied to your second set of data:
wall-1: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
user-1: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
wall-2: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),spin(1.2),pg_lwlock_cas(1.2)
user-2: none(1.0),cmpxchng(1.0),pg_lwlock(1.0),atomicinc(1.0),spin(1.2),pg_lwlock_cas(1.2)
wall-3: none(1.0),pg_lwlock(1.0),atomicinc(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
user-3: none(1.0),atomicinc(1.0),pg_lwlock(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
wall-4: none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.2)
user-4: none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.4)
There seems to be something a bit funky in your 3-core data, but
overall I read this data to indicate that 4 cores aren't really enough
to see a severe problem with spinlock contention. I'm not even sure
that you can see this problem on a real workload on a 16-core system.
But as you add cores the problem gets rapidly worse, because the more
cores you have, the more likely it is that someone else is already
holding the spinlock (or has just very recently held the spinlock)
that you want. And, of course, the problem multiplies itself, because
once your lock acquistions start taking longer, they become a larger
percentage of your execution time, which makes it proportionately more
likely that you'll find someone already there when you go to grab the
lock.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contentionl?
merlin
On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contentionl?
Well, you'd think so, but in fact that patch makes it slower. Don't
ask me why, 'cuz I dunno. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jun28, 2011, at 23:48 , Robert Haas wrote:
On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contentionl?Well, you'd think so, but in fact that patch makes it slower. Don't
ask me why, 'cuz I dunno. :-(
Huh? Where do you see your CAS patch being slower than unpatched LWLocks
in these results? Or are you referring to pgbench runs you made, not
to these artificial benchmarks?
best regards,
Florian Pflug
On Tue, Jun 28, 2011 at 5:55 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jun28, 2011, at 23:48 , Robert Haas wrote:
On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)
I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contentionl?Well, you'd think so, but in fact that patch makes it slower. Don't
ask me why, 'cuz I dunno. :-(Huh? Where do you see your CAS patch being slower than unpatched LWLocks
in these results? Or are you referring to pgbench runs you made, not
to these artificial benchmarks?
pgbench -S
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jun28, 2011, at 22:18 , Robert Haas wrote:
On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <fgp@phlo.org> wrote:
[ testing of various spinlock implementations ]
I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core
loaner from Nate Boley:
Cool, thanks!
100 counter increments per cycle
worker 1 2 4 8 16 32
time wall user wall user wall user wall user wall user wall user
none 2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc 3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng 3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin 4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock 3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06
Here's the same table, formatted with spaces.
worker 1 2 4 8 16 32
time wall user wall user wall user wall user wall user wall user
none 2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc 3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng 3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin 4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock 3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06
And here's the throughput table calculated from your results,
i.e. the wall time per cycle relative to the wall time per cycle
for 1 worker.
workers 2 4 8 16 32
none 1.9 3.5 6.7 13 26
atomicinc 1.4 2.6 2.6 2.4 2.4
cmpxchng 1.1 1.1 1.2 0.9 0.8
spin 1.5 2.6 0.3 0.1 0.0
pg_lwlock 1.4 2.5 1.0 0.2 0.1
pg_lwlock_cas 1.3 2.6 2.6 1.9 1.5
Hm, so in the best case we get 2.6x the throughput of a single core,
and that only for 4 and 8 workers (1.4e-7 second / cycle vs 3.6e-7).
In that case, there also seems to be little difference between
pg_lwlock{_cas} and atomicinc. atomicinc again manages to at least
sustain that throughput when the worker count is increased, while
for for the others the throughput actually *decreases*.
What totally puzzles me is that your results don't show any
trace of a decreased system load for the pg_lwlock implementation,
which I'd have expected due to the sleep() in the contested
path. Here are the user vs. wall time ratios - I'd have expected
to see value significantly below the number of workers for pg_lwlock
none 1.0 2.0 4.0 7.9 16 31
atomicinc 1.0 2.0 3.9 7.9 15 33
cmpxchng 1.0 2.0 4.1 7.9 16 31
spin 1.0 2.0 3.9 7.8 16 31
pg_lwlock 1.0 2.0 4.1 7.9 16 31
pg_lwlock_cas 1.0 2.0 4.1 7.9 16 31
I wrote a little script to show to reorganize this data in a
possibly-easier-to-understand format - ordering each column from
lowest to highest, and showing each algorithm as a multiple of the
cheapest value for that column:
If you're OK with that, I'd like to add that to the lockbench
repo.
There seems to be something a bit funky in your 3-core data, but
overall I read this data to indicate that 4 cores aren't really enough
to see a severe problem with spinlock contention.
Hm, it starts to show if you lower the counter increment per cycle
(the D constant in run.sh). But yeah, it's never as bad as the
32-core results above..
best regards,
Florian Pflug
On Tue, Jun 28, 2011 at 8:11 PM, Florian Pflug <fgp@phlo.org> wrote:
I wrote a little script to show to reorganize this data in a
possibly-easier-to-understand format - ordering each column from
lowest to highest, and showing each algorithm as a multiple of the
cheapest value for that column:If you're OK with that, I'd like to add that to the lockbench
repo.
It's a piece of crap, but you're welcome to try to fix it up. See attached.
With respect to the apparent lack of impact of the sleep loop, I can
only speculate that this test case doesn't really mimic real-world
conditions inside the backend very well. There is a lot more going on
there - multiple locks, lots of memory access, etc. But I don't
really understand it either.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jun12, 2011, at 23:39 , Robert Haas wrote:
So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.
I did some benchmarking, on the 32-core system from Nate Boley, with
pgbench -n -S -c 80 -j 80. With master+fastlock+lazyvxid, I can hit
180-200k TPS in the 32-client range. But at 80 clients throughput
starts to fall off quite a bit, dropping down to about 80k TPS. So
then, just for giggles, I inserted a "return;" statement at the top of
AcceptInvalidationMessages(), turning it into a noop. Performance at
80 clients shot up to 210k TPS. I also tried inserting an
acquire-and-release cycle on MyProc->backendLock, which was just as
good. That seems like a pretty encouraging result, since there appear
to be several ways of reimplementing SIGetDataEntries() that would
work with a per-backend lock rather than a global one.
I did some other experiments, too. In the hopes of finding a general
way to reduce spinlock contention, I implemented a set of "elimination
buffers", where backends that can't get a spinlock go and try to
combine their requests and then send a designated representative to
acquire the spinlock and acquire shared locks simultaneously for all
group members. It took me a while to debug the code, and I still
can't get it to do anything other than reduce performance, which may
mean that I haven't found all the bugs yet, but I'm not feeling
encouraged. Some poking around suggests that the problem isn't that
spinlocks are routinely contended - it seems that we nearly always get
the spinlock right off the bat. I'm wondering if the problem may be
not so much that we have continuous spinlock contention, but rather
than every once in a while a process gets time-sliced out while it
holds a spinlock. If it's an important spinlock (like the one
protecting SInvalReadLock), the system will quickly evolve into a
state where every single processor is doing nothing but trying to get
that spinlock. Even after the hapless lock-holder gets to run again
and lets go of the lock, you have a whole pile of other backends who
are sitting there firing of lock xchgb in a tight loop, and they can
only get it one at a time, so you have ferocious cache line contention
until the backlog clears. Then things are OK again for a bit until
the same thing happens to some other backend. This is just a theory,
I might be totally wrong...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Jul7, 2011, at 03:35 , Robert Haas wrote:
On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:
On Jun12, 2011, at 23:39 , Robert Haas wrote:
So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.I also tried inserting an
acquire-and-release cycle on MyProc->backendLock, which was just as
good. That seems like a pretty encouraging result, since there appear
to be several ways of reimplementing SIGetDataEntries() that would
work with a per-backend lock rather than a global one.
I did some research and found "TLRW: Return of the Read-Write Lock"
by D. Dice and N. Shavit. PDF can be found here
http://transact09.cs.washington.edu/19_paper.pdf
They describe a read-write lock called "bytelock" with set of "slots"
for shared lockers, where each shared locker only needs to increment his
slot, and thus doesn't interfere with other concurrent shared locking
attempts. The price is that, after incrementing its slot, a shared
locker must issue a fencing instruction and then verify that no writer
has taken the lock in the mean while. In their application, they
distinguish between "slotted" threads - those who own a designated
slot, and "unslotted" thread - those who operation on the normal
shared counter.
I implemented that idea, but with a few modifications. First, I don't
distinguish between "slotted" and "unslotted" threads, but allow
multiple backends to share a slot. This means my implementation cannot
use a simple unlocked increment to update the slot, but instead uses
"locked xadd". I also moved the slots out of the lock itself, and into
a separate set of LWLockPart structures. Each such structure contains
the counters for one slot id and multiple locks.
In effect, the resulting thing is an LWLock with a partitioned shared
counter. The partition one backend operates on for shared locks is
determined by its backend id.
I've added the implementation to the lock benchmarking tool at
https://github.com/fgp/lockbench
and also pushed a patched version of postgres to
https://github.com/fgp/postgres/tree/lwlock_part
The number of shared counter partitions is current 4, but can easily
be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
intrinsic if available, otherwise it falls back to using a per-partition
spin lock.
Benchmarking results look promising so far. This is the through-put
I get, in acquisition/release cycles per second, for the LWLock
implementations on a 4-core machine. pg_lwlock_part,N,X is the
partitioned lock with N lock partitions and using LOCKED XADD if
X == yes. The numbers are normalized to the through-put in the
single-worker case.
---------- Cycles / Second (Wall-Time) ---------------------
1 2 4 8 16 worker
wall wall wall wall wall time
1 1.9 3.9 3.9 3.9 none
1 0.2 0.9 0.8 0.6 pg_lwlock_orig
1 0.4 0.3 0.3 0.3 pg_lwlock_cas
1 0.3 1.0 0.8 0.6 pg_lwlock_part,1,no
1 0.4 0.4 0.4 0.4 pg_lwlock_part,1,yes
1 2.0 0.4 0.4 0.3 pg_lwlock_part,2,no
1 2.0 0.7 0.8 0.7 pg_lwlock_part,2,yes
1 2.0 4.1 2.1 1.0 pg_lwlock_part,4,no
1 2.1 3.8 1.8 2.2 pg_lwlock_part,4,yes
These numbers show that as long is the number of workers does not
exceed the number of lock partitions, throughput increases approximately
linearly. It drops off afterwards, but I that's to be expected -
the test executes LQLockAcquire/Release in a tight loop, so once there's
any kind of contention (even cache-line bouncing), performance drops.
This is also why the original LWLock beats CAS and the partitioned
lock with less than 4 partitions here - effectively, at least half of
the workers are sleeping at any given time, as the following
user vs. wall time numbers show.
---------- User-Time / Wall-Time ----------------------------
1 2 4 8 16 worker
1.0 1.9 3.9 3.9 3.9 none
1.0 1.9 1.1 1.3 1.8 pg_lwlock_orig
1.0 2.0 4.0 4.0 4.0 pg_lwlock_cas
1.0 1.9 1.1 1.4 1.8 pg_lwlock_part,1,no
1.0 2.0 4.0 3.9 4.0 pg_lwlock_part,1,yes
1.0 2.0 3.6 3.8 3.6 pg_lwlock_part,2,no
1.0 2.0 3.8 3.9 3.9 pg_lwlock_part,2,yes
1.0 2.0 4.1 4.1 3.9 pg_lwlock_part,4,no
1.0 2.0 4.0 3.9 3.9 pg_lwlock_part,4,yes
I lack the hardware to produce any meaningful benchmark results
with pgbench for this, so again I'd be very cool if someone
(Robert? ;-) who has access to that kind of hardware could give the
patched version from github a spin with pgbench.
The patched version is current set of 4 shared counter partitions,
and should use LOCKED XADD if compiled with GCC >= 4.1 (or ICC).
I'm wondering if the problem may be
not so much that we have continuous spinlock contention, but rather
than every once in a while a process gets time-sliced out while it
holds a spinlock. If it's an important spinlock (like the one
protecting SInvalReadLock), the system will quickly evolve into a
state where every single processor is doing nothing but trying to get
that spinlock. Even after the hapless lock-holder gets to run again
and lets go of the lock, you have a whole pile of other backends who
are sitting there firing of lock xchgb in a tight loop, and they can
only get it one at a time, so you have ferocious cache line contention
until the backlog clears. Then things are OK again for a bit until
the same thing happens to some other backend. This is just a theory,
I might be totally wrong...
Hm, but one would expect most of the spin lock contesters to have
given up and gone to sleep by the time the interrupted lock holder
resumes. If that is not what happens, then maybe we should revisit
the logic in SpinLockAcquire()...
best regards,
Florian Pflug