spinlock contention

Started by Robert Haasover 14 years ago33 messages

robertmhaas@gmail.com

over 14 years ago

On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jun12, 2011, at 23:39 , Robert Haas wrote:

So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).

Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.

Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...

Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

SInvalReadLock is a good example of a lock which seems barely to be
necessary at all. Except on extraordinarily rare occasions, it is
only ever taken in share mode. Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy. I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether. I have a feeling if I
were 25% smarter I could do it, but I'm not.

So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read. We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing. I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written. What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read. When a message is written, the global counter is bumped. When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.

There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic. In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough. We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces). When a backend compares its counter to the
global counter, it does so while holding this lock. When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1). So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.

If anyone has read this far, you have my undying gratitude....
especially if you respond with comments.

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 14 years ago

In reply to: Robert Haas (#1)

Re: spinlock contention

On 23.06.2011 18:42, Robert Haas wrote:

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

ProcArrayLock is currently held for a relatively long period of time
when a snapshot is taken, especially if there's a lot of backends. There
are three operations to the procarray:

1. Advertising a new xid belonging to my backend.
2. Ending a transaction.
3. Getting a snapshot.

Advertising a new xid is currently done without a lock, assuming that
setting an xid in shared memory is atomic. To end a transaction, you
acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire
ProcArrayLock in shared mode, and scan the whole procarray.

I wonder if it would be a better tradeoff to keep a "materialized"
snapshot in shared memory that's updated every time a new xid is
assigned or a transaction ends. Getting a snapshot would become much
cheaper, as you could just memcpy the ready-built snapshot from shared
memory into local memory.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#1)

Re: spinlock contention

On Jun23, 2011, at 17:42 , Robert Haas wrote:

On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jun12, 2011, at 23:39 , Robert Haas wrote:

So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).

Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.

Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...

Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bit, though -
so measuring this certainly isn't easy. Also, I'm absolute no
expert for this, so it might very well be that' I'm missing something
fundamental.

SInvalReadLock is a good example of a lock which seems barely to be
necessary at all. Except on extraordinarily rare occasions, it is
only ever taken in share mode.

This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

It's an userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torwalds as a response
to the following thread

http://lwn.net/Articles/284741/

It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exlcusive lock (again in
the uncontested case).

The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able to) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.

Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy. I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether. I have a feeling if I
were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).

The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.

Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.

So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read. We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing. I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written. What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read. When a message is written, the global counter is bumped. When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.

Sounds sensible. I do fear, however, that if we start to micro-optimize
around every single LWLock we're in for quite a maintenance burden in
the future. Even if each single modification is relatively simple, the
complexity still adds ups. So I still believe that it'd be better to
start with optimizing LWLocks in general, and to only touch the LWLocks
which are still hot spots afterwards.

There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as

#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
#define MemoryBarrierReadWrite \
do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }

But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.

A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic. In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough. We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces). When a backend compares its counter to the
global counter, it does so while holding this lock. When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1). So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.

OTOH, don't all architectures that are interesting targets of
performance tuning provide atomic access to 64-bit values? Even i386
has an 8-byte compare-and-exchange instruction I think...

best regards,
Florian Pflug

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Heikki Linnakangas (#2)

Re: spinlock contention

On Thu, Jun 23, 2011 at 12:19 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 23.06.2011 18:42, Robert Haas wrote:

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

ProcArrayLock is currently held for a relatively long period of time when a
snapshot is taken, especially if there's a lot of backends. There are three
operations to the procarray:

1. Advertising a new xid belonging to my backend.
2. Ending a transaction.
3. Getting a snapshot.

Advertising a new xid is currently done without a lock, assuming that
setting an xid in shared memory is atomic.

The issue isn't just whether it's atomic, but whether some other
backend scanning the ProcArray just afterwards might fail to see the
update due to memory-ordering effects. But I think even if they do,
there's no real harm done. Since we're holding XidGenLock, we know
that the XID we are advertising is higher than any XID previously
advertised, and in particular it must be higher than
latestCompletedXid, so GetSnapshotdata() will ignore it anyway.

I am rather doubtful that this code is strictly safe:

myproc->subxids.xids[nxids] = xid;
myproc->subxids.nxids = nxids + 1;

In practice, the window for a race condition is minimal, because (a)
in many cases, nxids will be on the same cache line as the xid being
set; (b) even if it's not, we almost immediately release XidGenLock,
which acts as a memory barrier; (c) on many common architectures, such
as x86, stores are guaranteed to become visible in execution order;
(d) if, on some architecture, you manged to see the stores out of
order and thus loaded a garbage XID from the end of the array, it
might happen to be zero (which would be ignored, as a non-normal
transaction ID) or the transaction might happen never to examine a
tuple with that XID anyway.

To end a transaction, you acquire
ProcArrayLock in exclusive mode. To get a snapshot, you acquire
ProcArrayLock in shared mode, and scan the whole procarray.

I wonder if it would be a better tradeoff to keep a "materialized" snapshot
in shared memory that's updated every time a new xid is assigned or a
transaction ends. Getting a snapshot would become much cheaper, as you could
just memcpy the ready-built snapshot from shared memory into local memory.

At least in the case of the pgbench -S workload, I doubt that would
help at all. The problem isn't that the LWLocks are being held for
too long -- they are all being held in shared mode and don't conflict
anyway. The problem is that everyone has to acquire and release the
spinlock in order to acquire the LWLock, and again to release it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Merlin Moncure

mmoncure@gmail.com

over 14 years ago

In reply to: Florian Pflug (#3)

Re: spinlock contention

On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jun23, 2011, at 17:42 , Robert Haas wrote:

On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jun12, 2011, at 23:39 , Robert Haas wrote:

So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).

Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.

Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...

Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bit, though -
so measuring this certainly isn't easy. Also, I'm absolute no
expert for this, so it might very well be that' I'm missing something
fundamental.

SInvalReadLock is a good example of a lock which seems barely to be
necessary at all. Except on extraordinarily rare occasions, it is
only ever taken in share mode.

This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

It's an userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torwalds as a response
to the following thread

http://lwn.net/Articles/284741/

It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exlcusive lock (again in
the uncontested case).

The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able to) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.

Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy. I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether. I have a feeling if I
were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).

The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.

Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.

So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read. We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing. I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written. What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read. When a message is written, the global counter is bumped. When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.

Sounds sensible. I do fear, however, that if we start to micro-optimize
around every single LWLock we're in for quite a maintenance burden in
the future. Even if each single modification is relatively simple, the
complexity still adds ups. So I still believe that it'd be better to
start with optimizing LWLocks in general, and to only touch the LWLocks
which are still hot spots afterwards.

There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as

#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
#define MemoryBarrierReadWrite \
do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }

But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.

A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic. In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough. We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces). When a backend compares its counter to the
global counter, it does so while holding this lock. When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1). So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.

OTOH, don't all architectures that are interesting targets of
performance tuning provide atomic access to 64-bit values? Even i386
has an 8-byte compare-and-exchange instruction I think...

yup, 'lock cmpxchg8b'.
see: http://www.niallryan.com/node/137 for a way to do it via fpu
without a loop. the 'lock' is a bus lock.

merlin

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#3)

Re: spinlock contention

On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote:

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bit, though -
so measuring this certainly isn't easy. Also, I'm absolute no
expert for this, so it might very well be that' I'm missing something
fundamental.

Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one. Again, please feel free
to provide contrary evidence. I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x. (The less busy lock may get a larger reduction
but that may not help you much.) If you what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle. That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.

SInvalReadLock is a good example of a lock which seems barely to e
necessary at all. Except on extraordinarily rare occasions, it is
only ever taken in share mode.

This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

It's an userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torwalds as a response
to the following thread

http://lwn.net/Articles/284741/

It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exlcusive lock (again in
the uncontested case).

The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want (or be able to) to use this directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.

I tried rewriting the LWLocks using CAS. It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay. Perhaps fetch-and-add would
be better, but I'm not holding my breath. Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

Only SICleanupQueue() needs to lock
out readers, and in the (probably common) case where all backends are
reasonably close to being caught up, it wouldn't even matter if the
determination of minMsgNum were a little bit fuzzy. I've been
straining my brain trying to figure out whether there's some way to
rewrite this logic so that it can run concurrently with readers; then
we could get rid of SInvalReadLock altogether. I have a feeling if I
were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).

The hard part is usually to figure when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.

Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.

Interesting idea.

So my fallback position is to try to find some way to optimize the
common case where there are no messages to be read. We're already
allowing readers and writers to run in parallel, so it must not be
important for a reader to see a message that a writer is in the
process of writing, but has not yet finished writing. I believe the
idea here is that sinval messages should be emitted before releasing
the related locks, so if we've successfully acquired a lock on an
object, then any pertinent sinval messages must already be completely
written. What I'm thinking we could do is just keep a global counter
indicating how many messages have been written, and each backend could
keep a counter indicating the highest-numbered message it has so far
read. When a message is written, the global counter is bumped. When
a backend goes to AcceptInvalidationMessages(), it compares the global
counter to its own counter, and if they're the same, then it doesn't
bother taking SInvalReadLock and just exits quickly.

Sounds sensible. I do fear, however, that if we start to micro-optimize
around every single LWLock we're in for quite a maintenance burden in
the future. Even if each single modification is relatively simple, the
complexity still adds ups. So I still believe that it'd be better to
start with optimizing LWLocks in general, and to only touch the LWLocks
which are still hot spots afterwards.

I was hoping to do it that way, too, but I have so far been able to
demonstrate a performance benefit from that approach.

There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as

#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
#define MemoryBarrierReadWrite \
do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }

But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.

Yes, a set of general memory barrier primitives would be handy.
Unfortunately, it would probably be quite a bit of work to develop
this and verify that it works properly on all supported platforms.

A possible objection here is that a 32-bit counter might wrap around,
giving a false negative, and a read from a 64-bit counter might not be
atomic. In practice, the chances of a problem here seem remote, but I
think we can guard against it easily enough. We can put each
backend's counter of messages read into shared memory, protected by a
per-backend lock (probably the same LWLock the fast relation lock
patch already introduces). When a backend compares its counter to the
global counter, it does so while holding this lock. When
SICleanupQueue() resets a backend, it grabs the lock and sets that
backend's counter to a special value (like 0) that we guarantee will
never match the global counter (which we can cause to wrap from 2^32-1
to 1). So then AcceptInvalidationMessages() - in the common case
where there are no invalidation messages to process - will only need
the per-backend LWLock rather than fighting over SInvalLock.

OTOH, don't all architectures that are interesting targets of
performance tuning provide atomic access to 64-bit values? Even i386
has an 8-byte compare-and-exchange instruction I think...

Maybe so... but if I can get by without needing that, so much the
better. I don't really want to have to have a separate code path for
architectures that don't support that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#6)

Re: spinlock contention

On Jun23, 2011, at 22:15 , Robert Haas wrote:

On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote:

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are 32 bit, though -
so measuring this certainly isn't easy. Also, I'm absolute no
expert for this, so it might very well be that' I'm missing something
fundamental.

Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one. Again, please feel free
to provide contrary evidence. I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x. (The less busy lock may get a larger reduction
but that may not help you much.) If you what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.

Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contested LWLocks for the others might still be
worthwhile.

That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.

I tried rewriting the LWLocks using CAS. It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay. Perhaps fetch-and-add would
be better, but I'm not holding my breath. Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
while (atomic_inc_post(lock) > 0) {
atomic_dec(lock);
block(lock);
}
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).

There is an obvious (potential) memory-ordering problem here. If it's
possible for a backend doing AcceptInvalidationMessages() to see a
stale version of the counter, then it might fail to read messages from
the queue that it really ought to have seen. Protecting the global
counter with a spinlock defeats the purpose of doing any of this in
the first place, because the whole problem is too many people fighting
over the spinlock protecting SInvalReadLock. It should be sufficient,
though, for the writing backend to execute a memory-barrier operation
just after bumping the global counter and the reading backend to
execute one just before. As it happens, LWLockRelease() is such a
barrier, since it must acquire and release a spinlock, so we should be
able to just bump the counter right before releasing the LWLock and
call it good. On the reading side, the immediately-preceding lock
acquisition seems like it ought to be sufficient - surely it can't be
possible to acquire a heavyweight lock on an object without seeing
memory updates that some other process made before releasing the same
heavyweight lock.

If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as

#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); }
#define MemoryBarrierReadWrite \
do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); }

But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.

Yes, a set of general memory barrier primitives would be handy.
Unfortunately, it would probably be quite a bit of work to develop
this and verify that it works properly on all supported platforms.

The idea would be to start out with something trivial like the above.
Maybe with an #if for compilers which have something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully done by people who actually
own the hardware and can test.

best regards,
Florian Pflug

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#7)

1 attachment(s)

Re: spinlock contention

On Thu, Jun 23, 2011 at 5:35 PM, Florian Pflug <fgp@phlo.org> wrote:

Well, I'm sure there is some effect, but my experiments seem to
indicate that it's not a very important one. Again, please feel free
to provide contrary evidence. I think the basic issue is that - in
the best possible case - padding the LWLocks so that you don't have
two locks sharing a cache line can reduce contention on the busier
lock by at most 2x. (The less busy lock may get a larger reduction
but that may not help you much.) If you what you really need is for
the contention to decrease by 1000x, you're just not really moving the
needle.

Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contested LWLocks for the others might still be
worthwhile.

Hey, if we can show that it works, sign me up.

That's why the basic fast-relation-lock patch helps so much:
it replaces a system where every lock request results in contention
with a system where NONE of them do.

I tried rewriting the LWLocks using CAS. It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay. Perhaps fetch-and-add would
be better, but I'm not holding my breath. Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
while (atomic_inc_post(lock) > 0) {
atomic_dec(lock);
block(lock);
}
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).

Attached...

The idea would be to start out with something trivial like the above.
Maybe with an #if for compilers which have something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully done by people who actually
own the hardware and can test.

Yes. But if we go that route, then we have to also support a code
path for architectures for which we don't have that support. That's
going to be more work, so I don't want to do it until we have a case
where there is a good, clear benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

lwlock-v1.patchapplication/octet-stream; name=lwlock-v1.patchDownload

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 32cb819..dae33c6 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -27,6 +27,7 @@
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "storage/atomic.h"
 #include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/spin.h"
@@ -36,18 +37,58 @@
 extern slock_t *ShmemLock;
 
 
+/*
+ * If compare-and-swap is not available, the mutex protects the entire LWLock.
+ * But when compare-and-swap is available, the state is manipulated using
+ * cas_int32() and only the remaining members are protected by the mutex.
+ */
 typedef struct LWLock
 {
-	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
+	slock_t		mutex;			/* See notes above */
+	int32		state;			/* Lock state */
 	bool		releaseOK;		/* T if ok to release waiters */
-	char		exclusive;		/* # of exclusive holders (0 or 1) */
-	int			shared;			/* # of shared holders (0..MaxBackends) */
 	PGPROC	   *head;			/* head of list of waiting PGPROCs */
 	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
 	/* tail is undefined when head is NULL */
 } LWLock;
 
 /*
+ * A lightweight lock can be unlocked, exclusively locked (by exactly one
+ * locker), or share-locked (by one or more lockers).  We maintain the state of
+ * the lock in a single integer variable so that it can be atomically
+ * updated using compare-and-swap, where available.
+ */
+#define			LWSTATE_UNLOCKED			0
+#define			LWSTATE_EXCLUSIVE			(-2)
+#define			LWSTATE_SHARED				2
+#define			LWSTATE_IS_EXCLUSIVE(n)		((n) < 0)
+#define			LWSTATE_IS_SHARED(n)		((n) > 1)
+#ifdef HAVE_COMPARE_AND_SWAP
+#define			LWSTATE_IS_UNLOCKED(n)		((n & ~1) == 0)
+#else
+#define			LWSTATE_IS_UNLOCKED(n)		((n) == 0)
+#endif
+
+#ifdef HAVE_COMPARE_AND_SWAP
+/*
+ * When HAVE_COMPARE_AND_SWAP is set, we use the low bit of the state to
+ * track whether waiters are present.  This allows the lwlock to be released
+ * without taking the mutex, unless there are actually waiters that need to
+ * be woken up.
+ */
+#define			LWSTATE_WAITERS				1
+#define			LWSTATE_HAS_WAITERS(n)		((n) & 1)
+/* Last shared or exclusive lock requires a wakeup. */
+#define			LWSTATE_NEEDS_WAKEUP(n)		((n) == -1 || (n) == 3)
+
+static bool LWLockCASAcquire(volatile LWLock *lock, LWLockMode mode,
+				 int wait_ok);
+static bool LWLockCASRelease(volatile LWLock *lock, LWLockMode mode,
+				 int wake_ok);
+static void LWLockCASClearWaitMark(volatile LWLock *lock);
+#endif
+
+/*
  * All the LWLock structs are allocated as an array in shared memory.
  * (LWLockIds are indexes into the array.)	We force the array stride to
  * be a power of 2, which saves a few cycles in indexing, but more
@@ -85,6 +126,9 @@ NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
 
 static int	num_held_lwlocks = 0;
 static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
+#ifdef HAVE_COMPARE_AND_SWAP
+static LWLockMode held_lwlock_modes[MAX_SIMUL_LWLOCKS];
+#endif
 
 static int	lock_addin_request = 0;
 static bool lock_addin_request_allowed = true;
@@ -260,8 +304,7 @@ CreateLWLocks(void)
 	{
 		SpinLockInit(&lock->lock.mutex);
 		lock->lock.releaseOK = true;
-		lock->lock.exclusive = 0;
-		lock->lock.shared = 0;
+		lock->lock.state = LWSTATE_UNLOCKED;
 		lock->lock.head = NULL;
 		lock->lock.tail = NULL;
 	}
@@ -360,6 +403,15 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	 */
 	HOLD_INTERRUPTS();
 
+#ifdef HAVE_COMPARE_AND_SWAP
+	/*
+	 * If we have compare-and-swap, we can try to grap the lock using an atomic
+	 * operation, without actually taking the mutex lock.
+	 */
+	if (LWLockCASAcquire(lock, mode, false))
+		goto lock_acquired;
+#endif
+
 	/*
 	 * Loop here to try to acquire lock after each time we are signaled by
 	 * LWLockRelease.
@@ -378,7 +430,7 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	 */
 	for (;;)
 	{
-		bool		mustwait;
+		bool		acquired;
 
 		/* Acquire mutex.  Time spent holding mutex should be short! */
 		SpinLockAcquire(&lock->mutex);
@@ -387,30 +439,36 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 		if (retry)
 			lock->releaseOK = true;
 
+#ifdef HAVE_COMPARE_AND_SWAP
+		/* Either acquire the mutex, or mark it as having waiters. */
+		acquired = LWLockCASAcquire(lock, mode, true);
+#else
+
 		/* If I can get the lock, do so quickly. */
 		if (mode == LW_EXCLUSIVE)
 		{
-			if (lock->exclusive == 0 && lock->shared == 0)
+			if (LWSTATE_IS_UNLOCKED(lock->state))
 			{
-				lock->exclusive++;
-				mustwait = false;
+				lock->state = LWSTATE_EXCLUSIVE;
+				acquired = true;
 			}
 			else
-				mustwait = true;
+				acquired = false;
 		}
 		else
 		{
-			if (lock->exclusive == 0)
+			if (!LWSTATE_IS_EXCLUSIVE(lock->state))
 			{
-				lock->shared++;
-				mustwait = false;
+				lock->state += LWSTATE_SHARED;
+				acquired = true;
 			}
 			else
-				mustwait = true;
+				acquired = false;
 		}
+#endif
 
-		if (!mustwait)
-			break;				/* got the lock */
+		if (acquired)
+			break;								/* got the lock */
 
 		/*
 		 * Add myself to wait queue.
@@ -474,9 +532,17 @@ LWLockAcquire(LWLockId lockid, LWLockMode mode)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
+#ifdef HAVE_COMPARE_AND_SWAP
+	/* The mutex-free fast-path jumps here if it acquires the lock. */
+lock_acquired:
+#endif
+
 	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(lockid, mode);
 
 	/* Add lock to list of locks held by this backend */
+#ifdef HAVE_COMPARE_AND_SWAP
+	held_lwlock_modes[num_held_lwlocks] = mode;
+#endif
 	held_lwlocks[num_held_lwlocks++] = lockid;
 
 	/*
@@ -497,7 +563,7 @@ bool
 LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
 {
 	volatile LWLock *lock = &(LWLockArray[lockid].lock);
-	bool		mustwait;
+	bool		acquired;
 
 	PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock);
 
@@ -512,35 +578,50 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
 	 */
 	HOLD_INTERRUPTS();
 
-	/* Acquire mutex.  Time spent holding mutex should be short! */
+#ifdef HAVE_COMPARE_AND_SWAP
+
+	/*
+	 * If we're using compare-and-swap, LWLockCASAcquire() does all the heavy
+	 * lifting on our behalf.
+	 */
+	acquired = LWLockCASAcquire(lock, mode, false);
+
+#else
+
+	/*
+	 * We don't have compare-and-swap, and must acquire the mutex to check
+	 * and update the lock state.  Time spent holding mutex should be short!
+	 */
 	SpinLockAcquire(&lock->mutex);
 
 	/* If I can get the lock, do so quickly. */
 	if (mode == LW_EXCLUSIVE)
 	{
-		if (lock->exclusive == 0 && lock->shared == 0)
+		if (LWSTATE_IS_UNLOCKED(lock->state))
 		{
-			lock->exclusive++;
-			mustwait = false;
+			lock->state = LWSTATE_EXCLUSIVE;
+			acquired = true;
 		}
 		else
-			mustwait = true;
+			acquired = false;
 	}
 	else
 	{
-		if (lock->exclusive == 0)
+		if (LWSTATE_IS_UNLOCKED(lock->state))
 		{
-			lock->shared++;
-			mustwait = false;
+			lock->state += LWSTATE_SHARED;
+			acquired = true;
 		}
 		else
-			mustwait = true;
+			acquired = false;
 	}
 
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
-	if (mustwait)
+#endif
+
+	if (!acquired)
 	{
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
@@ -550,11 +631,14 @@ LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
 	else
 	{
 		/* Add lock to list of locks held by this backend */
+#ifdef HAVE_COMPARE_AND_SWAP
+		held_lwlock_modes[num_held_lwlocks] = mode;
+#endif
 		held_lwlocks[num_held_lwlocks++] = lockid;
 		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(lockid, mode);
 	}
 
-	return !mustwait;
+	return acquired;
 }
 
 /*
@@ -567,6 +651,10 @@ LWLockRelease(LWLockId lockid)
 	PGPROC	   *head;
 	PGPROC	   *proc;
 	int			i;
+	bool		wakeup;
+#ifdef HAVE_COMPARE_AND_SWAP
+	LWLockMode	mode;
+#endif
 
 	PRINT_LWDEBUG("LWLockRelease", lockid, lock);
 
@@ -581,21 +669,47 @@ LWLockRelease(LWLockId lockid)
 	}
 	if (i < 0)
 		elog(ERROR, "lock %d is not held", (int) lockid);
+#ifdef HAVE_COMPARE_AND_SWAP
+	mode = held_lwlock_modes[i];
+#endif
 	num_held_lwlocks--;
 	for (; i < num_held_lwlocks; i++)
+	{
 		held_lwlocks[i] = held_lwlocks[i + 1];
+#ifdef HAVE_COMPARE_AND_SWAP
+		held_lwlock_modes[i] = held_lwlock_modes[i + 1];
+#endif
+	}
+
+#ifdef HAVE_COMPARE_AND_SWAP
+	/*
+	 * If no waiters need to be awoken, we can drop the LWLock without needing
+	 * to acquire the spinlock.
+	 */
+	if (!LWLockCASRelease(lock, mode, false))
+	{
+		head = NULL;
+		goto fast_exit;
+	}
+#endif
 
 	/* Acquire mutex.  Time spent holding mutex should be short! */
 	SpinLockAcquire(&lock->mutex);
 
+#ifdef HAVE_COMPARE_AND_SWAP
+	/* LWLockCASRelease does all the hard work. */
+	wakeup = LWLockCASRelease(lock, mode, true);
+#else
 	/* Release my hold on lock */
-	if (lock->exclusive > 0)
-		lock->exclusive--;
+	if (LWSTATE_IS_EXCLUSIVE(lock->state))
+		lock->state = LWSTATE_UNLOCKED;
 	else
 	{
-		Assert(lock->shared > 0);
-		lock->shared--;
+		Assert(LWSTATE_IS_SHARED(lock->state));
+		lock->state -= LWSTATE_SHARE_INCREMENT;
 	}
+	wakeup = LWSTATE_IS_UNLOCKED(lock->state);
+#endif
 
 	/*
 	 * See if I need to awaken any waiters.  If I released a non-last shared
@@ -606,7 +720,7 @@ LWLockRelease(LWLockId lockid)
 	head = lock->head;
 	if (head != NULL)
 	{
-		if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
+		if (wakeup && lock->releaseOK)
 		{
 			/*
 			 * Remove the to-be-awakened PGPROCs from the queue.  If the front
@@ -625,6 +739,14 @@ LWLockRelease(LWLockId lockid)
 			proc->lwWaitLink = NULL;
 			/* prevent additional wakeups until retryer gets to run */
 			lock->releaseOK = false;
+
+#ifdef HAVE_COMPARE_AND_SWAP
+			/*
+			 * Clear the waiters bit if we removed all of the waiters.
+			 */
+			if (lock->head == NULL)
+				LWLockCASClearWaitMark(lock);
+#endif
 		}
 		else
 		{
@@ -636,6 +758,9 @@ LWLockRelease(LWLockId lockid)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
+#ifdef HAVE_COMPARE_AND_SWAP
+fast_exit:
+#endif
 	TRACE_POSTGRESQL_LWLOCK_RELEASE(lockid);
 
 	/*
@@ -697,3 +822,131 @@ LWLockHeldByMe(LWLockId lockid)
 	}
 	return false;
 }
+
+#ifdef HAVE_COMPARE_AND_SWAP
+
+/*
+ * LWLockCASAcquire - attempt to acquire an LWLock using compare-and-swap
+ *
+ * Returns true if the LWLock has been acquired, false if not.
+ *
+ * If wait_ok is true, and the lock cannot be acquired, try to set the waiters
+ * bit instead.
+ */
+static bool
+LWLockCASAcquire(volatile LWLock *lock, LWLockMode mode, int wait_ok)
+{
+	int			current_guess = LWSTATE_UNLOCKED;
+
+	while (1)
+	{
+		int		delta;
+		int		result;
+
+		/* Compute delta to apply, or give up! */
+		if (mode == LW_EXCLUSIVE)
+		{
+			if (LWSTATE_IS_UNLOCKED(current_guess))
+				delta = LWSTATE_EXCLUSIVE;
+			else if (LWSTATE_HAS_WAITERS(current_guess) || !wait_ok)
+				return false;
+			else
+				delta = LWSTATE_WAITERS;
+		}
+		else
+		{
+			if (!LWSTATE_IS_EXCLUSIVE(current_guess))
+				delta = LWSTATE_SHARED;
+			else if (LWSTATE_HAS_WAITERS(current_guess) || !wait_ok)
+				return false;
+			else
+				delta = LWSTATE_WAITERS;
+		}
+
+		/* Attempt to add delta. */
+		result = cas_int32(&lock->state, current_guess, current_guess + delta);
+
+		/* If it worked, we're done. */
+		if (current_guess == result)
+			return (delta != LWSTATE_WAITERS);
+
+		/* We are unlucky and must retry. */
+		current_guess = result;
+	}
+}
+
+/*
+ * LWLockCASRelease - release an LWLock using compare-and-swap
+ *
+ * If wake_ok is true, the lock is released always; if false, it's released
+ * only when no wakeups are required.  The return value is true if wakeups
+ * are required and false otherwise.
+ */
+static bool
+LWLockCASRelease(volatile LWLock *lock, LWLockMode mode, int wake_ok)
+{
+	int			current_guess;
+	int			delta;
+
+	if (mode == LW_EXCLUSIVE)
+	{
+		current_guess = LWSTATE_EXCLUSIVE;
+		delta = LWSTATE_EXCLUSIVE;
+	}
+	else
+	{
+		current_guess = LWSTATE_SHARED;
+		delta = LWSTATE_SHARED;
+	}
+
+	while (1)
+	{
+		int		result;
+
+		/* Attempt to subtract delta. */
+		result = cas_int32(&lock->state, current_guess, current_guess - delta);
+		if (current_guess == result)
+			return LWSTATE_NEEDS_WAKEUP(result);
+
+		/* Be careful about releasing the last lock. */
+		if (!wake_ok && LWSTATE_NEEDS_WAKEUP(result))
+			return true;
+
+#ifdef USE_ASSERT_CHECKING
+		if (mode == LW_EXCLUSIVE)
+			Assert(LWSTATE_IS_EXCLUSIVE(result));
+		else
+			Assert(LWSTATE_IS_SHARED(result));
+#endif
+
+		/* We are unlucky and must retry. */
+		current_guess = result;
+	}
+}
+
+/*
+ * LWLockCASClearWaitMark - remove waiters mark from lock state
+ */
+static void
+LWLockCASClearWaitMark(volatile LWLock *lock)
+{
+	int			current_guess = LWSTATE_UNLOCKED + LWSTATE_WAITERS;
+
+	while (1)
+	{
+		int		result;
+
+		/* Attempt to subtract delta. */
+		result = cas_int32(&lock->state, current_guess,
+						   current_guess - LWSTATE_WAITERS);
+		if (current_guess == result)
+			return;
+
+		Assert(LWSTATE_HAS_WAITERS(result));
+
+		/* We are unlucky and must retry. */
+		current_guess = result;
+	}
+}
+
+#endif
diff --git a/src/include/storage/atomic.h b/src/include/storage/atomic.h
new file mode 100644
index 0000000..a1f909f
--- /dev/null
+++ b/src/include/storage/atomic.h
@@ -0,0 +1,65 @@
+/*-------------------------------------------------------------------------
+ *
+ * atomic.h
+ *	   Hardware-dependent atomic primitives.
+ *
+ *	NOTE: No code must rely absolutely on the availability of the
+ *  primitives in this file.  On architectures where don't have a
+ *  working implementation, or where they are not supported, use
+ *  spinlocks or some other mechanism.
+ * 
+ *	CAUTION: Be sure that any operation defined here represents a sequence
+ *  point - ie, loads and stores of other values must not be moved across
+ *	a lock or unlock.  In most cases it suffices to make the operation be
+ *	done through a "volatile" pointer.
+ *
+ * Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/include/storage/atomic.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ATOMIC_H
+#define ATOMIC_H
+
+#if defined(__GNUC__) || defined(__INTEL_COMPILER)
+
+/* 32-bit i386, AMD Opteron, Intel EM64T */
+#if defined(__i386__) || defined(__x86_64__)
+
+#define HAVE_COMPARE_AND_SWAP	1
+
+/*
+ * We might someday want to have more than one CAS variant (e.g. CAS is
+ * commonly used with pointers to build lock-free data structures), but int32
+ * is sufficient for now.
+ */
+static __inline__ int32
+cas_int32(volatile int32 *var, volatile int32 old, volatile int32 new)
+{
+	/*
+	 * cmpxchg compares EAX to its second operand (which must be a memory
+	 * location).  If they are equal, it writes its first operand to that
+	 * address; otherwise, it writes the original value back to that address.
+	 * In either case, the original value of the memory location is left in
+	 * EAX.  We therefore force GCC to allocate old in EAX (using the "a"
+	 * constraints).  We mark memory and the condition code register as
+	 * clobbered.  The former is not true but prevents GCC from reordering
+	 * other instructions around our atomic primitive; the latter is correct
+	 * (ZF will be set if old was equal to *var).
+	 */
+	__asm__ __volatile__(
+		"	lock			\n"
+		"	cmpxchg %2,%1	\n"
+:		"+a" (old), "+m" (*var)
+:		"q" (new), "a" (old)
+:		"memory", "cc");
+	return old;
+}
+
+#endif	 /* defined(__i386__) || defined(__x86_64__) */
+
+#endif	 /* defined(__GNUC__) || defined(__INTEL_COMPILER) */
+
+#endif	 /* ATOMIC_H */

Greg Stark

stark@mit.edu

over 14 years ago

In reply to: Robert Haas (#1)

Re: spinlock contention

On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

Well as Tom observed earlier the kernel of a snapshot is actually a
LSN. A snapshot contains a set of xids which all committed before some
LSN and none which committed after it.

So if we had a record of what log sequence number the commit record
for any given transaction is we could build the snapshot at our
leisure without any exclusive lock. In fact we could even build it
lazily as a kind of cache only when we actually are interested in a
given xid.

--
greg

#10

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Greg Stark (#9)

Re: spinlock contention

On Sat, Jun 25, 2011 at 8:26 PM, Greg Stark <stark@mit.edu> wrote:

On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

ProcArrayLock looks like a tougher nut to crack - there's simply no
way, with the system we have right now, that you can take a snapshot
without locking the list of running processes. I'm not sure what to
do about that, but we're probably going to have to come up with
something, because it seems clear that once we eliminate the lock
manager LWLock contention, this is a major bottleneck.

Well as Tom observed earlier the kernel of a snapshot is actually a
LSN. A snapshot contains a set of xids which all committed before some
LSN and none which committed after it.

So if we had a record of what log sequence number the commit record
for any given transaction is we could build the snapshot at our
leisure without any exclusive lock. In fact we could even build it
lazily as a kind of cache only when we actually are interested in a
given xid.

Yeah, I've been thinking about that. I think what we might do is set
up a new SLRU that works like CLOG, but each entry is an LSN rather
than just two bits. When a transaction commits, we save the commit
LSN under the entry for that XID. We truncate away SLRU pages that
contain no still-running XIDs. When we need to check whether an XID
is visible to our snapshot, we just look up the commit LSN and compare
it with our snapshot LSN. If it's before and non-zero, we can see it.
If it's after or all-zeroes, we can't.

But I'm not sure how much this would really help. It might (subject
to working out the details) make the actual process of taking a
snapshot faster. But it's not clear that taking snapshots more
quickly will actually help anything, because the problem is not the
amount of time spending taking the snapshot. The problem is rather
that doing so requires acquiring and releasing an LWLock, and each of
those operations requires taking and releasing a spinlock. And it is
the spinlock contention that is killing us. That makes me think we
need a way to take a snapshot without taking a spinlock. Or if we
must take spinlocks, we at least have to avoid every backend that
needs a snapshot lusting after the *same* spinlock.

What I've been thinking about this weekend is whether it might be
possible to create a sort of lock-free queue infrastructure. When a
backend starts up, it would add an entry to the queue saying "I'm
running". When it commits, it would add an entry to the queue saying
"I'm committed". All entries would be added at the *end* of the
queue, so a backend scanning the queue to build up a snapshot wouldn't
ever be able to see commits out of order. We would need some memory
barrier operations on weak-memory-ordering machines to make sure that
the queue writes became visible before the end-of-queue pointer bump.

The trick is figuring out how to clean up the queue. Since "commit"
entries exist only to guard against "running" entries earlier in the
queue, the start-of-queue pointer can be advanced whenever it points
to a "commit" entry. Also, if it points to a "running" entry for
which there is a later "commit" entry, then the start-of-queue pointer
can be advanced over that as well. However, just because we can
advance the point at which backends start reading doesn't mean that we
can actually recycle space, because while we know that new scans
needn't worry about those entries, we *don't* know that there isn't
already a scan in flight that still needs them. Furthermore, if a
transaction runs for a long time, we can never advance the
start-of-queue pointer past the "running" entry for its XID, which is
obviously problematic since the queue would get very long.

To work around that problem, I think we could use Florian's idea
upthread of an RCU system. We keep two copies of the queue around, an
A copy and a B copy. When the currently active copy fills up, we
rewrite it into the other queue, omitting all "committed" entries and
any "running" entries that have matching "committed" entries, and then
tell everyone to start looking at that copy instead. We would need
some kind of gymnastics to make sure that we don't flip from the A
copy to the B copy and back to the A copy while some laggardly backend
is still hoping to scan the old A copy. A simple algorithm (there's
probably a smarter one) would be to have each backend take a spinlock
while it's scanning either copy, and to have the backend that is doing
the rewrite take and release all of those spinlocks one at a time
before beginning the rewrite, thus guaranteeing that any scans still
in progress when the rewrite is requested have completed before it's
actually performed. Any new scans started in the meanwhile will
certainly be looking at the current copy rather than the old copy
we're about to overwrite.

We would still need a lock around the operation of adding new items to
the queue; if two backends try to do that at the same time, chaos will
ensue. But if that lock gets busy, it would be possible to have
backends use some sort of system of elimination buffers to combine
their requests. If ten backends each want to advertise a new XID, it
will be far faster to have one backend acquire the lock, write in all
ten entries, and wake everyone up than to have each backend insert its
entry and wake up the next one. So what we might do is doing a
condition-acquire on the lock to insert data into the buffer and, if
we can't get it immediately, go find another backend with the same
problem and elect a leader to do all the insertions. If we're the
leader, sleep on the queue-insertion lock. If not, sleep until the
leader wakes us up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#8)

Re: spinlock contention

On Jun23, 2011, at 23:40 , Robert Haas wrote:

I tried rewriting the LWLocks using CAS. It actually seems to make
things slightly worse on the tests I've done so far, perhaps because I
didn't make it respect spins_per_delay. Perhaps fetch-and-add would
be better, but I'm not holding my breath. Everything I'm seeing so
far leads me to the belief that we need to get rid of the contention
altogether, not just contend more quickly.

I got curious and put a tool together to benchmark this. The code can
be found at https://github.com/fgp/lockbench.

The tool starts N workers who each acquire a lock in SHARE mode, increment
a per-worker counter and release the lock again, rinse, repeat. That is
done for T seconds, after which it reports the sum of the individual
counters, the elapsed (wall) time and the sums of the user and system times
consumed by the workers. The lock implementations currently supported are

atomicinc: RW-Lock using atomic increment/decrement instructions
(locked xaddl) to acquire and release the lock in SHARE
mode in the non-contested case. The contested case is
unimplemented.

cmpxchng: Like atomicinc, but uses "locked cmpxchng" loop instead of
"locked xaddl".

spin: RW-Lock built around a simple cmpxchng-based spinlock (i.e.,
to lock/unlock spinlock is taken, shared-lockers count is
incremented/ decremented, spinlock is released). Similar to
pg_lwlock, but without the sleep() in the contested path of
the spinlock. The contested case of the RW-Lock is again
unimplemented.

none: No locking.

pg_lwlock: Postgres LWLock implementation. The contested case doesn't
work because the workers currently don't have a valid MyProc.

pw_lwlock_cas: Like pg_lwlock but with your CAS patch applied.

On an 4-core Intel Xeon system (Intel Xeon X3430, 2.4 GHz, 8MB cache) I get
the following results:

| 1 worker | 2 workers | 3 workers | 4 workers |
|wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
| / / | / / | / / | / / |
|cycles cycles|cycles cycles|cycles cycles|cycles cycles|
--------------------------------------------------------------------------
none |2.5e-09 2.5e-09|1.3e-09 2.5e-09|8.5e-10 2.5e-09|6.8e-10 2.5e-09|
atomicinc|1.8e-08 1.8e-08|2.8e-08 5.6e-08|2.7e-08 8.1e-08|2.9e-08 1.1e-07|
cmpxchng |3.0e-08 3.0e-08|7.1e-08 1.4e-07|6.9e-08 2.0e-07|7.2e-08 2.9e-07|
spin |5.0e-08 5.0e-08|3.0e-07 5.9e-07|3.8e-07 1.1e-06|4.0e-07 1.5e-06|
pg_lwlock|2.8e-08 2.8e-08|2.9e-08 3.0e-08|3.0e-08 3.2e-08|2.9e-08 3.1e-08|
pg_lwlock|2.6e-08 2.6e-08|6.2e-08 1.2e-07|7.7e-08 2.3e-07|7.8e-08 3.1e-07|
_cas| | | | |
--------------------------------------------------------------------------

These results allow the following conclusions to be drawn

First, our current pg_lwlock is quite good at preserving resources, i.e.
at not spinning excessively - at least for up to four workers. It's the only
lock implementation that consistently uses about 30ns of processor time for
one acquire/release cycle. Only atomicinc with one worker (i.e. no cachline
fighting) manages to beat that. It does, however, effectively serializate
execution with this workload - the consumed user time is approximately equal
to the elapsed wall clock time, even in the case of 4 workers, meaning that
most of the time 3 of those 4 workers are sleeping.

Secondly, pg_lwlock_cas and cmpxchng perform simiarly, which comes at no
surprise, since use effectively the same algorithm. One could expect spin
and pg_lwlock to also perform similarly, but these two don't. The reason
is probably that spin uses a rather brain-dead spinning method, while
pg_lwlock uses "rep; nop".

Now to atomicinc, which is (the fast-path of) the most optimized RW-lock
possible, at least on x86 and x86_64. One interesting result here is that
wall time seconds / cycle are suspiciously for atomicinc and pg_lwlock.
This proves, I think, that the limiting factor in both of these tests is
the speed at which cache line invalidation can occur. It doesn't seem to
matter whether the other cores are idle (as in the pg_lwlock case), or
trying to obtain ownership of the cache line themselves (as in the
atomicinc case).

Finally, the no-locking test (none) shows that, even though the counter
is written to shared memory after every increment, memory bandwith isn't
an issue here, since performance scaled nearly linearly with the number
of workers here.

Here's the result of another run, this time with the per-worker counter
being increment 100 in each acquire/release cycle.

------------------- 100 Increments per Cycle -----------------------------
| 1 worker | 2 workers | 3 workers | 4 workers |
|wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
| / / | / / | / / | / / |
|cycles cycles|cycles cycles|cycles cycles|cycles cycles|
--------------------------------------------------------------------------
none |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|6.5e-08 2.6e-07|
atomicinc|2.5e-07 2.5e-07|1.3e-07 2.6e-07|8.6e-08 2.5e-07|6.5e-08 2.6e-07|
cmpxchng |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.6e-08 2.6e-07|7.8e-08 3.1e-07|
spin |3.0e-07 3.0e-07|1.5e-07 3.0e-07|1.1e-07 3.2e-07|3.4e-07 1.4e-06|
pg_lwlock|2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|1.2e-07 4.8e-07|
pg_lwlock|2.6e-07 2.6e-07|1.6e-07 3.1e-07|1.1e-07 3.2e-07|8.7e-08 3.4e-07|
_cas| | | | |
--------------------------------------------------------------------------

This makes the overhead of the locking disappear in the noise in the
single-worker case. With 4 workers, however, differences materialize.
The no-locking case still shows approximately linear speedup though,
which I think again rules out memory bandwith effects.

Now, atomicinc is the only implementation which doesn't perform
significantly worse than the no-lock case. The closest competitors are
the CAS-based implementation cmpxchng and pg_lwlock_cas, with 20% and
33% slowdown respectively (of both wall seconds and user seconds per cycle).

The current LWLock implementation falls behind now, taking about twice as
many seconds per cycle compared to atomicinc.

In other words, with this workload an atomicinc-based RW-lock, taken in
SHARE mode should add virtually no overhead, while every other
implementation adds at least 20%.

It'd be interesting to get benchmark results also on machines with more
than four cores, but I don't have access to any such machine ATM. So
if someone could run that on a 8 or 16-core beast, that'd be great. The
code compiles on Mac OS X and Ubuntu 10.04. It's fairly portable, to
make the postgres parts compile on other platforms, it might be necessary
to adjust pg_stub/pg_config_manual.h.

best regards,
Florian Pflug

#12

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#11)

Re: spinlock contention

On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <fgp@phlo.org> wrote:

[ testing of various spinlock implementations ]

I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core
loaner from Nate Boley:

100 counter increments per cycle
worker 1 2 4 8 16 32
time wall user wall user wall user wall user wall user wall user
none 2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc 3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng 3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin 4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock 3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

I wrote a little script to show to reorganize this data in a
possibly-easier-to-understand format - ordering each column from
lowest to highest, and showing each algorithm as a multiple of the
cheapest value for that column:

wall-1: none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
user-1: none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
wall-2: none(1.0),atomicinc(1.7),pg_lwlock(1.8),spin(1.9),pg_lwlock_cas(1.9),cmpxchng(2.3)
user-2: none(1.0),atomicinc(1.7),pg_lwlock(1.8),pg_lwlock_cas(1.9),spin(1.9),cmpxchng(2.3)
wall-4: none(1.0),atomicinc(1.7),pg_lwlock_cas(1.7),pg_lwlock(1.9),spin(2.0),cmpxchng(4.0)
user-4: none(1.0),atomicinc(1.7),pg_lwlock_cas(1.8),pg_lwlock(1.9),spin(2.0),cmpxchng(4.1)
wall-8: none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(6.9),pg_lwlock(9.3),spin(28.6)
user-8: none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(7.0),pg_lwlock(9.4),spin(28.5)
wall-16: none(1.0),atomicinc(7.1),pg_lwlock_cas(9.0),cmpxchng(20.0),pg_lwlock(76.2),spin(181.0)
user-16: none(1.0),atomicinc(7.0),pg_lwlock_cas(9.1),cmpxchng(20.0),pg_lwlock(75.8),spin(184.8)
wall-32: none(1.0),atomicinc(13.6),pg_lwlock_cas(21.8),cmpxchng(40.9),pg_lwlock(581.8),spin(1272.7)
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

Here's the result of the same calculation applied to your second set of data:

wall-1: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
user-1: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
wall-2: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),spin(1.2),pg_lwlock_cas(1.2)
user-2: none(1.0),cmpxchng(1.0),pg_lwlock(1.0),atomicinc(1.0),spin(1.2),pg_lwlock_cas(1.2)
wall-3: none(1.0),pg_lwlock(1.0),atomicinc(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
user-3: none(1.0),atomicinc(1.0),pg_lwlock(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
wall-4: none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.2)
user-4: none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.4)

There seems to be something a bit funky in your 3-core data, but
overall I read this data to indicate that 4 cores aren't really enough
to see a severe problem with spinlock contention. I'm not even sure
that you can see this problem on a real workload on a 16-core system.
But as you add cores the problem gets rapidly worse, because the more
cores you have, the more likely it is that someone else is already
holding the spinlock (or has just very recently held the spinlock)
that you want. And, of course, the problem multiplies itself, because
once your lock acquistions start taking longer, they become a larger
percentage of your execution time, which makes it proportionately more
likely that you'll find someone already there when you go to grab the
lock.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13

Merlin Moncure

mmoncure@gmail.com

over 14 years ago

In reply to: Robert Haas (#12)

Re: spinlock contention

On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:

user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contentionl?

merlin

#14

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Merlin Moncure (#13)

Re: spinlock contention

On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:

user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contentionl?

Well, you'd think so, but in fact that patch makes it slower. Don't
ask me why, 'cuz I dunno. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#14)

Re: spinlock contention

On Jun28, 2011, at 23:48 , Robert Haas wrote:

On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:

user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contentionl?

Well, you'd think so, but in fact that patch makes it slower. Don't
ask me why, 'cuz I dunno. :-(

Huh? Where do you see your CAS patch being slower than unpatched LWLocks
in these results? Or are you referring to pgbench runs you made, not
to these artificial benchmarks?

best regards,
Florian Pflug

#16

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#15)

Re: spinlock contention

On Tue, Jun 28, 2011 at 5:55 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jun28, 2011, at 23:48 , Robert Haas wrote:

On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:

user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

I may not be following all this correctly, but doesn't this suggest a
huge potential upside for the cas based patch you posted upthread when
combined with your earlier patches that were bogging down on spinlock
contentionl?

Well, you'd think so, but in fact that patch makes it slower. Don't
ask me why, 'cuz I dunno. :-(

Huh? Where do you see your CAS patch being slower than unpatched LWLocks
in these results? Or are you referring to pgbench runs you made, not
to these artificial benchmarks?

pgbench -S

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#12)

Re: spinlock contention

On Jun28, 2011, at 22:18 , Robert Haas wrote:

On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <fgp@phlo.org> wrote:

[ testing of various spinlock implementations ]

I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core
loaner from Nate Boley:

Cool, thanks!

100 counter increments per cycle
worker 1 2 4 8 16 32
time wall user wall user wall user wall user wall user wall user
none 2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc 3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng 3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin 4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock 3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

Here's the same table, formatted with spaces.

worker 1 2 4 8 16 32
time wall user wall user wall user wall user wall user wall user
none 2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc 3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng 3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin 4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock 3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

And here's the throughput table calculated from your results,
i.e. the wall time per cycle relative to the wall time per cycle
for 1 worker.

workers 2 4 8 16 32
none 1.9 3.5 6.7 13 26
atomicinc 1.4 2.6 2.6 2.4 2.4
cmpxchng 1.1 1.1 1.2 0.9 0.8
spin 1.5 2.6 0.3 0.1 0.0
pg_lwlock 1.4 2.5 1.0 0.2 0.1
pg_lwlock_cas 1.3 2.6 2.6 1.9 1.5

Hm, so in the best case we get 2.6x the throughput of a single core,
and that only for 4 and 8 workers (1.4e-7 second / cycle vs 3.6e-7).
In that case, there also seems to be little difference between
pg_lwlock{_cas} and atomicinc. atomicinc again manages to at least
sustain that throughput when the worker count is increased, while
for for the others the throughput actually *decreases*.

What totally puzzles me is that your results don't show any
trace of a decreased system load for the pg_lwlock implementation,
which I'd have expected due to the sleep() in the contested
path. Here are the user vs. wall time ratios - I'd have expected
to see value significantly below the number of workers for pg_lwlock

none 1.0 2.0 4.0 7.9 16 31
atomicinc 1.0 2.0 3.9 7.9 15 33
cmpxchng 1.0 2.0 4.1 7.9 16 31
spin 1.0 2.0 3.9 7.8 16 31
pg_lwlock 1.0 2.0 4.1 7.9 16 31
pg_lwlock_cas 1.0 2.0 4.1 7.9 16 31

I wrote a little script to show to reorganize this data in a
possibly-easier-to-understand format - ordering each column from
lowest to highest, and showing each algorithm as a multiple of the
cheapest value for that column:

If you're OK with that, I'd like to add that to the lockbench
repo.

There seems to be something a bit funky in your 3-core data, but
overall I read this data to indicate that 4 cores aren't really enough
to see a severe problem with spinlock contention.

Hm, it starts to show if you lower the counter increment per cycle
(the D constant in run.sh). But yeah, it's never as bad as the
32-core results above..

best regards,
Florian Pflug

#18

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#17)

1 attachment(s)

Re: spinlock contention

On Tue, Jun 28, 2011 at 8:11 PM, Florian Pflug <fgp@phlo.org> wrote:

I wrote a little script to show to reorganize this data in a
possibly-easier-to-understand format - ordering each column from
lowest to highest, and showing each algorithm as a multiple of the
cheapest value for that column:

If you're OK with that, I'd like to add that to the lockbench
repo.

It's a piece of crap, but you're welcome to try to fix it up. See attached.

With respect to the apparent lack of impact of the sleep loop, I can
only speculate that this test case doesn't really mimic real-world
conditions inside the backend very well. There is a lot more going on
there - multiple locks, lots of memory access, etc. But I don't
really understand it either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Robert Haas (#1)

Re: spinlock contention

On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jun12, 2011, at 23:39 , Robert Haas wrote:

So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).

Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.

Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...

Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

I did some benchmarking, on the 32-core system from Nate Boley, with
pgbench -n -S -c 80 -j 80. With master+fastlock+lazyvxid, I can hit
180-200k TPS in the 32-client range. But at 80 clients throughput
starts to fall off quite a bit, dropping down to about 80k TPS. So
then, just for giggles, I inserted a "return;" statement at the top of
AcceptInvalidationMessages(), turning it into a noop. Performance at
80 clients shot up to 210k TPS. I also tried inserting an
acquire-and-release cycle on MyProc->backendLock, which was just as
good. That seems like a pretty encouraging result, since there appear
to be several ways of reimplementing SIGetDataEntries() that would
work with a per-backend lock rather than a global one.

I did some other experiments, too. In the hopes of finding a general
way to reduce spinlock contention, I implemented a set of "elimination
buffers", where backends that can't get a spinlock go and try to
combine their requests and then send a designated representative to
acquire the spinlock and acquire shared locks simultaneously for all
group members. It took me a while to debug the code, and I still
can't get it to do anything other than reduce performance, which may
mean that I haven't found all the bugs yet, but I'm not feeling
encouraged. Some poking around suggests that the problem isn't that
spinlocks are routinely contended - it seems that we nearly always get
the spinlock right off the bat. I'm wondering if the problem may be
not so much that we have continuous spinlock contention, but rather
than every once in a while a process gets time-sliced out while it
holds a spinlock. If it's an important spinlock (like the one
protecting SInvalReadLock), the system will quickly evolve into a
state where every single processor is doing nothing but trying to get
that spinlock. Even after the hapless lock-holder gets to run again
and lets go of the lock, you have a whole pile of other backends who
are sitting there firing of lock xchgb in a tight loop, and they can
only get it one at a time, so you have ferocious cache line contention
until the backlog clears. Then things are OK again for a bit until
the same thing happens to some other backend. This is just a theory,
I might be totally wrong...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#19)

Re: spinlock contention

On Jul7, 2011, at 03:35 , Robert Haas wrote:

On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jun12, 2011, at 23:39 , Robert Haas wrote:

So, the majority (60%) of the excess spinning appears to be due to
SInvalReadLock. A good chunk are due to ProcArrayLock (25%).

Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
so I guess that two adjacent LWLocks currently share one cache line.

Currently, the ProcArrayLock has index 4 while SInvalReadLock has
index 5, so if I'm not mistaken exactly the two locks where you saw
the largest contention on are on the same cache line...

Might make sense to try and see if these numbers change if you
either make LWLockPadded 64bytes or arrange the locks differently...

I fooled around with this a while back and saw no benefit. It's
possible a more careful test would turn up something, but I think the
only real way forward here is going to be to eliminate some of that
locking altogether.

I also tried inserting an
acquire-and-release cycle on MyProc->backendLock, which was just as
good. That seems like a pretty encouraging result, since there appear
to be several ways of reimplementing SIGetDataEntries() that would
work with a per-backend lock rather than a global one.

I did some research and found "TLRW: Return of the Read-Write Lock"
by D. Dice and N. Shavit. PDF can be found here
http://transact09.cs.washington.edu/19_paper.pdf

They describe a read-write lock called "bytelock" with set of "slots"
for shared lockers, where each shared locker only needs to increment his
slot, and thus doesn't interfere with other concurrent shared locking
attempts. The price is that, after incrementing its slot, a shared
locker must issue a fencing instruction and then verify that no writer
has taken the lock in the mean while. In their application, they
distinguish between "slotted" threads - those who own a designated
slot, and "unslotted" thread - those who operation on the normal
shared counter.

I implemented that idea, but with a few modifications. First, I don't
distinguish between "slotted" and "unslotted" threads, but allow
multiple backends to share a slot. This means my implementation cannot
use a simple unlocked increment to update the slot, but instead uses
"locked xadd". I also moved the slots out of the lock itself, and into
a separate set of LWLockPart structures. Each such structure contains
the counters for one slot id and multiple locks.

In effect, the resulting thing is an LWLock with a partitioned shared
counter. The partition one backend operates on for shared locks is
determined by its backend id.

I've added the implementation to the lock benchmarking tool at
https://github.com/fgp/lockbench
and also pushed a patched version of postgres to
https://github.com/fgp/postgres/tree/lwlock_part

The number of shared counter partitions is current 4, but can easily
be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
intrinsic if available, otherwise it falls back to using a per-partition
spin lock.

Benchmarking results look promising so far. This is the through-put
I get, in acquisition/release cycles per second, for the LWLock
implementations on a 4-core machine. pg_lwlock_part,N,X is the
partitioned lock with N lock partitions and using LOCKED XADD if
X == yes. The numbers are normalized to the through-put in the
single-worker case.

---------- Cycles / Second (Wall-Time) ---------------------
1 2 4 8 16 worker
wall wall wall wall wall time
1 1.9 3.9 3.9 3.9 none
1 0.2 0.9 0.8 0.6 pg_lwlock_orig
1 0.4 0.3 0.3 0.3 pg_lwlock_cas
1 0.3 1.0 0.8 0.6 pg_lwlock_part,1,no
1 0.4 0.4 0.4 0.4 pg_lwlock_part,1,yes
1 2.0 0.4 0.4 0.3 pg_lwlock_part,2,no
1 2.0 0.7 0.8 0.7 pg_lwlock_part,2,yes
1 2.0 4.1 2.1 1.0 pg_lwlock_part,4,no
1 2.1 3.8 1.8 2.2 pg_lwlock_part,4,yes

These numbers show that as long is the number of workers does not
exceed the number of lock partitions, throughput increases approximately
linearly. It drops off afterwards, but I that's to be expected -
the test executes LQLockAcquire/Release in a tight loop, so once there's
any kind of contention (even cache-line bouncing), performance drops.
This is also why the original LWLock beats CAS and the partitioned
lock with less than 4 partitions here - effectively, at least half of
the workers are sleeping at any given time, as the following
user vs. wall time numbers show.

---------- User-Time / Wall-Time ----------------------------
1 2 4 8 16 worker
1.0 1.9 3.9 3.9 3.9 none
1.0 1.9 1.1 1.3 1.8 pg_lwlock_orig
1.0 2.0 4.0 4.0 4.0 pg_lwlock_cas
1.0 1.9 1.1 1.4 1.8 pg_lwlock_part,1,no
1.0 2.0 4.0 3.9 4.0 pg_lwlock_part,1,yes
1.0 2.0 3.6 3.8 3.6 pg_lwlock_part,2,no
1.0 2.0 3.8 3.9 3.9 pg_lwlock_part,2,yes
1.0 2.0 4.1 4.1 3.9 pg_lwlock_part,4,no
1.0 2.0 4.0 3.9 3.9 pg_lwlock_part,4,yes

I lack the hardware to produce any meaningful benchmark results
with pgbench for this, so again I'd be very cool if someone
(Robert? ;-) who has access to that kind of hardware could give the
patched version from github a spin with pgbench.

The patched version is current set of 4 shared counter partitions,
and should use LOCKED XADD if compiled with GCC >= 4.1 (or ICC).

I'm wondering if the problem may be
not so much that we have continuous spinlock contention, but rather
than every once in a while a process gets time-sliced out while it
holds a spinlock. If it's an important spinlock (like the one
protecting SInvalReadLock), the system will quickly evolve into a
state where every single processor is doing nothing but trying to get
that spinlock. Even after the hapless lock-holder gets to run again
and lets go of the lock, you have a whole pile of other backends who
are sitting there firing of lock xchgb in a tight loop, and they can
only get it one at a time, so you have ferocious cache line contention
until the backlog clears. Then things are OK again for a bit until
the same thing happens to some other backend. This is just a theory,
I might be totally wrong...

Hm, but one would expect most of the spin lock contesters to have
given up and gone to sleep by the time the interrupted lock holder
resumes. If that is not what happens, then maybe we should revisit
the logic in SpinLockAcquire()...

best regards,
Florian Pflug

#21

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#20)

Re: spinlock contention

On Thu, Jul 7, 2011 at 5:54 AM, Florian Pflug <fgp@phlo.org> wrote:

In effect, the resulting thing is an LWLock with a partitioned shared
counter. The partition one backend operates on for shared locks is
determined by its backend id.

I've added the implementation to the lock benchmarking tool at
https://github.com/fgp/lockbench
and also pushed a patched version of postgres to
https://github.com/fgp/postgres/tree/lwlock_part

The number of shared counter partitions is current 4, but can easily
be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
intrinsic if available, otherwise it falls back to using a per-partition
spin lock.

I think this is probably a good trade-off for locks that are most
frequently taken in shared mode (like SInvalReadLock), but it seems
like it could be a very bad trade-off for locks that are frequently
taken in both shared and exclusive mode (e.g. ProcArrayLock,
BufMappingLocks).

I don't want to fiddle with your git repo, but if you attach a patch
that applies to the master branch I'll give it a spin if I have time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#21)

1 attachment(s)

Re: spinlock contention

On Jul7, 2011, at 18:09 , Robert Haas wrote:

On Thu, Jul 7, 2011 at 5:54 AM, Florian Pflug <fgp@phlo.org> wrote:

In effect, the resulting thing is an LWLock with a partitioned shared
counter. The partition one backend operates on for shared locks is
determined by its backend id.

I've added the implementation to the lock benchmarking tool at
https://github.com/fgp/lockbench
and also pushed a patched version of postgres to
https://github.com/fgp/postgres/tree/lwlock_part

The number of shared counter partitions is current 4, but can easily
be adjusted in lwlock.h. The code uses GCC's atomic fetch and add
intrinsic if available, otherwise it falls back to using a per-partition
spin lock.

I think this is probably a good trade-off for locks that are most
frequently taken in shared mode (like SInvalReadLock), but it seems
like it could be a very bad trade-off for locks that are frequently
taken in both shared and exclusive mode (e.g. ProcArrayLock,
BufMappingLocks).

That's definitely a concern. One idea I had to alleviate that is to
add a per-partition "used" flag to the LWLock struct, and set that
to true (if not already set) before incrementing a partition's
shared counter. Exclusive lockers would then only need to inspect those
partitions for which the flag is set, and would clear all flags after
having acquires the lock.

I actually tried to do that initially, but figuring out where and how
to safely clear the flag again turned out to be a bit hairy. So I decided
to keep it simple first, and wait to see whether that problem manifests
itself in pratice.

I don't want to fiddle with your git repo, but if you attach a patch
that applies to the master branch I'll give it a spin if I have time.

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of "locked xadd" to increment the shared counters.

best regards,
Florian Pflug

Attachments:

pg_lwlock_part.patchapplication/octet-stream; name=pg_lwlock_part.patchDownload

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0fe7ce4..9cf6513 100644
*** a/src/backend/storage/lmgr/lwlock.c
--- b/src/backend/storage/lmgr/lwlock.c
***************
*** 31,36 ****
--- 31,38 ----
  #include "storage/predicate.h"
  #include "storage/proc.h"
  #include "storage/spin.h"
+ #include "storage/backendid.h"
+ #include "storage/atomic.h"
  
  
  /* We use the ShmemLock spinlock to protect LWLockAssign */
*************** typedef struct LWLock
*** 42,48 ****
--- 44,52 ----
  	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
  	bool		releaseOK;		/* T if ok to release waiters */
  	char		exclusive;		/* # of exclusive holders (0 or 1) */
+ #if LWLOCK_LOCK_PARTS == 1
  	int			shared;			/* # of shared holders (0..MaxBackends) */
+ #endif
  	PGPROC	   *head;			/* head of list of waiting PGPROCs */
  	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
  	/* tail is undefined when head is NULL */
*************** NON_EXEC_STATIC LWLockPadded *LWLockArra
*** 77,82 ****
--- 81,275 ----
  
  
  /*
+  * Partitioned shared counter case
+  */
+ #if LWLOCK_LOCK_PARTS > 1
+ 
+ /*
+  * Size of one LWLock shared counter partition in bytes.
+  * Choosen to fill exactly one cache line on common architectures.
+  */
+ #define LWLOCK_PART_SIZE 64
+ 
+ /*
+  * Number of locks per partition.
+  * If we don't have an atomic increment instruction, we fall back to using a
+  * per-partition mutex
+  */
+ #ifdef HAVE_ATOMIC_FETCH_ADD_FENCE_32
+ 
+ #define LWLOCK_PART_LOCKS (LWLOCK_PART_SIZE / sizeof(int))
+ 
+ #else /* !HAVE_ATOMIL_FETCH_ADD_FENCE_32 */
+ 
+ #define LWLOCK_PART_LOCKS ((LWLOCK_PART_SIZE - sizeof(slock_t)) / sizeof(int))
+ 
+ #endif /* HAVE_ATOMIC_FETCH_ADD_FENCE_32 */
+ 
+ /* Number of LWLockPart structs we need for nlocks locks. */
+ #define LWLOCK_PARTS(nlocks) (\
+ 	(1 + (nlocks-1) / LWLOCK_PART_LOCKS) * LWLOCK_LOCK_PARTS \
+ )
+ 
+ /*
+  * Shared-counter partition for LWLocks. Each Backend increments the
+  * shared counter in *its* partition when acquiring a LWLock in shared
+  * mode. We limit the size of one partition to the size of a cache-line,
+  * and therefore create LWLOCK_LOCK_PARTS * (<NLocks> / LWLOCK_PART_LOCKS)
+  * partition, not only LWLOCK_LOCK_PARTS partitions. The shared-partition
+  * array can thus be thought of as an two-dimensional array, indexed by
+  * (BackendId % LWLOCK_LOCK_PARTS) and (lockid / LWLOCK_PART_LOCKS).
+  *
+  * An int is quite excessive here, but making it any smaller would require
+  * an overflow check in the shared lock acquisition fast path. That in turn
+  * would require us to use a compare-and-exchange instead of an atomic
+  * increment there. So overall, using an int seems benefical, even if it
+  * means we'll waste a bit of processor cache.
+  */
+ typedef struct LWLockPart
+ {
+ #ifndef HAVE_ATOMIC_FETCH_ADD_FENCE_32
+ 	slock_t		mutex;
+ #endif
+ 	int			shared[LWLOCK_PART_LOCKS];
+ } LWLockPart;
+ 
+ NON_EXEC_STATIC LWLockPart *LWLockPartArray = NULL;
+ 
+ /*
+  * The resulting layout of LWLockPartArray is
+  * 
+  * <Lcks 0                 .. LWLOCK_PART_LOCKS-1,   Part 0>
+  * ...
+  * <Lcks 0                 .. LWLOCK_PART_LOCKS-1,   Part LWLOCK_LOCK_PARTS-1>
+  * <Lcks LWLOCK_PART_LOCKS .. 2*LWLOCK_PART_LOCKS-1, Part 0>
+  * ...
+  * <Lcks LWLOCK_PART_LOCKS .. 2*LWLOCK_PART_LOCKS-1, Part LWLOCK_LOCK_PARTS-1>
+  * ...
+  *
+  * where each line <Locks N .. M, Part P> contains the shared counter
+  * partition P for locks N to M, i.e. (N-M+1) individual counter.
+  */
+ #define LWLOCK_PART(lock, lockid, backendid) ( \
+ 	LWLockPartArray + ( \
+ 		((lockid) / LWLOCK_PART_LOCKS) * LWLOCK_LOCK_PARTS + \
+ 		((backendid) % LWLOCK_LOCK_PARTS) \
+ 	)) \
+ 
+ #ifdef HAVE_ATOMIC_FETCH_ADD_FENCE_32
+ 
+ #define LWLOCK_PART_SHARED_OPS_ATOMIC
+ 
+ #define LWLOCK_PART_SHARED_POSTINC_ATOMIC(lock, lockid, part, partid) \
+ 	AtomicFetchAddFence32((part)->shared + \
+ 	                      (lockid % LWLOCK_PART_LOCKS), \
+ 	                      1)
+ 
+ #define LWLOCK_PART_SHARED_POSTDEC_ATOMIC(lock, lockid, part, partid) \
+ 	AtomicFetchAddFence32((part)->shared + \
+ 	                      (lockid % LWLOCK_PART_LOCKS), \
+ 	                      -1)
+ 
+ /* Taken care of by AtomicFetchAddFence32() */
+ #define LWLOCK_PART_SHARED_FENCE()
+ 
+ #else /* !HAVE_ATOMIC_FETCH_ADD_FENCE_32 */
+ 
+ #define LWLOCK_PART_SHARED_OPS_ATOMIC
+ 
+ #define LWLOCK_PART_SHARED_POSTINC_ATOMIC(lock, lockid, part, partid) \
+ 	xadd_atomic(&((part)->mutex), \
+ 	            (part)->shared + (lockid % LWLOCK_PART_LOCKS), \
+ 	            1)
+ 
+ #define LWLOCK_PART_SHARED_POSTDEC_ATOMIC(lock, lockid, part, partid) \
+ 	xadd_atomic(&((part)->mutex), \
+ 	            (part)->shared + (lockid % LWLOCK_PART_LOCKS), \
+ 	            -1)
+ 	
+ /* *Not* taken care of by xadd_atomic() */
+ #define LWLOCK_PART_SHARED_FENCE() fence()
+ 
+ inline static int xadd_atomic(
+ 	volatile slock_t* mutex,
+ 	volatile int* ptr,
+ 	int delta
+ ) {
+ 	int 		previous;
+ 	
+ 	/* Atomic add, returning the old value */
+ 	SpinLockAcquire(mutex);
+ 	previous = *ptr;
+ 	*ptr += delta;
+ 	SpinLockRelease(mutex);
+ 	
+ 	return previous;
+ }
+ 
+ inline static void fence()
+ {
+ 	slock_t		fence;
+ 
+ 	/* For the lack of a better method ... */
+ 	SpinLockInit(&fence);
+ 	SpinLockAcquire(&fence);
+ 	SpinLockRelease(&fence);
+ }
+ 
+ #endif /* HAVE_ATOMIC_FETCH_ADD_FENCE_32 */
+ 
+ #define LWLOCK_IS_SHARED(result, lock, lockid) do { \
+ 		int i; \
+ 		result = false; \
+ 		for(i=0; i < LWLOCK_LOCK_PARTS; ++i) { \
+ 			volatile LWLockPart *p = LWLOCK_PART(lock, lockid, i); \
+ 			if (p->shared[lockid % LWLOCK_PART_LOCKS] == 0) \
+ 				continue; \
+ 			result = true; \
+ 			break; \
+ 		} \
+ 	} while(0)
+ 
+ 
+ #else /* LWLOCK_LOCK_PARTS == 1 */
+ 
+ /*
+  * Unpartitioned shared counter case
+  */
+ 
+ #ifdef HAVE_ATOMIC_FETCH_ADD_FENCE_32
+ 
+ #define LWLOCK_PART_SHARED_OPS_ATOMIC
+ 
+ #define LWLOCK_PART_SHARED_POSTINC_ATOMIC(lock, lockid, part, partid) \
+ 	AtomicFetchAddFence32(&(lock)->shared, \
+ 	                      1)
+ 
+ #define LWLOCK_PART_SHARED_POSTDEC_ATOMIC(lock, lockid, part, partid) \
+ 	AtomicFetchAddFence32(&(lock)->shared, \
+ 	                      -1)
+ 
+ /* Taken care of by AtomicFetchAddFence32() */
+ #define LWLOCK_PART_SHARED_FENCE()
+ 
+ #else /* !HAVE_ATOMIC_FETCH_ADD_FENCE_32 */
+ 
+ /* We could of course provide that if we used the lock's mutex to make
+  * the action atomic. But then we'd take the lock's mutex twice during
+  * LWLock Acquisition. So instead, we special-case the situation of
+  * no atomic xadd and one shared counter partition in the implementations
+  * of lock acquisition and release
+  */
+ #undef LWLOCK_PART_SHARED_OPS_ATOMIC
+ 
+ #endif /* HAVE_ATOMIC_FETCH_ADD_FENCE_32 */
+ 
+ #define LWLOCK_IS_SHARED(result, lock, lockid) \
+ 		result = ((lock)->shared != 0)
+ 
+ #endif /* LWLOCK_LOCK_PARTS == 1 */
+ 
+ /*
   * We use this structure to keep track of locked LWLocks for release
   * during error recovery.  The maximum size could be determined at runtime
   * if necessary, but it seems unlikely that more than a few locks could
*************** NON_EXEC_STATIC LWLockPadded *LWLockArra
*** 86,91 ****
--- 279,285 ----
  
  static int	num_held_lwlocks = 0;
  static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
+ static LWLockMode held_lwlocks_mode[MAX_SIMUL_LWLOCKS];
  
  static int	lock_addin_request = 0;
  static bool lock_addin_request_allowed = true;
*************** LWLockShmemSize(void)
*** 225,233 ****
  
  	/* Space for the LWLock array. */
  	size = mul_size(numLocks, sizeof(LWLockPadded));
! 
! 	/* Space for dynamic allocation counter, plus room for alignment. */
  	size = add_size(size, 2 * sizeof(int) + LWLOCK_PADDED_SIZE);
  
  	return size;
  }
--- 419,438 ----
  
  	/* Space for the LWLock array. */
  	size = mul_size(numLocks, sizeof(LWLockPadded));
! 	
! 	/* Space for dynamic allocation counter,
! 	 * plus room for alignment of LWLockArray.
! 	 */
  	size = add_size(size, 2 * sizeof(int) + LWLOCK_PADDED_SIZE);
+ 	
+ #if LWLOCK_PART_SIZE > 1
+ 	/* Space for the LWLockPart array */
+ 	size = add_size(size, mul_size(LWLOCK_PARTS(numLocks),
+ 	                               sizeof(LWLockPart)));
+ 	
+ 	/* Room for alignment of LWLockPartArray */
+ 	size = add_size(size, sizeof(LWLockPart));
+ #endif
  
  	return size;
  }
*************** CreateLWLocks(void)
*** 242,251 ****
--- 447,462 ----
  	int			numLocks = NumLWLocks();
  	Size		spaceLocks = LWLockShmemSize();
  	LWLockPadded *lock;
+ #if LWLOCK_PART_SIZE > 1
+ 	LWLockPart *part;
+ #endif
  	int		   *LWLockCounter;
  	char	   *ptr;
  	int			id;
  
+ 	/* Ensure that we didn't mess up the computation of LWLOCK_PART_LOCKS */
+ 	Assert(sizeof(LWLockPart) == LWLOCK_PART_SIZE);
+ 
  	/* Allocate space */
  	ptr = (char *) ShmemAlloc(spaceLocks);
  
*************** CreateLWLocks(void)
*** 256,275 ****
--- 467,507 ----
  	ptr += LWLOCK_PADDED_SIZE - ((uintptr_t) ptr) % LWLOCK_PADDED_SIZE;
  
  	LWLockArray = (LWLockPadded *) ptr;
+ 	ptr += sizeof(LWLockPadded) * numLocks;
+ 	
+ #if LWLOCK_LOCK_PARTS > 1
+ 	/* Ensure desired alignment of LWLockPart array */
+ 	ptr += LWLOCK_PART_SIZE - ((uintptr_t) ptr) % LWLOCK_PART_SIZE;
+ 	
+ 	LWLockPartArray = (LWLockPart *) ptr;
+ 	ptr += sizeof(LWLockPart) * LWLOCK_PARTS(numLocks);
+ #endif
  
  	/*
  	 * Initialize all LWLocks to "unlocked" state
  	 */
+ 
  	for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
  	{
  		SpinLockInit(&lock->lock.mutex);
  		lock->lock.releaseOK = true;
  		lock->lock.exclusive = 0;
+ #if LWLOCK_LOCK_PARTS == 1
  		lock->lock.shared = 0;
+ #endif
  		lock->lock.head = NULL;
  		lock->lock.tail = NULL;
  	}
  
+ #if LWLOCK_LOCK_PARTS > 1
+ 	for(id = 0, part = LWLockPartArray; id < LWLOCK_PARTS(numLocks); id++, part++) {
+ #ifndef LWLOCK_PART_SHARED_OPS_ATOMIC
+ 		SpinLockInit(&part->mutex);
+ #endif
+ 		memset((char *) part->shared, 0, sizeof(int) * LWLOCK_PART_LOCKS);
+ 	}
+ #endif
+ 
  	/*
  	 * Initialize the dynamic-allocation counter, which is stored just before
  	 * the first LWLock.
*************** void
*** 320,325 ****
--- 552,560 ----
  LWLockAcquire(LWLockId lockid, LWLockMode mode)
  {
  	volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ #if LWLOCK_LOCK_PARTS > 1
+ 	volatile LWLockPart *part = LWLOCK_PART(lock, lockid, MyBackendId);
+ #endif
  	PGPROC	   *proc = MyProc;
  	bool		retry = false;
  	int			extraWaits = 0;
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 356,362 ****
  	/* Ensure we will have room to remember the lock */
  	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
  		elog(ERROR, "too many LWLocks taken");
! 
  	/*
  	 * Lock out cancel/die interrupts until we exit the code section protected
  	 * by the LWLock.  This ensures that interrupts will not interfere with
--- 591,597 ----
  	/* Ensure we will have room to remember the lock */
  	if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
  		elog(ERROR, "too many LWLocks taken");
! 		
  	/*
  	 * Lock out cancel/die interrupts until we exit the code section protected
  	 * by the LWLock.  This ensures that interrupts will not interfere with
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 383,423 ****
  	for (;;)
  	{
  		bool		mustwait;
! 
! 		/* Acquire mutex.  Time spent holding mutex should be short! */
! 		SpinLockAcquire(&lock->mutex);
! 
! 		/* If retrying, allow LWLockRelease to release waiters again */
! 		if (retry)
! 			lock->releaseOK = true;
! 
! 		/* If I can get the lock, do so quickly. */
! 		if (mode == LW_EXCLUSIVE)
  		{
! 			if (lock->exclusive == 0 && lock->shared == 0)
  			{
! 				lock->exclusive++;
  				mustwait = false;
  			}
  			else
  				mustwait = true;
  		}
  		else
  		{
  			if (lock->exclusive == 0)
  			{
! 				lock->shared++;
! 				mustwait = false;
  			}
  			else
  				mustwait = true;
  		}
! 
  		if (!mustwait)
  			break;				/* got the lock */
! 
  		/*
! 		 * Add myself to wait queue.
  		 *
  		 * If we don't have a PGPROC structure, there's no way to wait. This
  		 * should never occur, since MyProc should only be null during shared
--- 618,759 ----
  	for (;;)
  	{
  		bool		mustwait;
! 		
! 		if (mode == LW_SHARED)
  		{
! #ifdef		LWLOCK_PART_SHARED_OPS_ATOMIC
! 			/* Increment shared counter partition. If there's no contention,
! 			 * this is sufficient to take the lock
! 			 */
! 			LWLOCK_PART_SHARED_POSTINC_ATOMIC(lock, lockid, part, MyBackendId);
! 			LWLOCK_PART_SHARED_FENCE();
! 			
! 			/* A concurrent exclusive locking attempt does the following
! 			 * three steps
! 			 *   1) Acquire mutex
! 			 *   2) Check shared counter partitions for readers.
! 			 *   3a) If found add proc to wait queue, block, restart at (1)
! 			 *   3b) If not found, set exclusive flag, continue with (4)
! 			 *   4) Enter protected section
! 			 * The fence after the atomic add above ensures that no further
! 			 * such attempt can proceed to (3b) or beyond. There may be
! 			 * pre-existing exclusive locking attempts at step (3b) or beyond,
! 			 * but we can recognize those by either the mutex being taken, or
! 			 * the exclusive flag being set. Conversely, if we see neither, we
! 			 * may proceed and enter the protected section.
! 			 *
! 			 * FIXME: This doesn't work if slock_t is a struct or doesn't
! 			 * use 0 for state "unlocked".
! 			 */
! 
! 			if ((lock->mutex == 0) && (lock->exclusive == 0)) {
! 				/* If retrying, allow LWLockRelease to release waiters again.
! 				 * Usually this happens after we acquired the mutex, but if
! 				 * we skip that, we still need to set releaseOK.
! 				 *
! 				 * Acquiring the mutex here is not really an option - if many
! 				 * reader are awoken simultaneously by an exclusive unlock,
! 				 * that would be a source of considerable contention.
! 				 *
! 				 * Fotunately, this is safe even without the mutex. First,
! 				 * there actually cannot be any non-fast path unlocking
! 				 * attempt in progress, because we'd then either still see
! 				 * the exclusive flag set or the mutex being taken. And
! 				 * even if there was, and such an attempt cleared the flag
! 				 * immediately after we set it, it'd also wake up some waiter
! 				 * who'd then re-set the flag.
! 				 *
! 				 * The only reason to do this here, and not directly
! 				 * after returning from PGSemaphoreLock(), is that it seems
! 				 * benefical to make SpinLockAcquire() the first thing to
! 				 * touch the lock if possible, in case we acquire the spin
! 				 * lock at all. That way, the cache line doesn't go through
! 				 * a possible shared state, but instead directly to exclusive.
! 				 * On Opterons at least, there seems to be a difference, c.f.
! 				 * the comment above tas() for x86_64 in s_lock.h
! 				 */
! 				if (retry && !lock->releaseOK)
! 					lock->releaseOK = true;
! 					
! 				goto lock_acquired;
! 			}
! 				
! 			/* At this point, we don't know if the concurrent exclusive locker
! 			 * has proceeded to (3b) or blocked. We must take the mutex and
! 			 * re-check
! 			 */
! #endif /* LWLOCK_PART_SHARED_OPS_ATOMIC */
! 			
! 			/* Acquire mutex.  Time spent holding mutex should be short! */
! 			SpinLockAcquire(&lock->mutex);
! 			
! 			if (lock->exclusive == 0)
  			{
! #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
! 				/* Already incremented the shared counter partition above */
! #else
! 				lock->shared++;
! #endif
  				mustwait = false;
  			}
  			else
+ 			{
+ #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
+ 				/* Must undo shared counter partition increment. Note that
+ 				 * we *need* to do that while holding the mutex. Otherwise,
+ 				 * the exclusive lock could be released and attempted to be
+ 				 * re-acquired before we undo the increment. That attempt
+ 				 * would then block, even though there'd be no lock holder
+ 				 * left
+ 				 */
+ 				LWLOCK_PART_SHARED_POSTDEC_ATOMIC(lock, lockid, part, MyBackendId);
+ #endif
  				mustwait = true;
+ 			}
  		}
  		else
  		{
+ 			/* Step (1). Acquire mutex. Time spent holding mutex should be
+ 			 *                          short!
+ 			 */
+ 			SpinLockAcquire(&lock->mutex);
+ 			
  			if (lock->exclusive == 0)
  			{
! 				/* Step (2). Check for shared lockers. This surely happens
! 				 * after (1), otherwise SpinLockAcquire() is broken. Lock
! 				 * acquire semantics demand that no load must be re-ordered
! 				 * from after a lock acquisition to before, for obvious
! 				 * reasons.
! 				 */
! 
! 				LWLOCK_IS_SHARED(mustwait, lock, lockid);
! 				
! 				if (!mustwait) {
! 					/* Step (3a). Set exclusive flag. This surely happens
! 					 * after (2) because it depends on the result of (2),
! 					 * no matter how much reordering is going on here.
! 					 */
! 					lock->exclusive++;
! 				}
  			}
  			else
  				mustwait = true;
  		}
! 		
! 		/* If retrying, allow LWLockRelease to release waiters again.
! 		 * This is also separately done in the LW_SHARED early exit case
! 		 * above, and in contrast to there we don't hold the mutex there.
! 		 * See the comment there for why this is safe
! 		 */
! 		if (retry)
! 			lock->releaseOK = true;
! 		
  		if (!mustwait)
  			break;				/* got the lock */
! 			
  		/*
! 		 * Step (3b). Add myself to wait queue.
  		 *
  		 * If we don't have a PGPROC structure, there's no way to wait. This
  		 * should never occur, since MyProc should only be null during shared
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 477,487 ****
  
  	/* We are done updating shared state of the lock itself. */
  	SpinLockRelease(&lock->mutex);
  
  	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(lockid, mode);
  
  	/* Add lock to list of locks held by this backend */
! 	held_lwlocks[num_held_lwlocks++] = lockid;
  
  	/*
  	 * Fix the process wait semaphore's count for any absorbed wakeups.
--- 813,835 ----
  
  	/* We are done updating shared state of the lock itself. */
  	SpinLockRelease(&lock->mutex);
+ 	
+ 	/* Step 4. Enter protected section. This surely happens after (3),
+ 	 * this time because lock release semantics demand that no store
+ 	 * must be moved from before a lock release to after the release,
+ 	 * again for obvious reasons
+ 	 */
+ 
+ #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
+ lock_acquired:
+ #endif
  
  	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(lockid, mode);
  
  	/* Add lock to list of locks held by this backend */
! 	held_lwlocks[num_held_lwlocks] = lockid;
! 	held_lwlocks_mode[num_held_lwlocks] = mode;
! 	++num_held_lwlocks;
  
  	/*
  	 * Fix the process wait semaphore's count for any absorbed wakeups.
*************** bool
*** 501,506 ****
--- 849,857 ----
  LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
  {
  	volatile LWLock *lock = &(LWLockArray[lockid].lock);
+ #if LWLOCK_LOCK_PARTS > 1
+ 	volatile LWLockPart *part = LWLOCK_PART(lock, lockid, MyBackendId);
+ #endif
  	bool		mustwait;
  
  	PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock);
*************** LWLockConditionalAcquire(LWLockId lockid
*** 516,541 ****
  	 */
  	HOLD_INTERRUPTS();
  
! 	/* Acquire mutex.  Time spent holding mutex should be short! */
! 	SpinLockAcquire(&lock->mutex);
! 
! 	/* If I can get the lock, do so quickly. */
! 	if (mode == LW_EXCLUSIVE)
  	{
! 		if (lock->exclusive == 0 && lock->shared == 0)
  		{
! 			lock->exclusive++;
  			mustwait = false;
  		}
  		else
  			mustwait = true;
  	}
  	else
  	{
  		if (lock->exclusive == 0)
  		{
! 			lock->shared++;
! 			mustwait = false;
  		}
  		else
  			mustwait = true;
--- 867,960 ----
  	 */
  	HOLD_INTERRUPTS();
  
! 	if (mode == LW_SHARED)
  	{
! #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
! 		/* Increment shared counter partition. If there's no contention,
! 		 * this is sufficient to take the lock
! 		 */
! 		LWLOCK_PART_SHARED_POSTINC_ATOMIC(lock, lockid, part, MyBackendId);
! 		LWLOCK_PART_SHARED_FENCE();
! 
! 		/* A concurrent exclusive locking attempt does the following
! 		 * three steps
! 		 *   1) Acquire mutex
! 		 *   2) Check shared counter partitions for readers.
! 		 *   3a) If found add proc to wait queue, block, restart at (1)
! 		 *   3b) If not found, set exclusive flag, continue with (4)
! 		 *   4) Enter protected section
! 		 * The fence after the atomic add above ensures that no further
! 		 * such attempt can proceed to (3b) or beyond. There may be
! 		 * pre-existing exclusive locking attempts at step (3b) or beyond,
! 		 * but we can recognize those by either the mutex being taken, or
! 		 * the exclusive flag being set. Conversely, if we see neither, we
! 		 * may proceed and enter the protected section.
! 		 *
! 		 * FIXME: This doesn't work if slock_t is a struct or doesn't
! 		 * use 0 for state "unlocked".
! 		 */
! 
! 		if ((lock->mutex == 0) && (lock->exclusive == 0))
! 			goto lock_acquired;
! 
! 		/* At this point, we don't know if the concurrent exclusive locker
! 		 * has proceeded to (3b) or blocked. We must take the mutex and
! 		 * re-check
! 		 */
! #endif /* LWLOCK_PART_SHARED_OPS_ATOMIC */
! 
! 		/* Acquire mutex.  Time spent holding mutex should be short! */
! 		SpinLockAcquire(&lock->mutex);
! 
! 		if (lock->exclusive == 0)
  		{
! #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
! 			/* Already incremented the shared counter partition above */
! #else
! 			lock->shared++;
! #endif
  			mustwait = false;
  		}
  		else
+ 		{
+ #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
+ 			/* Must undo shared counter partition increment. Note that
+ 			 * we *need* to do that while holding the mutex. Otherwise,
+ 			 * the exclusive lock could be released and attempted to be
+ 			 * re-acquired before we undo the increment. That attempt
+ 			 * would then block, even though there'd be no lock holder
+ 			 * left
+ 			 */
+ 			LWLOCK_PART_SHARED_POSTDEC_ATOMIC(lock, lockid, part, MyBackendId);
+ #endif
  			mustwait = true;
+ 		}
  	}
  	else
  	{
+ 		/* Step (1). Acquire mutex. Time spent holding mutex should be
+ 		 *                          short!
+ 		 */
+ 		SpinLockAcquire(&lock->mutex);
+ 
  		if (lock->exclusive == 0)
  		{
! 			/* Step (2). Check for shared lockers. This surely happens
! 			 * after (1), otherwise SpinLockAcquire() is broken. Lock
! 			 * acquire semantics demand that no load must be re-ordered
! 			 * from after a lock acquisition to before, for obvious
! 			 * reasons.
! 			 */
! 
! 			LWLOCK_IS_SHARED(mustwait, lock, lockid);
! 
! 			if (!mustwait) {
! 				/* Step (3a). Set exclusive flag. This surely happens
! 				 * after (2) because it depends on the result of (2),
! 				 * no matter how much reordering is going on here.
! 				 */
! 				lock->exclusive++;
! 			}
  		}
  		else
  			mustwait = true;
*************** LWLockConditionalAcquire(LWLockId lockid
*** 550,564 ****
  		RESUME_INTERRUPTS();
  		LOG_LWDEBUG("LWLockConditionalAcquire", lockid, "failed");
  		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(lockid, mode);
! 	}
! 	else
! 	{
! 		/* Add lock to list of locks held by this backend */
! 		held_lwlocks[num_held_lwlocks++] = lockid;
! 		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(lockid, mode);
  	}
  
! 	return !mustwait;
  }
  
  /*
--- 969,990 ----
  		RESUME_INTERRUPTS();
  		LOG_LWDEBUG("LWLockConditionalAcquire", lockid, "failed");
  		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(lockid, mode);
! 		
! 		return false;
  	}
  
! #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
! lock_acquired:
! #endif
! 
! 	TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(lockid, mode);
! 
! 	/* Add lock to list of locks held by this backend */
! 	held_lwlocks[num_held_lwlocks] = lockid;
! 	held_lwlocks_mode[num_held_lwlocks] = mode;
! 	++num_held_lwlocks;
! 
! 	return true;
  }
  
  /*
*************** void
*** 568,575 ****
  LWLockRelease(LWLockId lockid)
  {
  	volatile LWLock *lock = &(LWLockArray[lockid].lock);
! 	PGPROC	   *head;
  	PGPROC	   *proc;
  	int			i;
  
  	PRINT_LWDEBUG("LWLockRelease", lockid, lock);
--- 994,1005 ----
  LWLockRelease(LWLockId lockid)
  {
  	volatile LWLock *lock = &(LWLockArray[lockid].lock);
! #if LWLOCK_LOCK_PARTS > 1
! 	volatile LWLockPart *part = LWLOCK_PART(lock, lockid, MyBackendId);
! #endif
! 	PGPROC	   *head = NULL;
  	PGPROC	   *proc;
+ 	LWLockMode	mode;
  	int			i;
  
  	PRINT_LWDEBUG("LWLockRelease", lockid, lock);
*************** LWLockRelease(LWLockId lockid)
*** 578,583 ****
--- 1008,1014 ----
  	 * Remove lock from list of locks held.  Usually, but not always, it will
  	 * be the latest-acquired lock; so search array backwards.
  	 */
+ 	 
  	for (i = num_held_lwlocks; --i >= 0;)
  	{
  		if (lockid == held_lwlocks[i])
*************** LWLockRelease(LWLockId lockid)
*** 585,604 ****
  	}
  	if (i < 0)
  		elog(ERROR, "lock %d is not held", (int) lockid);
  	num_held_lwlocks--;
! 	for (; i < num_held_lwlocks; i++)
  		held_lwlocks[i] = held_lwlocks[i + 1];
  
! 	/* Acquire mutex.  Time spent holding mutex should be short! */
! 	SpinLockAcquire(&lock->mutex);
! 
! 	/* Release my hold on lock */
! 	if (lock->exclusive > 0)
! 		lock->exclusive--;
! 	else
! 	{
! 		Assert(lock->shared > 0);
  		lock->shared--;
  	}
  
  	/*
--- 1016,1105 ----
  	}
  	if (i < 0)
  		elog(ERROR, "lock %d is not held", (int) lockid);
+ 	
+ 	mode = held_lwlocks_mode[i];
+ 	
  	num_held_lwlocks--;
! 	for (; i < num_held_lwlocks; i++) {
  		held_lwlocks[i] = held_lwlocks[i + 1];
+ 		held_lwlocks_mode[i] = held_lwlocks_mode[i + 1];
+ 	}
+ 		
+ 	if (mode == LW_SHARED) {
+ #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
+ 		int shared_pre;
  
! 		/* Release my hold on lock */
! 		Assert(lock->exclusive == 0);
! 		shared_pre = LWLOCK_PART_SHARED_POSTDEC_ATOMIC(lock, lockid, part, MyBackendId);
! 		Assert(shared_pre > 0);
! 		
! 		/* If the count didn't drop to zero (i.e., there are more lockers
! 		 * using the same shared counter partition), we can leave waiting
! 		 * up blocked exclusive locking attempts to them. Note that there
! 		 * may also be shared lockers using a *different* partition, so
! 		 * we're not necessarily the last share lockers, even if we continue.
! 		 * Still, it's an easy optimization, so we got for it
! 		 */
! 		if (shared_pre > 1)
! 			goto lock_released;
! 			
! 		LWLOCK_PART_SHARED_FENCE();
! 		
! 		/* A concurrent exclusive locking attempt does the following
! 		 * three steps
! 		 *   1) Acquire mutex
! 		 *   2) Check shared counter partitions for readers.
! 		 *   3a) If found add proc to wait queue, block, restart at (1)
! 		 *   3b) If not found, set exclusive flag, continue with (4)
! 		 *   4) Enter protected section
! 		 * Assume now that we're that last share lock holder. Then, the
! 		 * fence after the atomic add above ensures that no further such
! 		 * concurrent exclusive locking attempts will proceed to (3a) and
! 		 * thus block. There may be such attempts currently blocking or
! 		 * about to block, but we can recognize those by either wait queue
! 		 * being non-empty or the mutex being taken. Conversely, if we see
! 		 * neither, we may assume that nobody needs to be signalled.
! 		 *
! 		 * Note that if two shared lockers release their lock while
! 		 * an exclusive locking attempt is in progress, both may decide
! 		 * they need to signal here. Taking the mutex below will sort that
! 		 * out, but it's a bit unfortunate that they have to race for the
! 		 * mutex here. Also, taking the mutex will force *other* shared
! 		 * lockers to take the mutex also in their release path.
! 		 * XXX: We may be able to improve that if we could dinstinguish
! 		 * been mutexed held for the purpose of unlocking and mutexes
! 		 * held for the purpose of locking.
! 		 *
! 		 * FIXME: This doesn't work if slock_t is a struct.
! 		 */
! 		
! 		if ((lock->mutex == 0) && (lock->head == NULL))
! 			goto lock_released;
! 			
! 		/* At this point, we don't know if the concurrent exclusive locker
! 		 * has seen on-zero in our shared counter partition in his step
! 		 * (2) or not. We must thus take the mutex and re-check.
! 		 */
! #endif /* LWLOCK_PART_SHARED_OPS_ATOMIC */
! 		
! 		/* Acquire mutex.  Time spent holding mutex should be short! */
! 		SpinLockAcquire(&lock->mutex);
! 		
! #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
! 		/* Already decremented the shared counter partition above */
! #else
! 		/* Release my hold on lock */
  		lock->shared--;
+ #endif
+ 	}
+ 	else {
+ 		/* Acquire mutex.  Time spent holding mutex should be short! */
+ 		SpinLockAcquire(&lock->mutex);
+ 		
+ 		/* Release my hold on lock */
+ 		lock->exclusive--;
+ 		Assert(lock->exclusive == 0);
  	}
  
  	/*
*************** LWLockRelease(LWLockId lockid)
*** 610,616 ****
  	head = lock->head;
  	if (head != NULL)
  	{
! 		if (lock->exclusive == 0 && lock->shared == 0 && lock->releaseOK)
  		{
  			/*
  			 * Remove the to-be-awakened PGPROCs from the queue.  If the front
--- 1111,1124 ----
  	head = lock->head;
  	if (head != NULL)
  	{
! 		bool		is_shared;
! 		
! 		if (mode == LW_SHARED)
! 			LWLOCK_IS_SHARED(is_shared, lock, lockid);
! 		else
! 			is_shared = false;
! 			
! 		if (lock->exclusive == 0 && !is_shared && lock->releaseOK)
  		{
  			/*
  			 * Remove the to-be-awakened PGPROCs from the queue.  If the front
*************** LWLockRelease(LWLockId lockid)
*** 640,645 ****
--- 1148,1157 ----
  	/* We are done updating shared state of the lock itself. */
  	SpinLockRelease(&lock->mutex);
  
+ #ifdef LWLOCK_PART_SHARED_OPS_ATOMIC
+ lock_released:
+ #endif
+ 
  	TRACE_POSTGRESQL_LWLOCK_RELEASE(lockid);
  
  	/*
diff --git a/src/include/storage/atomic.h b/src/include/storage/atomic.h
index ...c9216e3 .
*** a/src/include/storage/atomic.h
--- b/src/include/storage/atomic.h
***************
*** 0 ****
--- 1,18 ----
+ #ifndef ATOMIC_H
+ #define ATOMIC_H
+ 
+ #if ( \
+ 	(defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__ >= 401 )) || \
+ 	defined(__INTEL_COMPILER) \
+ ) \
+ 
+ #define HAVE_ATOMIC_FETCH_ADD_FENCE_32
+ 
+ static __inline__ uint16 AtomicFetchAddFence32(volatile int* ptr, int delta)
+ {
+ 	return __sync_fetch_and_add(ptr, delta);
+ }
+ 
+ #endif /* GCC >= 4.1 or ICC */
+ 
+ #endif /* ATOMIC_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 438a48d..2feed97 100644
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 20,25 ****
--- 20,28 ----
   * this file include lock.h or bufmgr.h would be backwards.
   */
  
+ /* Number of shared counter partitions per LWLock */
+ #define LWLOCK_LOCK_PARTS 4
+ 
  /* Number of partitions of the shared buffer mapping hashtable */
  #define NUM_BUFFER_PARTITIONS  16

#23

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Florian Pflug (#22)

Re: spinlock contention

Florian Pflug <fgp@phlo.org> writes:

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of "locked xadd" to increment the shared counters.

That's already sufficient reason to reject the patch. Not everyone
uses gcc, let alone very recent versions of gcc.

regards, tom lane

#24

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Tom Lane (#23)

Re: spinlock contention

On Jul8, 2011, at 16:21 , Tom Lane wrote:

Florian Pflug <fgp@phlo.org> writes:

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of "locked xadd" to increment the shared counters.

That's already sufficient reason to reject the patch. Not everyone
uses gcc, let alone very recent versions of gcc.

This is a WIP version meant for testing, not a finish patch!

Spending time on making this work on every conceivable compiler before we
even know whether or not the approach is worthwhile at all seems ludicrous
to me.

A finished version would use inline assembly to avoid the GCC version
dependency, and would support as many additional compilers as there are
people with access to these compilers who offer to help...

But yeah, that will very probably leave some compilers unsupported
(in the "fall back to spin lock per partition sense. Which, if the patch
proves worthwhile at all, probably still provides a benefit over the current
code).

If that is reason enough to reject the patch, i.e. if the policy is "we
don't want it for any if we cannot have it for all", then consider it
withdrawn.

best regards,
Florian Pflug

#25

Stefan Kaltenbrunner

stefan@kaltenbrunner.cc

over 14 years ago

In reply to: Tom Lane (#23)

Re: spinlock contention

On 07/08/2011 04:21 PM, Tom Lane wrote:

Florian Pflug <fgp@phlo.org> writes:

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of "locked xadd" to increment the shared counters.

That's already sufficient reason to reject the patch. Not everyone
uses gcc, let alone very recent versions of gcc.

hmm - 4.1.0 was released in february 2006, which will be much older than
even the 5 year support policy we have on pg when 9.2 will be released,
not sure how much it will matter if we don't support as specific
optimization on a gcc that old..

Stefan

#26

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Stefan Kaltenbrunner (#25)

Re: spinlock contention

On Jul8, 2011, at 22:27 , Stefan Kaltenbrunner wrote:

On 07/08/2011 04:21 PM, Tom Lane wrote:

Florian Pflug <fgp@phlo.org> writes:

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of "locked xadd" to increment the shared counters.

That's already sufficient reason to reject the patch. Not everyone
uses gcc, let alone very recent versions of gcc.

hmm - 4.1.0 was released in february 2006, which will be much older than
even the 5 year support policy we have on pg when 9.2 will be released,
not sure how much it will matter if we don't support as specific
optimization on a gcc that old..

Still, it's not really hard to support older Versions, at least on
x86 and x86-64. All it takes is some inline assembly. I just don't
want to put effort into this until we know whether or not the whole
approach is worthwhile or not.

Should someone want to test this patch, but can't because of the GCC
version restriction, please speak up.

best regards,
Florian Pflug

#27

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#19)

Re: spinlock contention

On Jul7, 2011, at 03:35 , Robert Haas wrote:

Some poking around suggests that the problem isn't that
spinlocks are routinely contended - it seems that we nearly always get
the spinlock right off the bat. I'm wondering if the problem may be
not so much that we have continuous spinlock contention, but rather
than every once in a while a process gets time-sliced out while it
holds a spinlock. If it's an important spinlock (like the one
protecting SInvalReadLock), the system will quickly evolve into a
state where every single processor is doing nothing but trying to get
that spinlock. Even after the hapless lock-holder gets to run again
and lets go of the lock, you have a whole pile of other backends who
are sitting there firing of lock xchgb in a tight loop, and they can
only get it one at a time, so you have ferocious cache line contention
until the backlog clears.

Pondering this some more, I came up with another idea, pretty much
orthogonal to the shared counter partitioning I posted a patch for.

If indeed that problem isn't spin lock contention, but rather losing
control of the CPU while holding a spin lock, then why not try to get
rid of the spin lock entirely? On Linux, that's easy - this is exactly
what futexes are about. But it occurred to me that kernel support isn't
actually needed to do that - futexes can effectively be emulated in
user-space using just a semaphore, and without doing a syscall except
when there's contention.

The algorithm is quite straight forward, if one assumes a lock-free
implementation of a queue (More on that below)

LockAcquire:
(1) CAS the lock state to increment the reader count or set
the exclusive bit in a loop while the lock looks free.
If successful, we're done, otherwise we continue with (2)
(2) Add ourself to the wait queue
[ Since we've added ourself to the queue, we *must* now
decrement the semaphore no matter what, to keep the
increment/decrement calls balanced. We're careful to
maintain that invariant below. ]
(3) Fence (AKA issue full memory barrier)
(4) Re-check if the lock still looks taken. If it does,
we decrement the semaphore (PGSemaphoreLock), and
(upon wake-up) restart at (1).
Otherwise, continue with (5)
(5) Remove the first waiter from the queue and increment
her semaphore. Rinse-and-repeat until we either removed
ourself from the queue or the queue is empty.
(6) Decrement our semaphore.
[ (6) is necessary in the general case, see the remark
below (2). But we can of course detect the case were
we'd increment our own semaphore in (5) only to
decrement it again in (6), and skip both operations ]

LockRelease:
(1) Set the lock state to 0, i.e. release the lock.
(2) Fence (AKA issue full memory barrier)
(3) If the lock still looks free, remove the first waiter from
the queue and increment her semaphore. Rinse-and-repeat
while the lock looks free and the queue is non-empty.
[ From a correctness POV, we only have to wake up one
waiter here, and that only if there isn't one who was been
woken up but hasn't yet retried to take the lock. In reality,
the decision if and how many waiter to wake up would depend
on their desired lock level, and some variant of what we
currently call releaseOK. ]

The fencing steps (LockAcquire:3) and (LockRelease:2) guarantee
that if we block in LockAcquire() a lock holder will see our
queue entry and thus will wake us up eventually.

Because we use a semaphore and not, say, a simple signal, we don't
have to worry about the precise ordering of block and unblock
operations - we just need to ensure they're balanced.

Now to to enqueue() and dequeue() primitives that the algorithm
above depends on. There are multiple published algorithms
for lock-free queues. Some googling turned up

"An optimistic approach to lock-free FIFO queues",
E.L. Mozes, N. Shavit,
DOI 10.1007/s00446-007-0050-0

and

"CAS-Based Lock-Free Algorithm for Shared Deques",
M.M. Michael

The second one looks promising, since it only requires a single
CAS to enqueue and dequeue entries in the common case. Thus, it
should be just as efficient as our current spin-lock-based queue
in the uncontended case (and much better in the contested case, of
course).

[ Our case is, in fact, simpler than the generic setting that these
algorithms support. We only ever enqueue our *own* proc array entry
(so allocation and re-use of queue entries isn't much of an issue),
and always *process* the queue after enqueuing something - either
directly in LockAcquire or later in LockRelease. We thus don't really
need to support concurrent removal of queue entries, but might get
away with simply skipping queue processing if we detect that somebody
else is in the process of doing so. I think I have an idea for a
simpler lock-less queue that fits our needs, but I haven't yet
ironed out all the details (As it turns out, removing the very last
entry from the queue is the hard case). ]

Thoughts, comments and criticism are all very welcome!

best regards,
Florian Pflug

#28

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#27)

Re: spinlock contention

On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote:

The algorithm is quite straight forward, if one assumes a lock-free
implementation of a queue (More on that below)

This is similar to the CAS-based LWLocks I played around with, though I didn't use a lock-free queue. I think I probably need to devote some time to figuring out why that didn't work out to a win...

...Robert

#29

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#28)

Re: spinlock contention

On Jul13, 2011, at 00:10 , Robert Haas wrote:

On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote:

The algorithm is quite straight forward, if one assumes a lock-free
implementation of a queue (More on that below)

This is similar to the CAS-based LWLocks I played around with, though
I didn't use a lock-free queue. I think I probably need to devote some
time to figuring out why that didn't work out to a win...

Yeah, the non-waitqueue-related parts are mostly identical. The only
significant difference there is that the waiters-present bit is replaced
by fence-and-recheck.

What motivated me to research this was your theory that the problem is
not so much spin-lock contention, but rather those unlucky processes who
lose their time-slice while holding the lock.

If that is true, then maybe the problem with your CAS patch is that
once the LWLocks is contested heavily, the waiters-present bit will be
set pretty much all the time, and the code will thus fall back to
using the spin-lock.

*Reading the code again*

Hm, I just realized that you only clear the waiters-present bit after
emptying the queue completely. With rising contention, you might reach a
point where you never empty the queue completely (unless the lock is
only ever acquired in share mode, that is). As it stands, the CAS patch
is effectively neutralized at this point. It'll even have an adverse
effect due to the higher number of atomic operations compared to the
unpatched version.

I wonder if clearing the waiters-present bit only upon clearing the
queue completely is necessary for correctness. Wouldn't it be OK
to clear the bit after waking up at least one locker? That'd still
guarantee that you cannot end up with a blocked process but no lock
holder, I believe.

best regards
Florian Pflug

#30

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#29)

Re: spinlock contention

On Jul 12, 2011, at 8:10 PM, Florian Pflug <fgp@phlo.org> wrote:

On Jul13, 2011, at 00:10 , Robert Haas wrote:

On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote:

The algorithm is quite straight forward, if one assumes a lock-free
implementation of a queue (More on that below)

This is similar to the CAS-based LWLocks I played around with, though
I didn't use a lock-free queue. I think I probably need to devote some
time to figuring out why that didn't work out to a win...

Yeah, the non-waitqueue-related parts are mostly identical. The only
significant difference there is that the waiters-present bit is replaced
by fence-and-recheck.

What motivated me to research this was your theory that the problem is
not so much spin-lock contention, but rather those unlucky processes who
lose their time-slice while holding the lock.

If that is true, then maybe the problem with your CAS patch is that
once the LWLocks is contested heavily, the waiters-present bit will be
set pretty much all the time, and the code will thus fall back to
using the spin-lock.

*Reading the code again*

Hm, I just realized that you only clear the waiters-present bit after
emptying the queue completely. With rising contention, you might reach a
point where you never empty the queue completely (unless the lock is
only ever acquired in share mode, that is). As it stands, the CAS patch
is effectively neutralized at this point. It'll even have an adverse
effect due to the higher number of atomic operations compared to the
unpatched version.

Hmm, yeah.

I wonder if clearing the waiters-present bit only upon clearing the
queue completely is necessary for correctness. Wouldn't it be OK
to clear the bit after waking up at least one locker? That'd still
guarantee that you cannot end up with a blocked process but no lock
holder, I believe.

I don't think so - how does the next process that releases the lock know to release waiters?

...Robert

#31

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#30)

Re: spinlock contention

On Jul13, 2011, at 22:04 , Robert Haas wrote:

On Jul 12, 2011, at 8:10 PM, Florian Pflug <fgp@phlo.org> wrote:

I wonder if clearing the waiters-present bit only upon clearing the
queue completely is necessary for correctness. Wouldn't it be OK
to clear the bit after waking up at least one locker? That'd still
guarantee that you cannot end up with a blocked process but no lock
holder, I believe.

I don't think so - how does the next process that releases the lock
know to release waiters?

It won't :-(. Not that easily, at least.

The idea stemmed from that fact that my shared counter partitioning
patch get away with something similar. But that only works because it
always re-checks for possible interference after doing the fast path,
so the idea isn't directly applicable to your patch it seems.

Porting that idea to your CAS patch would probably make it nearly
identical to the shared-counter partitioning patch with the number of
partitions set to 1, so I see little point in that.

BTW, I think I got a workable atomic queue that suits our needs sketched
up - it'll post the details once I've convinced myself that it's actually
correct. So should you believe that approach to be unworkable for some
reason, please speak up and same me the effort.

best regards,
Florian Pflug

#32

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Florian Pflug (#22)

Re: spinlock contention

On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug <fgp@phlo.org> wrote:

I don't want to fiddle with your git repo, but if you attach a patch
that applies to the master branch I'll give it a spin if I have time.

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of "locked xadd" to increment the shared counters.

[ Back from vacation, catching up on email. ]

gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)

pgbench -n -S -T 180 -c 32 -j 32

with lwlock-part patch:
tps = 36974.644091 (including connections establishing)

unpatched cd34647c666be867f95ef8fc0492c30356043f10:
tps = 39432.064293 (including connections establishing)

And with -c 8 -j 8:

tps = 26946.202428 (including connections establishing)
tps = 27206.507424 (including connections establishing)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33

Florian Pflug

fgp@phlo.org

over 14 years ago

In reply to: Robert Haas (#32)

Re: spinlock contention

On Jul18, 2011, at 04:36 , Robert Haas wrote:

On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug <fgp@phlo.org> wrote:

I don't want to fiddle with your git repo, but if you attach a patch
that applies to the master branch I'll give it a spin if I have time.

Patch attached.

Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition
spin lock instead of "locked xadd" to increment the shared counters.

[ Back from vacation, catching up on email. ]

gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5)

pgbench -n -S -T 180 -c 32 -j 32

with lwlock-part patch:
tps = 36974.644091 (including connections establishing)

unpatched cd34647c666be867f95ef8fc0492c30356043f10:
tps = 39432.064293 (including connections establishing)

And with -c 8 -j 8:

tps = 26946.202428 (including connections establishing)
tps = 27206.507424 (including connections establishing)

:-( That's disappointing, to say the least.

I also completely fail to understand what the heck is going on there.

I mean, you did conclusively prove that commenting out the SInval stuff
made a huge difference. There's also supposed to hardly any invalidation
going on during a pgbench -S run. So, since the patch removes two of the
three spin-lock acquisitions from SIGetDataEntries() (so long as there are
no exclusive lockers of SInvalReadLock), there should be some effect.
Or so I'd think at least...

If anyone has I theory, I'd love to hear it.

best regards,
Florian Pflug