Wait free LW_SHARED acquisition

Started by Andres Freundover 12 years ago117 messageshackers
Jump to latest
#1Andres Freund
andres@anarazel.de

Hello,

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

That's rougly 400%.

So, nice little improvement. Unless - not entirely unlikely - I fucked
up and it's fast because it's broken.

Anyway, here's the algorith I chose to implement:
The basic idea is to have a single 'uint32 lockcount' instead of the
former 'char exclusive' and 'int shared' and to use an atomic increment
to acquire the lock. That's fairly easy to do for rw-spinlocks, but a
lot harder for something like LWLocks that want to wait in the OS.

Exlusive lock acquisition:

Use an atomic compare-and-exchange on the lockcount variable swapping in
EXCLUSIVE_LOCK/1<<31/0x80000000 if and only if the current value of
lockcount is 0. If the swap was not successfull, we have to wait.

Shared lock acquisition:

Use an atomic add (lock xadd) to the lockcount variable to add 1. If the
value is bigger than EXCLUSIVE_LOCK we know that somebody actually has
an exclusive lock, and we back out by atomically decrementing by 1
again.
If so, we have to wait for the exlusive locker to release the lock.

Queueing & Wakeup:

Whenever we don't get a shared/exclusive lock we us nearly the same
queuing mechanism as we currently do. While we probably could make it
lockless as well, the queue currently is still protected by the good old
spinlock.

Relase:

Use a atomic decrement to release the lock. If the new value is zero (we
get that atomically), we know we have to release waiters.

And the real world:

Now, as you probably have noticed, naively doing the above has two
glaring race conditions:

1) too-quick-for-queueing:
We try to lock using the atomic operations and notice that we have to
wait. Unfortunately until we have finished queuing, the former locker
very well might have already finished it's work.

2) spurious failed locks:
Due to the logic of backing out of shared locks after we unconditionally
added a 1 to lockcount, we might have prevented another exclusive locker
from getting the lock:
1) Session A: LWLockAcquire(LW_EXCLUSIVE) - success
2) Session B: LWLockAcquire(LW_SHARED) - lockcount += 1
3) Session B: LWLockAcquire(LW_SHARED) - oops, bigger than EXCLUSIVE_LOCK
4) Session B: LWLockRelease()
5) Session C: LWLockAcquire(LW_EXCLUSIVE) - check if lockcount = 0, no. wait.
6) Session B: LWLockAcquire(LW_SHARED) - lockcount -= 1
7) Session B: LWLockAcquire(LW_SHARED) - wait

So now we can have both B) and C) waiting on a lock that nobody is
holding anymore. Not good.

The solution:
We use a two phased attempt at locking:
Phase 1: Try to do it atomically, if we succeed, nice
Phase 2: Add us too the waitqueue of the lock
Phase 3: Try to grab the lock again, if we succeed, remove ourselves
from the queue
Phase 4: Sleep till wakeup, goto Phase 1

This protects us against both problems from above:
1) Nobody can release too quick, before we're queued, after Phase 2 since we're already
queued.
2) If somebody spuriously got blocked from acquiring the lock, they will
get queued in Phase 2 and we can wake them up if neccessary.

Now, there are lots of tiny details to handle additionally to those, but
those seem better handled by looking at the code?
- The above algorithm only works for LWLockAcquire, not directly for
LWLockAcquireConditional where we don't want to wait. In that case we
just need to retry acquiring the lock until we're sure we didn't
disturb anybody in doing so.
- we can get removed from the queue of waiters in Phase 3, before we remove
ourselves. In that case we need to absorb the wakeup.
- Spurious locks can prevent us from recognizing a lock that's free
during release. Solve it by checking for existing waiters whenever an
exlusive lock is released.

I've done a couple of off-hand benchmarks and so far I can confirm that
everything using lots of shared locks benefits greatly and everything
else doesn't really change much. So far I've seen mostly some slight
improvements in exclusive lock heavy workloads, but those were pretty
small.
It's also very important to mention that those speedups are only there
on multi-socket machines. From what I've benchmarked so far in LW_SHARED
heavy workloads with 1 socket you get ~5-10%, 2 sockets 20-30% and
finally and nicely for 4 sockets: 350-400%.
While I did assume the difference would be bigger on 4 socket machines
than on my older 2 socket workstation (that's where the 20-30% come
from) I have to admit, I was surprised by the difference on the 4 socket
machine.

Does anybody see fundamental problems with the algorithm? The
implementation sure isn't ready for several reasons, but I don't want to
go ahead and spend lots of time on it before hearing some more voices.

So what's todo? The file header tells us:
* - revive pure-spinlock implementation
* - abstract away atomic ops, we really only need a few.
* - CAS
* - LOCK XADD
* - convert PGPROC->lwWaitLink to ilist.h slist or even dlist.
* - remove LWLockWakeup dealing with MyProc
* - overhaul the mask offsets, make SHARED/EXCLUSIVE_LOCK_MASK wider, MAX_BACKENDS

Currently only gcc is supported because I used its
__sync_fetch_and_add(), __sync_fetch_and_sub() and
__sync_val_compare_and_swap() are used. There have been reports about
__sync_fetch_and_sub() not getting properly optimized with gcc < 4.8,
perhaps we need to replace it by _and_add(-val). Given the low amount of
primitives required, it should be adaptable to most newer compilers.

Comments? Fundamental flaws? 8 socket machines?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v01-Wait-free-LW_SHARED-lwlock-acquiration.patchtext/x-patch; charset=us-asciiDownload+617-256
In reply to: Andres Freund (#1)
Re: Wait free LW_SHARED acquisition

On Thu, Sep 26, 2013 at 3:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

That's unfortunate. I saw someone complain about what sounds like
exactly the same issue on Twitter yesterday:

https://twitter.com/Roguelazer/status/382706273927446528

I tried to engage with him, but was unsuccessful.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#2)
Re: Wait free LW_SHARED acquisition

On 2013-09-26 16:56:30 -0700, Peter Geoghegan wrote:

On Thu, Sep 26, 2013 at 3:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

That's unfortunate. I saw someone complain about what sounds like
exactly the same issue on Twitter yesterday:

Well, fortunately there's a solution, as presented here ;)

There's another bottleneck in the heaps of PinBuffer() calls in such
workloads, that present themselves after fixing the lwlock contention,
at least in my tests. I think I see a solution there, but let's fix this
first though ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#1)
Re: Wait free LW_SHARED acquisition

On 27.09.2013 01:55, Andres Freund wrote:

Hello,

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#4)
Re: Wait free LW_SHARED acquisition

Hi,

On 2013-09-27 10:14:46 +0300, Heikki Linnakangas wrote:

On 27.09.2013 01:55, Andres Freund wrote:

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

Heaps better. In the case causing this investigation lots of the pages
with hot spinlocks were the simply the same ones over and over again,
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.

I can run a test on the 4 socket machine if it's unused, but on my 2
socket workstation the benefits of at least our simulation of the
original workloads the improvements were marginal after increasing the
padding to a full cacheline.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#5)
Re: Wait free LW_SHARED acquisition

On 2013-09-27 09:21:05 +0200, Andres Freund wrote:

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

Heaps better. In the case causing this investigation lots of the pages
with hot spinlocks were the simply the same ones over and over again,
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.

I can run a test on the 4 socket machine if it's unused, but on my 2
socket workstation the benefits of at least our simulation of the
original workloads the improvements were marginal after increasing the
padding to a full cacheline.

Ok, was free:

padding + 16 partitions:
tps = 147884.648416

padding + 32 partitions:
tps = 141777.841125

padding + 64 partitions:
tps = 141561.539790

padding + 16 partitions + new lwlocks
tps = 601895.580903 (yeha, still reproduces after some sleep!)

Now, the other numbers were best-of-three, these aren't, but I think
it's pretty clear that you're not going to see the same benefits. I am
not surprised...
The current implementation of lwlocks will frequently block others, both
during acquiration and release of locks. What's even worse, trying to
fruitlessly acquire a spinlock will often prevent releasing it because
we need the spinlock during release.
With the proposed algorithm, even if we need the spinlock during release
of an lwlock because there are queued PGPROCs, we will acquire that
spinlock only after already having released the lock...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#5)
Re: Wait free LW_SHARED acquisition

On 27.09.2013 10:21, Andres Freund wrote:

Hi,

On 2013-09-27 10:14:46 +0300, Heikki Linnakangas wrote:

On 27.09.2013 01:55, Andres Freund wrote:

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

Heaps better. In the case causing this investigation lots of the pages
with hot spinlocks were the simply the same ones over and over again,
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.

I see. So if only a few buffers are really hot, I'm assuming the problem
isn't just the buffer partition lock, but also the spinlock on the
buffer header, and the buffer content lwlock. Yeah, improving LWLocks
would be a nice wholesale solution to that. I don't see any fundamental
flaw in your algorithm. Nevertheless, I'm going to throw in a couple of
other ideas:

* Keep a small 4-5 entry cache of buffer lookups in each backend of most
recently accessed buffers. Before searching for a buffer in the
SharedBufHash, check the local cache.

* To pin a buffer, use an atomic fetch-and-add instruction to increase
the refcount. PinBuffer() also increases usage_count, but you could do
that without holding a lock; it doesn't need to be accurate.

One problem with your patch is going to be to make it also work without
the CAS and fetch-and-add instructions. Those are probably present in
all the architectures we support, but it'll take some effort to get the
architecture-specific code done. Until it's all done, it would be good
to be able to fall back to plain spinlocks, which we already have. Also,
when someone ports PostgreSQL to a new architecture in the future, it
would be helpful if you wouldn't need to write all the
architecture-specific code immediately to get it to compile.

Did you benchmark your patch against the compare-and-swap patch I posted
earlier?
(/messages/by-id/519A3587.80603@vmware.com). Just
on a theoretical level, I would assume your patch to scale better since
it uses fetch-and-add instead of compare-and-swap for acquiring a shared
lock. But in practice it might be a wash.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#6)
Re: Wait free LW_SHARED acquisition

On 2013-09-27 09:57:07 +0200, Andres Freund wrote:

On 2013-09-27 09:21:05 +0200, Andres Freund wrote:

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

Heaps better. In the case causing this investigation lots of the pages
with hot spinlocks were the simply the same ones over and over again,
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.

I can run a test on the 4 socket machine if it's unused, but on my 2
socket workstation the benefits of at least our simulation of the
original workloads the improvements were marginal after increasing the
padding to a full cacheline.

Ok, was free:

padding + 16 partitions:
tps = 147884.648416

padding + 32 partitions:
tps = 141777.841125

padding + 64 partitions:
tps = 141561.539790

padding + 16 partitions + new lwlocks
tps = 601895.580903 (yeha, still reproduces after some sleep!)

Pgbench numbers for writes on the machine (fsync = off,
synchronous_commit = off):
padding + 16 partitions:
tps = 8903.532163
vs
padding + 16 partitions + new lwlocks
tps = 13324.080491

So, on bigger machines the advantages seem to be there for writes as
well...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#7)
Re: Wait free LW_SHARED acquisition

On 2013-09-27 11:11:56 +0300, Heikki Linnakangas wrote:

On 27.09.2013 10:21, Andres Freund wrote:

Heaps better. In the case causing this investigation lots of the pages
with hot spinlocks were the simply the same ones over and over again,
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.

I see. So if only a few buffers are really hot, I'm assuming the problem
isn't just the buffer partition lock, but also the spinlock on the buffer
header, and the buffer content lwlock. Yeah, improving LWLocks would be a
nice wholesale solution to that. I don't see any fundamental flaw in your
algorithm. Nevertheless, I'm going to throw in a couple of other ideas:

* Keep a small 4-5 entry cache of buffer lookups in each backend of most
recently accessed buffers. Before searching for a buffer in the
SharedBufHash, check the local cache.

I thought about that as well, but you'd either need to revalidate after
pinning the buffers, or keep them pinned.
I had a very hacky implementation of that, but it just make the buffer
content locks the top profile spots. Similar issue there.

It might be worthwile to do this nonetheless - lock xadd; certainly
isn't cheap, even due it's cheaper than a full spinnlock - but it's not
trivial.

* To pin a buffer, use an atomic fetch-and-add instruction to increase the
refcount. PinBuffer() also increases usage_count, but you could do that
without holding a lock; it doesn't need to be accurate.

Yes, Pin/UnpinBuffer() are the primary contention points after this
patch. I want to tackle them, but that seems like a separate thing to
do.

I think we should be able to get rid of most or even all LockBufHdr()
calls by
a) introducing PinBufferInternal() which increase pins but not
usage count using an atomic increment. That can replace locking headers
in many cases.
b) make PinBuffer() increment pin and usagecount using a single 64bit
atomic add if available and fit flags in there as well. Then back off
the usagecount if it's too high or even wraps around, that doesn't hurt
much, we're pinned in that moment.

One problem with your patch is going to be to make it also work without the
CAS and fetch-and-add instructions. Those are probably present in all the
architectures we support, but it'll take some effort to get the
architecture-specific code done. Until it's all done, it would be good to be
able to fall back to plain spinlocks, which we already have. Also, when
someone ports PostgreSQL to a new architecture in the future, it would be
helpful if you wouldn't need to write all the architecture-specific code
immediately to get it to compile.

I think most recent compilers have intrinsics for stuff for operations
like that. I quite like the API provided by gcc for this kind of stuff,
I think we should model an internal wrapper API similarly. I don't see
any new platforming coming that has a compiler without intrinsics?

But yes, you're right, I think we need a spinlock based fallback for
now. Even if it's just because nobody of us can verify the
implementations on the more obsolete platforms we claim to support.I
just didn't see it as a priority in the PoC.

Did you benchmark your patch against the compare-and-swap patch I posted
earlier? (/messages/by-id/519A3587.80603@vmware.com).
Just on a theoretical level, I would assume your patch to scale better since
it uses fetch-and-add instead of compare-and-swap for acquiring a shared
lock. But in practice it might be a wash.

I've tried compare-and-swap for shared acquisition and it performed
worse, didn't try your patch though as you seemed to have concluded it's
a wash after doing the unlocked test.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Florian Pflug
fgp@phlo.org
In reply to: Andres Freund (#1)
Re: Wait free LW_SHARED acquisition

On Sep27, 2013, at 00:55 , Andres Freund <andres@2ndquadrant.com> wrote:

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

That's rougly 400%.

Interesting. I played with pretty much the same idea two years or so ago.
At the time, I compared a few different LWLock implementations. Those
were AFAIR

A) Vanilla LWLocks
B) A + an atomic-increment fast path, very similar to your proposal
C) B but with a partitioned atomic-increment counter to further
reduce cache-line contention
D) A with the spinlock-based queue replaced by a lockless queue

At the time, the improvements seemed to be negligible - they looked great
in synthetic benchmarks of just the locking code, but didn't translate
to improved TPS numbers. Though I think the only version that ever got
tested on more than a handful of cores was C…

My (rather hacked together) benchmarking code can be found here: https://github.com/fgp/lockbench.
The different LWLock implementations live in the various pg_lwlock_* subfolders.

Here's a pointer into the relevant thread: /messages/by-id/651002C1-2EC1-4731-9B29-99217CB36653@phlo.org

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Andres Freund
andres@anarazel.de
In reply to: Florian Pflug (#10)
Re: Wait free LW_SHARED acquisition

On 2013-09-27 14:46:50 +0200, Florian Pflug wrote:

On Sep27, 2013, at 00:55 , Andres Freund <andres@2ndquadrant.com> wrote:

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065

That's rougly 400%.

Interesting. I played with pretty much the same idea two years or so ago.
At the time, I compared a few different LWLock implementations. Those
were AFAIR

A) Vanilla LWLocks
B) A + an atomic-increment fast path, very similar to your proposal
C) B but with a partitioned atomic-increment counter to further
reduce cache-line contention
D) A with the spinlock-based queue replaced by a lockless queue

At the time, the improvements seemed to be negligible - they looked great
in synthetic benchmarks of just the locking code, but didn't translate
to improved TPS numbers. Though I think the only version that ever got
tested on more than a handful of cores was C…

I think you really need multi-socket systems to see the big benefits
from this. My laptop barely shows any improvements, while my older 2
socket workstation already shows some in workloads that have more
contention than pgbench -S.

From a quick look, you didn't have any sleeping queueing in at least one
of the variants in there? In my tests, that was tremendously important
to improve scaling if there was any contention. Which is not surprising
in the end, because otherwise you essentially have rw-spinlocks which
really aren't suitable for many of the lwlocks we use.

Getting the queueing semantics, including releaseOK, right was what took
me a good amount of time, the atomic ops part was pretty quick...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Bernd Helmle
mailings@oopsware.de
In reply to: Andres Freund (#6)
Re: Wait free LW_SHARED acquisition

--On 27. September 2013 09:57:07 +0200 Andres Freund
<andres@2ndquadrant.com> wrote:

Ok, was free:

padding + 16 partitions:
tps = 147884.648416

padding + 32 partitions:
tps = 141777.841125

padding + 64 partitions:
tps = 141561.539790

padding + 16 partitions + new lwlocks
tps = 601895.580903 (yeha, still reproduces after some sleep!)

Hmm, out of interest and since i have access to a (atm) free DL580 G7 (4x
E7-4800 10core) i've tried your bench against this machine and got this
(best of three):

HEAD (default):

tps = 181738.607247 (including connections establishing)
tps = 182665.993063 (excluding connections establishing)

HEAD (padding + 16 partitions + your lwlocks patch applied):

tps = 269328.259833 (including connections establishing)
tps = 270685.666091 (excluding connections establishing)

So, still an improvement but far away from what you got. Do you have some
other tweaks in your setup?

--
Thanks

Bernd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Andres Freund
andres@anarazel.de
In reply to: Bernd Helmle (#12)
Re: Wait free LW_SHARED acquisition

Hi,

On 2013-09-30 18:54:11 +0200, Bernd Helmle wrote:

HEAD (default):

tps = 181738.607247 (including connections establishing)
tps = 182665.993063 (excluding connections establishing)

HEAD (padding + 16 partitions + your lwlocks patch applied):

tps = 269328.259833 (including connections establishing)
tps = 270685.666091 (excluding connections establishing)

So, still an improvement but far away from what you got. Do you have some
other tweaks in your setup?

The only relevant setting changed was -c shared_buffers=1GB, no other
patches applied. At which scale did you pgbench -i?

Your processors are a different microarchitecture, I guess that could
also explain some of the difference. Any chance you could run a perf record -ag
(after compiling with -fno-omit-frame-pointer)?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Bernd Helmle
mailings@oopsware.de
In reply to: Andres Freund (#13)
Re: Wait free LW_SHARED acquisition

--On 30. September 2013 19:00:06 +0200 Andres Freund
<andres@2ndquadrant.com> wrote:

HEAD (default):

tps = 181738.607247 (including connections establishing)
tps = 182665.993063 (excluding connections establishing)

HEAD (padding + 16 partitions + your lwlocks patch applied):

tps = 269328.259833 (including connections establishing)
tps = 270685.666091 (excluding connections establishing)

So, still an improvement but far away from what you got. Do you have some
other tweaks in your setup?

The only relevant setting changed was -c shared_buffers=1GB, no other
patches applied. At which scale did you pgbench -i?

I've used a scale factor of 10 (i recall you've mentioned using the same
upthread...).

Okay, i've used 2GB shared buffers, repeating with your setting i get a far
more noticable speedup:

tps = 346292.008580 (including connections establishing)
tps = 347997.073595 (excluding connections establishing)

Here's the perf output:

+   4.34%      207112      postgres  postgres                 [.] 
AllocSetAlloc
+   4.07%      194476      postgres  libc-2.13.so             [.] 0x127b33
+   2.59%      123471      postgres  postgres                 [.] 
SearchCatCache
+   2.49%      118974       pgbench  libc-2.13.so             [.] 0x11aaef
+   2.48%      118263      postgres  postgres                 [.] 
GetSnapshotData
+   2.46%      117646      postgres  postgres                 [.] 
base_yyparse
+   2.02%       96546      postgres  postgres                 [.] 
MemoryContextAllocZeroAligned
+   1.58%       75326      postgres  postgres                 [.] 
AllocSetFreeIndex
+   1.23%       58587      postgres  postgres                 [.] 
hash_search_with_hash_value
+   1.01%       48391      postgres  postgres                 [.] palloc
+   0.93%       44258      postgres  postgres                 [.] 
LWLockAttemptLock
+   0.91%       43575       pgbench  libc-2.13.so             [.] free
+   0.89%       42484      postgres  postgres                 [.] 
nocachegetattr
+   0.89%       42378      postgres  postgres                 [.] core_yylex
+   0.88%       42001      postgres  postgres                 [.] 
_bt_compare
+   0.84%       39997      postgres  postgres                 [.] 
expression_tree_walker
+   0.76%       36533      postgres  postgres                 [.] 
ScanKeywordLookup
+   0.74%       35515       pgbench  libc-2.13.so             [.] malloc
+   0.64%       30715      postgres  postgres                 [.] 
LWLockRelease
+   0.56%       26779      postgres  postgres                 [.] 
fmgr_isbuiltin
+   0.54%       25681       pgbench  [kernel.kallsyms]        [k] _spin_lock
+   0.48%       22836      postgres  postgres                 [.] new_list
+   0.48%       22700      postgres  postgres                 [.] hash_any
+   0.47%       22378      postgres  postgres                 [.] 
FunctionCall2Coll
+   0.46%       22095      postgres  postgres                 [.] pfree
+   0.44%       20929      postgres  postgres                 [.] palloc0
+   0.43%       20592      postgres  postgres                 [.] 
AllocSetFree
+   0.40%       19495      postgres  [unknown]                [.] 0x81cf2f
+   0.40%       19247      postgres  postgres                 [.] 
hash_uint32
+   0.38%       18227      postgres  postgres                 [.] PinBuffer
+   0.38%       18022       pgbench  [kernel.kallsyms]        [k] do_select

--
Thanks

Bernd

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Merlin Moncure
mmoncure@gmail.com
In reply to: Bernd Helmle (#14)
Re: Wait free LW_SHARED acquisition

On Mon, Sep 30, 2013 at 5:28 PM, Bernd Helmle <mailings@oopsware.de> wrote:

--On 30. September 2013 19:00:06 +0200 Andres Freund
<andres@2ndquadrant.com> wrote:

HEAD (default):

tps = 181738.607247 (including connections establishing)
tps = 182665.993063 (excluding connections establishing)

HEAD (padding + 16 partitions + your lwlocks patch applied):

tps = 269328.259833 (including connections establishing)
tps = 270685.666091 (excluding connections establishing)

So, still an improvement but far away from what you got. Do you have some
other tweaks in your setup?

The only relevant setting changed was -c shared_buffers=1GB, no other
patches applied. At which scale did you pgbench -i?

I've used a scale factor of 10 (i recall you've mentioned using the same
upthread...).

Okay, i've used 2GB shared buffers, repeating with your setting i get a far
more noticable speedup:

If Andres's patch passes muster it may end up causing us to
re-evaluate practices for the shared buffer setting. I was trying to
optimize buffer locking in the clock sweep using a different approach
and gave up after not being able to find useful test cases to
demonstrate an improvement. The main reason for this is that clock
sweep issues are masked by contention in the buffer mapping lwlocks
(as you guys noted). I *do* think clock sweep contention comes out in
some production workloads but so far have been elusive to produce in
synthetic testing. Ditto buffer pin contention (this has been
documented).

So I'm very excited about this patch. Right now in servers I
configure (even some very large ones) I set shared buffers to max 2gb
for various reasons. Something tells me that's about to change.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Andres Freund
andres@anarazel.de
In reply to: Bernd Helmle (#14)
Re: Wait free LW_SHARED acquisition

On 2013-10-01 00:28:55 +0200, Bernd Helmle wrote:

--On 30. September 2013 19:00:06 +0200 Andres Freund
<andres@2ndquadrant.com> wrote:

HEAD (default):

tps = 181738.607247 (including connections establishing)
tps = 182665.993063 (excluding connections establishing)

HEAD (padding + 16 partitions + your lwlocks patch applied):

tps = 269328.259833 (including connections establishing)
tps = 270685.666091 (excluding connections establishing)

So, still an improvement but far away from what you got. Do you have some
other tweaks in your setup?

The only relevant setting changed was -c shared_buffers=1GB, no other
patches applied. At which scale did you pgbench -i?

I've used a scale factor of 10 (i recall you've mentioned using the same
upthread...).

Okay, i've used 2GB shared buffers, repeating with your setting i get a far
more noticable speedup:

tps = 346292.008580 (including connections establishing)
tps = 347997.073595 (excluding connections establishing)

Could you send hierarchical profiles of both 1 and 2GB? It's curious
that the difference is that big... Even though they will be a bit big,
it'd be helpful if you pasted the output of "perf report --stdio", to
include the callers...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: Wait free LW_SHARED acquisition - v0.2

Hi,

On 2013-09-27 00:55:45 +0200, Andres Freund wrote:

So what's todo? The file header tells us:
* - revive pure-spinlock implementation
* - abstract away atomic ops, we really only need a few.
* - CAS
* - LOCK XADD
* - convert PGPROC->lwWaitLink to ilist.h slist or even dlist.
* - remove LWLockWakeup dealing with MyProc
* - overhaul the mask offsets, make SHARED/EXCLUSIVE_LOCK_MASK wider, MAX_BACKENDS

So, here's the next version of this patchset:
1) I've added an abstracted atomic ops implementation. Needs a fair
amount of work, also submitted as a separate CF entry. (Patch 1 & 2)
2) I've converted PGPROC->lwWaiting into a dlist. That makes a fair bit
of code easier to read and reduces the size of the patchset. Also
fixes a bug in the xlog-scalability code. (Patch 3)
3) Alvaro and I updated the comments in lwlock.c. (Patch 4)

I think 2) should be committable pretty soon. It's imo a pretty clear
win in readability. 1) will need a good bit of more work.

With regard to the scalable lwlock work, what's most needed now is a
good amount of testing.

Please note that you need to 'autoreconf' after applying the patchset. I
don't have a compatible autoconf version on this computer causing the
diff to be humongous if I include those changes.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Try-to-make-HP-UX-s-ac-not-cause-warnings-about-unus.patchtext/x-patch; charset=us-asciiDownload+3-2
0002-Very-basic-atomic-ops-implementation.patchtext/x-patch; charset=iso-8859-1Download+2078-1024
0003-Convert-the-PGPROC-lwWaitLink-list-into-a-dlist-inst.patchtext/x-patch; charset=us-asciiDownload+90-150
0004-Wait-free-LW_SHARED-lwlock-acquiration.patchtext/x-patch; charset=us-asciiDownload+523-232
#18Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
Re: Wait free LW_SHARED acquisition - v0.2

On 2013-11-15 20:47:26 +0100, Andres Freund wrote:

Hi,

On 2013-09-27 00:55:45 +0200, Andres Freund wrote:

So what's todo? The file header tells us:
* - revive pure-spinlock implementation
* - abstract away atomic ops, we really only need a few.
* - CAS
* - LOCK XADD
* - convert PGPROC->lwWaitLink to ilist.h slist or even dlist.
* - remove LWLockWakeup dealing with MyProc
* - overhaul the mask offsets, make SHARED/EXCLUSIVE_LOCK_MASK wider, MAX_BACKENDS

So, here's the next version of this patchset:
1) I've added an abstracted atomic ops implementation. Needs a fair
amount of work, also submitted as a separate CF entry. (Patch 1 & 2)
2) I've converted PGPROC->lwWaiting into a dlist. That makes a fair bit
of code easier to read and reduces the size of the patchset. Also
fixes a bug in the xlog-scalability code. (Patch 3)
3) Alvaro and I updated the comments in lwlock.c. (Patch 4)

I think 2) should be committable pretty soon. It's imo a pretty clear
win in readability. 1) will need a good bit of more work.

With regard to the scalable lwlock work, what's most needed now is a
good amount of testing.

Please note that you need to 'autoreconf' after applying the patchset. I
don't have a compatible autoconf version on this computer causing the
diff to be humongous if I include those changes.

Please also note that due to the current state of the atomics
implementation this likely will only work on a somewhat recent gcc and
that the performance might be slightly worse than in the previous
version because the atomic add is implemented using the CAS fallback.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#17)
Re: Wait free LW_SHARED acquisition - v0.2

On Fri, 2013-11-15 at 20:47 +0100, Andres Freund wrote:

So, here's the next version of this patchset:

The 0002 patch contains non-ASCII, non-UTF8 characters:

0002-Very-basic-atomic-ops-implementation.patch: line 609, char 1, byte offset 43: invalid UTF-8 code

Please change that to ASCII, or UTF-8 if necessary.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#17)
Re: Wait free LW_SHARED acquisition - v0.2

This patch didn't make it out of the 2013-11 commit fest. You should
move it to the next commit fest (probably with an updated patch)
before January 15th.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Andres Freund (#17)
#22Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#21)
In reply to: Andres Freund (#22)
In reply to: Peter Geoghegan (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#23)
In reply to: Andres Freund (#25)
#27Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#26)
In reply to: Andres Freund (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#28)
#30Jeff Janes
jeff.janes@gmail.com
In reply to: Andres Freund (#29)
In reply to: Andres Freund (#29)
In reply to: Andres Freund (#29)
#33Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#32)
#34Christian Kruse
christian@2ndquadrant.com
In reply to: Andres Freund (#1)
In reply to: Christian Kruse (#34)
In reply to: Christian Kruse (#34)
#37Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#36)
In reply to: Andres Freund (#37)
In reply to: Christian Kruse (#34)
#40Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#38)
#41Christian Kruse
christian@2ndquadrant.com
In reply to: Peter Geoghegan (#39)
#42Christian Kruse
christian@2ndquadrant.com
In reply to: Christian Kruse (#41)
#43Christian Kruse
christian@2ndquadrant.com
In reply to: Andres Freund (#40)
In reply to: Christian Kruse (#43)
#45Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#44)
#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#22)
#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#22)
#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#47)
#49Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#48)
#50Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#49)
#51Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#50)
#52Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#51)
#53Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#52)
#54Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#53)
#55Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#54)
#56Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#55)
#57Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#55)
#58Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#56)
#59Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#60Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#59)
#61Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Heikki Linnakangas (#60)
#62Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#58)
#63Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#62)
#64Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#65Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#62)
#66Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#65)
#67Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#65)
#68Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#67)
#69Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#64)
#70Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#69)
#71Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#64)
#72Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#68)
#73Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#64)
#74Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#73)
#75Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#74)
#76Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#64)
#77Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#76)
#78Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#72)
#79Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#71)
#80Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#77)
#81Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#80)
#82Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#81)
#83Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#81)
#84Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#83)
#85Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#84)
#86Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#85)
#87Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#86)
#88Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#87)
#89Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#88)
#90Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#88)
#91Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#89)
#92Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#82)
#93Andres Freund
andres@anarazel.de
In reply to: Merlin Moncure (#92)
#94Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#93)
#95Amit Kapila
amit.kapila16@gmail.com
In reply to: Merlin Moncure (#94)
#96Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#91)
#97Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#96)
#98Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#97)
#99Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#62)
#100Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#99)
#101Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#100)
#102Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#99)
#103Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#101)
#104Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#102)
#105Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#104)
#106Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#105)
#107Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#98)
#108Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#107)
#109Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#108)
#110Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#109)
#111Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#110)
#112Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#106)
#113Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#112)
#114Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#113)
#115Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#114)
#116Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#115)
#117Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)