Move PinBuffer and UnpinBuffer to atomics
Hello hackers!
Continuing the theme: /messages/by-id/3368228.mTSz6V0Jsq@dinodell
This time, we rewrote 'refcount' and 'usage_count' as atomics in
PinBuffer and UnpinBuffer (but kept the lock for the buffer flags in Unpin).
This does not affect the correctness of the buffer manager, because those
variables are already protected by an LWLock on top of them (for the
partition of the hash table). If someone pins the buffer after the call to
StrategyGetBuffer, we simply try again (in BufferAlloc). There is also one
more check in the code before the old buffer is deleted, where the changes
can be rolled back. The other functions that inspect 'refcount' and
'usage_count' take exclusive locks.
A stress test with 256 KB of shared memory also completed successfully.
Without the patch we get 417523 TPS, and with the patch 965821 TPS, on a
big x86 server.
All details here: https://gist.github.com/stalkerg/773a81b79a27b4d5d63f
Thank you.
--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
atomic_bufmgr_v5.patch (text/x-patch; +61 -53)
Hi,
On 2015-09-11 13:23:24 +0300, YUriy Zhuravlev wrote:
Continuing the theme: /messages/by-id/3368228.mTSz6V0Jsq@dinodell
Please don't just start new threads for a new version of the patch.
This time, we fairly rewrote 'refcount' and 'usage_count' to atomic in
PinBuffer and UnpinBuffer (but save lock for buffer flags in Unpin).
Hm.
In the same time it doesn't affect to correctness of buffer manager
because that variables already have LWLock on top of them (for partition of
hashtable).
Note that there's a pending patch that removes the buffer mapping locks
entirely.
If someone pinned buffer after the call StrategyGetBuffer we just try
again (in BufferAlloc). Also in the code there is one more check
before deleting the old buffer, where changes can be rolled back. The
other functions where it is checked 'refcount' and 'usage_count' put
exclusive locks.
I don't think this is correct. This way we can leave the for (;;) loop
in BufferAlloc() thinking that the buffer is unused (and can't be further
pinned because of the held spinlock!) while it actually has been pinned
since by PinBuffer(). Additionally oldFlags can get out of sync there.
I don't think the approach of making some of the fields atomics but not
really caring about the rest is going to work. My suggestion is to add a
single 'state' 32bit atomic. This 32bit state is subdivided into:
10bit for flags,
3bit for usage_count,
16bit for refcount
then turn each operation that currently uses one of these fields into
corresponding accesses (just different values for flags, bit-shiftery &
mask for reading usage count, bit mask for reading refcount). The trick
then is to add a *new* flag value BM_LOCKED. This can then act as a sort
of a 'one bit' spinlock.
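The subdivision described above could be expressed with bit masks along
these lines (a hypothetical sketch: the constant names and exact bit
positions are my assumption, not taken from any posted patch):

```c
/* Hypothetical layout of the single 32-bit state word:
 * bits 0-15 refcount, bits 16-18 usage_count, bits 19-28 flags.
 * Names and positions are illustrative assumptions only. */
#define BUF_REFCOUNT_ONE      1U
#define BUF_REFCOUNT_MASK     ((1U << 16) - 1)
#define BUF_USAGECOUNT_SHIFT  16
#define BUF_USAGECOUNT_ONE    (1U << BUF_USAGECOUNT_SHIFT)
#define BUF_USAGECOUNT_MASK   (((1U << 3) - 1) << BUF_USAGECOUNT_SHIFT)
#define BUF_FLAG_MASK         (((1U << 10) - 1) << 19)

/* Accessors for the packed fields */
#define BUF_STATE_GET_REFCOUNT(state)   ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) \
    (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)
```

With such a layout, a pin is `state += BUF_REFCOUNT_ONE` and a usage-count
bump is `state += BUF_USAGECOUNT_ONE`, both folded into a single CAS.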
That should roughly look like (more or less pseudocode):
void
LockBufHdr(BufferDesc *desc)
{
    int state = pg_atomic_read_u32(&desc->state);

    for (;;)
    {
        /* wait till lock is free */
        while (unlikely(state & BM_LOCKED))
        {
            pg_spin_delay();
            state = pg_atomic_read_u32(&desc->state);
            /* add exponential backoff? Should seldomly be contended tho. */
        }

        /* and try to get lock */
        if (pg_atomic_compare_exchange_u32(&desc->state, &state,
                                           state | BM_LOCKED))
            break;
    }
}
static bool
PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
{
    ...
    if (ref == NULL)
    {
        ReservePrivateRefCountEntry();
        ref = NewPrivateRefCountEntry(b);
        ...
        int state = pg_atomic_read_u32(&buf->state);
        int oldstate = state;

        while (true)
        {
            /* spin-wait till lock is free */
            while (unlikely(state & BM_LOCKED))
            {
                pg_spin_delay();
                state = pg_atomic_read_u32(&buf->state);
            }

            /* increase refcount */
            state += 1;

            /* increase usagecount unless already max */
            if ((state & USAGE_COUNT_MASK) != BM_MAX_USAGE_COUNT)
                state += BM_USAGE_COUNT_ONE;

            result = (state & BM_VALID) != 0;

            if (pg_atomic_compare_exchange_u32(&buf->state, &oldstate, state))
                break;

            /* get ready for next loop, oldstate has been updated by cas */
            state = oldstate;
        }
    ...
}
Other callsites can either just plainly continue to use
LockBufHdr/UnlockBufHdr, or be converted similarly to PinBuffer().
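For completeness, the unlock side of the sketch above is just clearing
BM_LOCKED atomically (still more-or-less pseudocode; I'm assuming
pg_atomic_fetch_and_u32 from the existing atomics API, where the atomic
op doubles as the release barrier):

```
static void
UnlockBufHdr(BufferDesc *desc)
{
    /* clear the lock bit; the atomic fetch-and acts as the release */
    pg_atomic_fetch_and_u32(&desc->state, ~BM_LOCKED);
}
```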
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11 September 2015 at 22:23, YUriy Zhuravlev <u.zhuravlev@postgrespro.ru>
wrote:
Without patch we have 417523 TPS and with patch 965821 TPS for big x86
server.
All details here: https://gist.github.com/stalkerg/773a81b79a27b4d5d63f
Impressive!
I've run this on a single CPU server and don't see any speedup, so I assume
I'm not getting enough contention.
As soon as our 4 socket machine is free I'll try a pgbench run with that.
Just for fun, what's the results if you use -M prepared ?
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday 11 September 2015 18:14:21 Andres Freund wrote:
This way we can leave the for (;;) loop
in BufferAlloc() thinking that the buffer is unused (and can't be further
pinned because of the held spinlock!)
The lock is released after PinBuffer_Locked in BufferAlloc. Therefore, in
essence, nothing has changed.
--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2015-09-11 19:33:26 +0300, YUriy Zhuravlev wrote:
On Friday 11 September 2015 18:14:21 Andres Freund wrote:
This way we can leave the for (;;) loop
in BufferAlloc() thinking that the buffer is unused (and can't be further
pinned because of the held spinlock!)
We lost lock after PinBuffer_Locked in BufferAlloc. Therefore, in essence,
nothing has changed.
The relevant piece of code is:
    /*
     * Need to lock the buffer header too in order to change its tag.
     */
    LockBufHdr(buf);

    /*
     * Somebody could have pinned or re-dirtied the buffer while we were
     * doing the I/O and making the new hashtable entry.  If so, we can't
     * recycle this buffer; we must undo everything we've done and start
     * over with a new victim buffer.
     */
    oldFlags = buf->flags;
    if (buf->refcount == 1 && !(oldFlags & BM_DIRTY))
        break;

    UnlockBufHdr(buf);
    BufTableDelete(&newTag, newHash);
    if ((oldFlags & BM_TAG_VALID) &&
        oldPartitionLock != newPartitionLock)
        LWLockRelease(oldPartitionLock);
    LWLockRelease(newPartitionLock);
    UnpinBuffer(buf, true);
}

/*
 * Okay, it's finally safe to rename the buffer.
 *
 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
 * paranoia.  We also reset the usage_count since any recency of use of
 * the old content is no longer relevant.  (The usage_count starts out at
 * 1 so that the buffer can survive one clock-sweep pass.)
 */
buf->tag = newTag;
buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
                BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
if (relpersistence == RELPERSISTENCE_PERMANENT)
    buf->flags |= BM_TAG_VALID | BM_PERMANENT;
else
    buf->flags |= BM_TAG_VALID;
buf->usage_count = 1;

UnlockBufHdr(buf);
so unless I'm missing something, no, we haven't lost the lock.
Greetings,
Andres Freund
On Friday 11 September 2015 18:37:00 you wrote:
so unless I'm missing something, no, we haven't lost the lock.
This section is protected by LWLockAcquire(newPartitionLock,
LW_EXCLUSIVE) before it (and the buffer cannot be obtained from the hash
table in the meantime).
--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2015-09-11 19:46:02 +0300, YUriy Zhuravlev wrote:
On Friday 11 September 2015 18:37:00 you wrote:
so unless I'm missing something, no, we haven't lost the lock.
This section is protected by like LWLockAcquire(newPartitionLock,
LW_EXCLUSIVE); before it (and we can't get this buffer from hash table).
a) As I said upthread there's a patch to remove these locks entirely
b) It doesn't matter anyway. Not every pin goes through the buffer
mapping table. StrategyGetBuffer(), SyncOneBuffer(), ...
Greetings,
Andres Freund
On Friday 11 September 2015 18:50:35 you wrote:
a) As I said upthread there's a patch to remove these locks entirely
That is very interesting. Could you provide a link? And it is not
necessarily good, since the bottleneck then becomes PinBuffer/UnpinBuffer
instead of the LWLocks.
b) It doesn't matter anyway. Not every pin goes through the buffer
mapping table. StrategyGetBuffer(), SyncOneBuffer(), ...
StrategyGetBuffer is called only from BufferAlloc.
SyncOneBuffer is not a problem either, because of:
    PinBuffer_Locked(bufHdr);
    LWLockAcquire(bufHdr->content_lock, LW_SHARED);
And please read the comment before LockBufHdr(bufHdr) in SyncOneBuffer.
We checked all functions that touch refcount and usage_count.
Thanks! ^_^
--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2015-09-14 13:16:46 +0300, YUriy Zhuravlev wrote:
On Friday 11 September 2015 18:50:35 you wrote:
a) As I said upthread there's a patch to remove these locks entirely
It is very interesting. Could you provide a link?
And it's not very good,
since there is a bottleneck PinBuffer / UnpinBuffer instead of
LWLocks.
Where the bottleneck is entirely depends on your workload. If you have a
high cache replacement ratio the mapping partition locks are frequently
going to be held exclusively.
b) It doesn't matter anyway. Not every pin goes through the buffer
mapping table. StrategyGetBuffer(), SyncOneBuffer(), ...
StrategyGetBuffer call only from BufferAlloc .
It gets called without buffer mapping locks held. And it can (and
frequently will!) access all the buffers in the buffer pool.
SyncOneBuffer not problem too because:
PinBuffer_Locked(bufHdr);
Which you made ineffective because PinBuffer() doesn't take a lock
anymore. Mutual exclusion through locks only works if all participants
take the locks.
We checked all functions with refcount and usage_count.
Adding lockless behaviour by just taking out locks, without analyzing the
whole picture, isn't going to fly. You either need to provide backward
compatibility (a LockBuffer that provides actual exclusion) or you
actually need to go carefully through all relevant code and make it
lock-free.
I pointed out how you can actually make this safely lock-free giving you
the interesting code.
Greetings,
Andres Freund
On 2015-09-14 17:41:42 +0200, Andres Freund wrote:
I pointed out how you can actually make this safely lock-free giving you
the interesting code.
And here's an actual implementation of that approach. It's definitely
work-in-progress and could easily be optimized further. Don't have any
big machines to play around with right now tho.
Andres
Attachments:
bufferpin.diff (text/x-diff; +312 -158)
On Tuesday 15 September 2015 04:06:25 Andres Freund wrote:
And here's an actual implementation of that approach. It's definitely
work-in-progress and could easily be optimized further. Don't have any
big machines to play around with right now tho.
Thanks, interesting.
We had a version like your patch, but that is only half the work. Example:
    state = pg_atomic_read_u32(&buf->state);
    if ((state & BUF_REFCOUNT_MASK) == 0
        && (state & BUF_USAGECOUNT_MASK) == 0)
After the first statement somebody can change buf->state, so the local
state is no longer current.
In this form there is no significant difference between the two patches.
To be correct, the whole if statement would need to be done with a CAS.
Thanks! Hope for understanding. ^_^
--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Saturday 12 September 2015 04:15:43 David Rowley wrote:
I've run this on a single CPU server and don't see any speedup, so I assume
I'm not getting enough contention.
As soon as our 4 socket machine is free I'll try a pgbench run with that.
Excellent! Will wait.
Just for fun, what's the results if you use -M prepared ?
Unfortunately we cannot check that right now. :(
--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Sep 14, 2015 at 9:06 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-14 17:41:42 +0200, Andres Freund wrote:
I pointed out how you can actually make this safely lock-free giving you
the interesting code.
And here's an actual implementation of that approach. It's definitely
work-in-progress and could easily be optimized further. Don't have any
big machines to play around with right now tho.
Are you confident this is faster across all workloads? Pin/Unpin are
probably faster but this comes at a cost of extra atomic ops during
the clock sweep loop. I wonder if this will degrade results under
heavy contention.
Also, I'm curious about your introduction of __builtin_expect()
macros. Did you measure any gain from them? I bet there are other
places they could be used -- for example the mvcc hint bit checks on
xmin.
merlin
On 2015-09-15 12:51:24 +0300, YUriy Zhuravlev wrote:
We had a version like your patch. But this is only half the work. Example:
state = pg_atomic_read_u32(&buf->state);
if ((state & BUF_REFCOUNT_MASK) == 0
&& (state & BUF_USAGECOUNT_MASK) == 0)
After the first command somebody can change buf->state and local state not
actual.
No, they can't in any relevant manner. We hold the buffer header lock.
In this embodiment, there is no significant difference between the two
patches. For honest work will need used the CAS for all IF statement.
What?
Thanks! Hope for understanding. ^_^
There's pretty little understanding left at this point. You're posting
things for review and you seem completely unwilling to actually respond
to points raised.
Greetings,
Andres Freund
On 2015-09-15 08:07:57 -0500, Merlin Moncure wrote:
Are you confident this is faster across all workloads?
No. This is a proof of concept I just wrote & posted because I didn't
see the patch moving in the right direction. But I do think it can be
made faster in all relevant workloads.
Pin/Unpin are probably faster but this comes at a cost of extra atomic
ops during the clock sweep loop. I wonder if this will degrade
results under heavy contention.
I think it's actually going to be faster under contention, and the
situation where it's slower is uncontended workloads where you have a
very low cache hit ratio.
Also, I'm curious about your introduction of __builtin_expect()
macros. Did you measure any gain from them?
I introduced them because I was bothered by the generated assembler ;)
But a bit more seriously, I do think there's some benefit in influencing
the code like that. I personally also find they *increase* readability
in cases like this where the likely() branch should be taken just about
all the time.
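For readers unfamiliar with these macros, a common way they are defined (a
sketch; GCC/Clang-specific, and the exact definitions in the patch may
differ):

```c
/* Branch-prediction hints: tell the compiler which way a test usually
 * goes, so the common case stays on the fall-through path. The (x) != 0
 * normalizes the condition to 0 or 1 before passing it to the builtin. */
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
```

Both macros return the truth value of their argument unchanged; only the
compiler's block layout is affected.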
I bet there are other places they could be used -- for example the
mvcc hint bit checks on xmin.
I don't think those are good candidates, there's too many cases where
it's common to have the majority of cases go the other way.
Greetings,
Andres Freund
On Tue, Sep 15, 2015 at 9:56 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-15 08:07:57 -0500, Merlin Moncure wrote:
Also, I'm curious about your introduction of __builtin_expect()
macros. Did you measure any gain from them?
I introduced them because I was bothered by the generated assembler ;)
But a bit more seriously, I do think there's some benefit in influencing
the code like that. I personally also find they *increase* readability
in cases like this where the likely() branch should be taken just about
all the time.
right. For posterity, I agree with this.
I bet there are other places they could be used -- for example the
mvcc hint bit checks on xmin.
I don't think those are good candidates, there's too many cases where
it's common to have the majority of cases go the other way.
Maybe, but, consider that penalty vs win is asymmetric. If the hint
bit isn't set, you're doing a lot of other work anyways such that the
branch penalty falls away to noise while if you win the benefits are
significant against the tight tuple scan loop.
Anyways, as it pertains to *this* patch, +1 for adding that feature.
merlin
On Tuesday 15 September 2015 16:50:44 Andres Freund wrote:
No, they can't in a a relevant manner. We hold the buffer header lock.
I'm sorry, I did not notice the LockBufHdr.
In this form your approach seems very similar to s_lock: the loop in
PinBuffer behaves like s_lock.
In LockBufHdr:
    if (pg_atomic_compare_exchange_u32(&desc->state, &state, state | BM_LOCKED))
conflicts with:
    while (unlikely(state & BM_LOCKED))
from PinBuffer.
Thus your patch does not remove the contention on PinBuffer.
We will try to check your patch this week.
You're posting
things for review and you seem completely unwilling to actually respond
to points raised.
I think we're just talking about different things.
--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2015-09-15 19:43:28 +0300, YUriy Zhuravlev wrote:
On Tuesday 15 September 2015 16:50:44 Andres Freund wrote:
No, they can't in a a relevant manner. We hold the buffer header lock.
I'm sorry, I did not notice of a LockBufHdr.
In this embodiment, your approach seems to be very similar to s_lock. Cycle in
PinBuffer behaves like s_lock.
In LockBufHdr:
if (pg_atomic_compare_exchange_u32(&desc->state, &state, state | BM_LOCKED))
conflict with:
while (unlikely(state & BM_LOCKED))
from PinBuffer.
Thus your patch does not remove the problem of competition for PinBuffer.
We will try check your patch this week.
That path is only taken if somebody else has already locked the buffer
(e.g. BufferAlloc()). If you have contention in PinBuffer() your
workload will be mostly cache resident and neither PinBuffer() nor
UnpinBuffer() set BM_LOCKED.
Greetings,
Andres Freund
That path is only taken if somebody else has already locked the buffer
(e.g. BufferAlloc()). If you have contention in PinBuffer() your
workload will be mostly cache resident and neither PinBuffer() nor
UnpinBuffer() set BM_LOCKED.
Thanks. Now I understand everything. It might work.
We will test it.
your workload
A simple pgbench -S on a NUMA machine.
--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2015-09-15 20:16:10 +0300, YUriy Zhuravlev wrote:
We will be tested.
Did you have a chance to run some benchmarks?
Andres