Lockless StrategyGetBuffer() clock sweep

Started by Andres Freund about 11 years ago, 12 messages
#1 Andres Freund
andres@2ndquadrant.com
1 attachment(s)

Hi,

I've previously posted a patch at
http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
that reduces contention in StrategyGetBuffer() by making the clock sweep
lockless. Robert asked me to post it to a new thread; I originally
wrote it to see some other contention in more detail, that's why it
ended up in that thread...

The performance numbers I've quoted over there are:

Test:
pgbench -M prepared -P 5 -S -c 496 -j 496 -T 5000
on a scale=1000 database, with 4GB of shared buffers.

Before:
progress: 40.0 s, 136252.3 tps, lat 3.628 ms stddev 4.547
progress: 45.0 s, 135049.0 tps, lat 3.660 ms stddev 4.515
progress: 50.0 s, 135788.9 tps, lat 3.640 ms stddev 4.398
progress: 55.0 s, 135268.4 tps, lat 3.654 ms stddev 4.469
progress: 60.0 s, 134991.6 tps, lat 3.661 ms stddev 4.739

after:
progress: 40.0 s, 207701.1 tps, lat 2.382 ms stddev 3.018
progress: 45.0 s, 208022.4 tps, lat 2.377 ms stddev 2.902
progress: 50.0 s, 209187.1 tps, lat 2.364 ms stddev 2.970
progress: 55.0 s, 206462.7 tps, lat 2.396 ms stddev 2.871
progress: 60.0 s, 210263.8 tps, lat 2.351 ms stddev 2.914

Imo the patch doesn't complicate the logic noticeably...

I do wonder if we could make the freelist accesses lockless as well -
but I think that's harder. So I don't want to mix that with this.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-WIP-lockless-StrategyGetBuffer-hotpath.patch (text/x-patch; charset=us-ascii)
From 6b486e5b467e94ab9297d7656a5b39b816c5c55a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 10 Oct 2014 17:36:46 +0200
Subject: [PATCH] WIP: lockless StrategyGetBuffer hotpath

---
 src/backend/storage/buffer/freelist.c | 154 ++++++++++++++++++++--------------
 1 file changed, 90 insertions(+), 64 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 5966beb..0c634a0 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -18,6 +18,12 @@
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 
+#include "port/atomics.h"
+
+
+#define LATCHPTR_ACCESS_ONCE(var)	((Latch *)(*((volatile Latch **)&(var))))
+#define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
+
 
 /*
  * The shared freelist control information.
@@ -27,8 +33,12 @@ typedef struct
 	/* Spinlock: protects the values below */
 	slock_t		buffer_strategy_lock;
 
-	/* Clock sweep hand: index of next buffer to consider grabbing */
-	int			nextVictimBuffer;
+	/*
+	 * Clock sweep hand: index of next buffer to consider grabbing. Note that
+	 * this isn't a concrete buffer - we only ever increase the value. So, to
+	 * get an actual buffer, it needs to be used modulo NBuffers.
+	 */
+	pg_atomic_uint32 nextVictimBuffer;
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
@@ -42,8 +52,8 @@ typedef struct
 	 * Statistics.  These counters should be wide enough that they can't
 	 * overflow during a single bgwriter cycle.
 	 */
-	uint32		completePasses; /* Complete cycles of the clock sweep */
-	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
+	pg_atomic_uint32 completePasses; /* Complete cycles of the clock sweep */
+	pg_atomic_uint32 numBufferAllocs;	/* Buffers allocated since last reset */
 
 	/*
 	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
@@ -124,87 +134,107 @@ StrategyGetBuffer(BufferAccessStrategy strategy)
 			return buf;
 	}
 
-	/* Nope, so lock the freelist */
-	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
 	 * the rate of buffer consumption.  Note that buffers recycled by a
 	 * strategy object are intentionally not counted here.
 	 */
-	StrategyControl->numBufferAllocs++;
+	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding buffer_strategy_lock; so release and re-grab.
-	 * This is annoyingly tedious, but it happens at most once per bgwriter
-	 * cycle, so the performance hit is minimal.
+	 * If bgwriterLatch is set, we need to waken the bgwriter. We don't want
+	 * to check this while holding the spinlock, so we read the latch from memory
+	 * once to see whether it's set. There's no need to do so with a lock
+	 * present - we'll just set the latch next call if we missed it once.
+	 *
+	 * Since we're not guaranteed atomic 8 byte reads we need to acquire the
+	 * spinlock if not null to be sure we get a correct pointer. Because we
+	 * don't want to set the latch while holding the buffer_strategy_lock we
+	 * just grab the lock to read and reset the pointer.
 	 */
-	bgwriterLatch = StrategyControl->bgwriterLatch;
+	bgwriterLatch = LATCHPTR_ACCESS_ONCE(StrategyControl->bgwriterLatch);
 	if (bgwriterLatch)
 	{
+		/* we don't have guaranteed atomic 64bit reads */
+		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+		bgwriterLatch = LATCHPTR_ACCESS_ONCE(StrategyControl->bgwriterLatch);
 		StrategyControl->bgwriterLatch = NULL;
 		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-		SetLatch(bgwriterLatch);
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+		/* recheck */
+		if (bgwriterLatch)
+			SetLatch(bgwriterLatch);
 	}
 
 	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the buffer_strategy_lock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
+	 * First check, without acquiring the lock, whether there are buffers in the
+	 * freelist. Since we otherwise don't require the spinlock in every
+	 * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
+	 * uselessly in most cases.
+	 *
+	 * If there's buffers on the freelist, acquire the spinlock and look into
+	 * them.
+	 *
+	 * Note that the freeNext fields are considered to be protected by
+	 * the buffer_strategy_lock not the individual buffer spinlocks, so it's
+	 * OK to manipulate them without holding the spinlock.
 	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	if (INT_ACCESS_ONCE(StrategyControl->firstFreeBuffer) >= 0)
 	{
-		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
-		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
-		/* Unconditionally remove buffer from freelist */
-		StrategyControl->firstFreeBuffer = buf->freeNext;
-		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		while (StrategyControl->firstFreeBuffer >= 0)
+		{
+			buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
-		/*
-		 * Release the lock so someone else can access the freelist (or run
-		 * the clocksweep) while we check out this buffer.
-		 */
-		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+			/* Unconditionally remove buffer from freelist */
+			StrategyControl->firstFreeBuffer = buf->freeNext;
+			buf->freeNext = FREENEXT_NOT_IN_LIST;
 
-		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
-		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
-		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
-		}
-		UnlockBufHdr(buf);
+			/*
+			 * Release the lock so someone else can access the freelist (or run
+			 * the clocksweep) while we check out this buffer.
+			 */
+			SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 
-		/* Reacquire the lock and go around for another pass. */
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.  (This can only happen if VACUUM put a
+			 * valid buffer in the freelist and then someone else used it before
+			 * we got to it.  It's probably impossible altogether as of 8.3, but
+			 * we'd better check anyway.)
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+			/* Reacquire the lock and go around for another pass. */
+			SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+		}
+		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
 	for (;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
+		int victim;
+
+		victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
 
-		if (++StrategyControl->nextVictimBuffer >= NBuffers)
+		buf = &BufferDescriptors[victim % NBuffers];
+
+		if (victim % NBuffers == 0)
 		{
-			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
+			pg_atomic_add_fetch_u32(&StrategyControl->completePasses, 1);
 		}
 
-		/* Release the lock before manipulating the candidate buffer. */
-		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; decrement the usage_count (unless pinned) and keep scanning.
@@ -238,9 +268,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy)
 			elog(ERROR, "no unpinned buffers available");
 		}
 		UnlockBufHdr(buf);
-
-		/* Reacquire the lock and get a new candidate buffer. */
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	}
 }
 
@@ -284,13 +311,12 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 	int			result;
 
 	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-	result = StrategyControl->nextVictimBuffer;
+	result = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer) % NBuffers;
 	if (complete_passes)
-		*complete_passes = StrategyControl->completePasses;
+		*complete_passes = pg_atomic_read_u32(&StrategyControl->completePasses);
 	if (num_buf_alloc)
 	{
-		*num_buf_alloc = StrategyControl->numBufferAllocs;
-		StrategyControl->numBufferAllocs = 0;
+		*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
 	}
 	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 	return result;
@@ -389,11 +415,11 @@ StrategyInitialize(bool init)
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
 
 		/* Initialize the clock sweep pointer */
-		StrategyControl->nextVictimBuffer = 0;
+		pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
 
 		/* Clear statistics */
-		StrategyControl->completePasses = 0;
-		StrategyControl->numBufferAllocs = 0;
+		pg_atomic_init_u32(&StrategyControl->completePasses, 0);
+		pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
-- 
1.8.3.251.g1462b67

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: Lockless StrategyGetBuffer() clock sweep

On Mon, Oct 27, 2014 at 9:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I've previously posted a patch at
http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
that reduces contention in StrategyGetBuffer() by making the clock sweep
lockless. Robert asked me to post it to a new thread; I originally
wrote it to see some other contention in more detail, that's why it
ended up in that thread...

Does this LATCHPTR_ACCESS_ONCE crap really do anything? Why do we
need that? I'm not convinced that it's actually solving any problem
we would otherwise have with this code. But if it is, then how about
adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
and just checking the flag instead of checking bgwriterLatch itself?

Actually, the same approach would work for INT_ACCESS_ONCE. So I
propose this approach instead: Add a new atomic flag to
StrategyControl, useSlowPath. If StrategyGetBuffer() finds
useSlowPath set, call StrategyGetBufferSlow(), which will acquire the
spinlock, check whether bgwriterLatch is set and/or the freelist is
non-empty and return a buffer ID if able to allocate from the
freelist; otherwise -1. If StrategyGetBufferSlow() returns -1, or we
decide not to call it in the first place because useSlowPath isn't
set, then fall through to your clock-sweep logic. We can set
useSlowPath at startup and whenever we put buffers on the free list.
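
Very roughly, the shape I have in mind is something like the sketch below.
This is standalone illustration code (C11 atomics and a pthread mutex stand in
for our spinlock), not something that applies to the tree; useSlowPath and
StrategyGetBufferSlow are just the names proposed above.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
	pthread_mutex_t buffer_strategy_lock;	/* stand-in for the spinlock */
	atomic_bool	useSlowPath;		/* armed at startup, when buffers are
						 * put on the freelist, and when the
						 * bgwriter asks to be notified */
	int		firstFreeBuffer;
	bool		bgwriterLatchSet;
} StrategyControlSketch;

static StrategyControlSketch ctl = {
	.buffer_strategy_lock = PTHREAD_MUTEX_INITIALIZER,
	.useSlowPath = true,			/* armed at startup */
	.firstFreeBuffer = -1,
	.bgwriterLatchSet = false
};

/* Slow path: take the lock, handle both rare cases, return buffer id or -1. */
static int
StrategyGetBufferSlow(void)
{
	int		result = -1;
	bool	wakeBgwriter;

	pthread_mutex_lock(&ctl.buffer_strategy_lock);
	wakeBgwriter = ctl.bgwriterLatchSet;
	ctl.bgwriterLatchSet = false;
	if (ctl.firstFreeBuffer >= 0)
		result = ctl.firstFreeBuffer;	/* the real code would pop it here */
	else
		atomic_store(&ctl.useSlowPath, false);	/* nothing left; one way to
							 * disarm the flag */
	pthread_mutex_unlock(&ctl.buffer_strategy_lock);

	if (wakeBgwriter)
	{
		/* SetLatch() would go here, outside the lock */
	}

	return result;
}

static int
StrategyGetBuffer_sketch(void)
{
	/* a single unlocked flag test guards both uncommon paths */
	if (atomic_load_explicit(&ctl.useSlowPath, memory_order_relaxed))
	{
		int		buf = StrategyGetBufferSlow();

		if (buf >= 0)
			return buf;
	}

	/* ... otherwise fall through to the lockless clock sweep ... */
	return -1;
}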

The lack of memory barriers while testing useSlowPath (or, in your
version, when testing the ACCESS_ONCE conditions) means that we could
fail to set the bgwriter latch or pop from the freelist if another
backend has very recently updated those things. But that's not
actually a problem; it's fine for any individual request to skip those
things, as they are more like hints than correctness requirements.

The interaction between this and the bgreclaimer idea is interesting.
We can't make popping the freelist lockless without somehow dealing
with the resulting A-B-A problem (namely, that between the time we
read &head->next and the time we CAS the list-head to that value, the
head might have been popped, another item pushed, and the original
list head pushed again). So even if bgreclaimer saves some work for
individual backends - avoiding the need for them to clock-sweep across
many buffers - it may not be worth it if it means taking a spinlock to
pop the freelist instead of using an atomics-driven clock sweep.
Considering that there may be a million plus buffers to scan through,
that's a surprising conclusion, but it seems to be where the data is
pointing us.
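
To spell the race out, here is a minimal standalone sketch with C11 atomics -
Node and freelist_pop_naive are made-up illustration names, not our buffer
structures; the window between reading head->next and the compare-exchange is
where the A-B-A swap can slip in:

#include <stdatomic.h>
#include <stddef.h>

typedef struct Node
{
	struct Node *next;
	int			buf_id;
} Node;

static _Atomic(Node *) freelist_head;

/*
 * Naive lockless pop.  Between the load of "head" (and of head->next) and the
 * compare-exchange, another backend may pop "head", pop or push other nodes,
 * and push "head" back.  The CAS then still succeeds because the pointer value
 * matches, but "next" may no longer be the current second element, corrupting
 * the list.
 */
static Node *
freelist_pop_naive(void)
{
	Node	   *head;
	Node	   *next;

	do
	{
		head = atomic_load(&freelist_head);
		if (head == NULL)
			return NULL;
		next = head->next;		/* may already be stale by the time we CAS */
	} while (!atomic_compare_exchange_weak(&freelist_head, &head, next));

	return head;
}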

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#3 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#2)
Re: Lockless StrategyGetBuffer() clock sweep

On 2014-10-29 14:18:33 -0400, Robert Haas wrote:

On Mon, Oct 27, 2014 at 9:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I've previously posted a patch at
http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
that reduces contention in StrategyGetBuffer() by making the clock sweep
lockless. Robert asked me to post it to a new thread; I originally
wrote it to see some other contention in more detail, that's why it
ended up in that thread...

Does this LATCHPTR_ACCESS_ONCE crap really do anything?

Well, it forces the variable to be read from memory, instead of allowing
it to be read from a register... I'd much rather have a generic,
type-independent ACCESS_ONCE macro, but I'm not aware how to do that in a
way that works across compilers. It also prevents the compiler from
doing speculative writes.
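
For reference, the only type-generic spelling I know of is the one the Linux
kernel uses, and it leans on __typeof__, which is exactly the kind of
compiler dependence that keeps it from being portable:

/* gcc-ism: type-generic forced load/store through a volatile lvalue */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* usage: bgwriterLatch = ACCESS_ONCE(StrategyControl->bgwriterLatch); */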

Why do we need that? I'm not convinced that it's actually solving any
problem we would otherwise have with this code.

I agree that in this case we can get rid of it.

But if it is, then how about
adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
and just checking the flag instead of checking bgwriterLatch itself?

Yea, that'd be nicer. I didn't do it because it made the patch slightly
more invasive... The complexity really is only needed because we're not
guaranteed that 64bit reads are atomic. Although we actually can be
sure, because there's no platform with nonatomic intptr_t reads - so
64bit platforms actually *do* have atomic 64bit reads/writes.

So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
don't even need to take a lock to set it. Afaics the worst that can
happen is that several processes set the latch...

Actually, the same approach would work for INT_ACCESS_ONCE. So I
propose this approach instead: Add a new atomic flag to
StrategyControl, useSlowPath. If StrategyGetBuffer() finds
useSlowPath set, call StrategyGetBufferSlow(), which will acquire the
spinlock, check whether bgwriterLatch is set and/or the freelist is
non-empty and return a buffer ID if able to allocate from the
freelist; otherwise -1. If StrategyGetBufferSlow() returns -1, or we
decide not to call it in the first place because useSlowPath isn't
set, then fall through to your clock-sweep logic. We can set
useSlowPath at startup and whenever we put buffers on the free list.

I don't like the idea of conflating the slowpath for the bgwriter latch
with the freelist slowpath. And I don't really see what that'd buy us
over what's in the patch right now?

The lack of memory barriers while testing useSlowPath (or, in your
version, when testing the ACCESS_ONCE conditions) means that we could
fail to set the bgwriter latch or pop from the freelist if another
backend has very recently updated those things. But that's not
actually a problem; it's fine for any individual request to skip those
things, as they are more like hints than correctness requirements.

Right.

The interaction between this and the bgreclaimer idea is interesting.
We can't make popping the freelist lockless without somehow dealing
with the resulting A-B-A problem (namely, that between the time we
read &head->next and the time we CAS the list-head to that value, the
head might have been popped, another item pushed, and the original
list head pushed again).

I think if we really feel the need to, we can circumvent the ABA problem
here. But I'm not yet convinced that there's the need to do so. I'm
unsure that a single process that touches all the buffers at some
frequency is actually a good idea on modern NUMA systems...

I wonder if we could make a trylock for spinlocks work - then we could
look at the freelist if the lock is free and just go to the clock cycle
otherwise. My guess is that that'd be quite performant. IIRC it's just
the spinlock semaphore fallback that doesn't know how to do trylock...

So even if bgreclaimer saves some work for
individual backends - avoiding the need for them to clock-sweep across
many buffers - it may not be worth it if it means taking a spinlock to
pop the freelist instead of using an atomics-driven clock sweep.
Considering that there may be a million plus buffers to scan through,
that's a surprising conclusion, but it seems to be where the data is
pointing us.

I'm not really convinced of this yet either. It might just be that the
bgreclaimer implementation isn't good enough. But to me that doesn't
really change the fact that there's clear benefit in this patch - even
if we can make bgreclaimer beneficial the lockless scan will be good.

My feeling is that making bg*writer* more efficient is more likely to be
beneficial overall than introducing bgreclaimer. I've seen the clock
sweeps really hurt in cases where many many buffers are dirty.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#3)
Re: Lockless StrategyGetBuffer() clock sweep

On Thu, Oct 30, 2014 at 12:39 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-10-29 14:18:33 -0400, Robert Haas wrote:

The interaction between this and the bgreclaimer idea is interesting.
We can't make popping the freelist lockless without somehow dealing
with the resulting A-B-A problem (namely, that between the time we
read &head->next and the time we CAS the list-head to that value, the
head might have been popped, another item pushed, and the original
list head pushed again).

I think if we really feel the need to, we can circumvent the ABA problem
here. But I'm not yet convinced that there's the need to do so. I'm
unsure that a single process that touches all the buffers at some
frequency is actually a good idea on modern NUMA systems...

I wonder if we could make a trylock for spinlocks work - then we could
look at the freelist if the lock is free and just go to the clock cycle
otherwise. My guess is that that'd be quite performant. IIRC it's just
the spinlock semaphore fallback that doesn't know how to do trylock...

So even if bgreclaimer saves some work for
individual backends - avoiding the need for them to clock-sweep across
many buffers - it may not be worth it if it means taking a spinlock to
pop the freelist instead of using an atomics-driven clock sweep.
Considering that there may be a million plus buffers to scan through,
that's a surprising conclusion, but it seems to be where the data is
pointing us.

I'm not really convinced of this yet either. It might just be that the
bgreclaimer implementation isn't good enough. But to me that doesn't
really change the fact that there's clear benefit in this patch -

I have a feeling that this might also show some regression at higher
loads (like scale_factor = 5000, shared_buffers = 8GB,
client_count = 128, 256) for reasons similar to the bgreclaimer patch:
although both reduce contention around the spinlock, the contention
just moves somewhere else. I have yet to take data before
concluding anything (I am just waiting for your other patch (wait-free
LW_SHARED) to be committed).
I think once wait-free LW_SHARED is in, we can evaluate this patch
(and, if required, see whether there is any interaction between
this and bgreclaimer). However, if you want, this can be done
separately from the wait-free LW_SHARED patch.

even
if we can make bgreclaimer beneficial the lockless scan will be good.

My feeling is that making bg*writer* more efficient is more likely to be
beneficial overall than introducing bgreclaimer.

One idea could be to make bgwriter something similar to the autovacuum
launcher, which means that bgwriter could launch different workers based
on the kind of need.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#5 Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#4)
Re: Lockless StrategyGetBuffer() clock sweep

On 2014-10-30 10:23:56 +0530, Amit Kapila wrote:

I have a feeling that this might also show some regression at higher
loads (like scale_factor = 5000, shared_buffers = 8GB,
client_count = 128, 256) for reasons similar to the bgreclaimer patch:
although both reduce contention around the spinlock, the contention
just moves somewhere else. I have yet to take data before
concluding anything (I am just waiting for your other patch (wait-free
LW_SHARED) to be committed).

I have a hard time seeing how this could be. In the uncontended case the
number of cachelines touched and the number of atomic operations is
exactly the same. In the contended case the new implementation does far
fewer atomic ops - and doesn't do spinning.

What's your theory?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#3)
Re: Lockless StrategyGetBuffer() clock sweep

On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:

But if it is, then how about
adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
and just checking the flag instead of checking bgwriterLatch itself?

Yea, that'd be nicer. I didn't do it because it made the patch slightly
more invasive... The complexity really is only needed because we're not
guaranteed that 64bit reads are atomic. Although we actually can be
sure, because there's no platform with nonatomic intptr_t reads - so
64bit platforms actually *do* have atomic 64bit reads/writes.

So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
don't even need to take a lock to set it. Afaics the worst that can
happen is that several processes set the latch...

OK, that design is fine with me.

Actually, the same approach would work for INT_ACCESS_ONCE. So I
propose this approach instead: Add a new atomic flag to
StrategyControl, useSlowPath. If StrategyGetBuffer() finds
useSlowPath set, call StrategyGetBufferSlow(), which will acquire the
spinlock, check whether bgwriterLatch is set and/or the freelist is
non-empty and return a buffer ID if able to allocate from the
freelist; otherwise -1. If StrategyGetBufferSlow() returns -1, or we
decide not to call it in the first place because useSlowPath isn't
set, then fall through to your clock-sweep logic. We can set
useSlowPath at startup and whenever we put buffers on the free list.

I don't like the idea of conflating the slowpath for the bgwriter latch
with the freelist slowpath. And I don't really see what that'd buy us
over what's in the patch right now?

Well, I was thinking that with a single flag check we could skip two
paths that, as of today, are both uncommonly taken. Moving all that
logic off into a separate function might help the compiler optimize
for the fast case, too.

I wonder if we could make a trylock for spinlocks work - then we could
look at the freelist if the lock is free and just go to the clock cycle
otherwise. My guess is that that'd be quite performant. IIRC it's just
the spinlock semaphore fallback that doesn't know how to do trylock...

I could go for that. I don't think the semaphore thing is really a
problem. If the spinlock semaphore implementation indeed can't
support that, then just have it fall back to waiting for the lock and
always returning true. Semaphores are so slow that it's not going to
make any difference to performance.
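
As a standalone sketch of what I mean (spin_try_acquire is made up here; C11
atomics stand in for a TAS-based spinlock):

#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_flag slock_sketch;

/* TAS-based implementations: a single test-and-set attempt, no spin loop */
static inline bool
spin_try_acquire(slock_sketch *lock)
{
	return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static inline void
spin_release(slock_sketch *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * The semaphore fallback would simply wait for the lock and return true, as
 * suggested above - that path is slow regardless, so pessimism doesn't hurt.
 *
 * Caller sketch:
 *
 *	if (spin_try_acquire(&buffer_strategy_lock))
 *	{
 *		... pop a buffer from the freelist ...
 *		spin_release(&buffer_strategy_lock);
 *	}
 *	else
 *		... skip the freelist and run the clock sweep ...
 */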

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#5)
1 attachment(s)
Re: Lockless StrategyGetBuffer() clock sweep

On Thu, Oct 30, 2014 at 5:01 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-10-30 10:23:56 +0530, Amit Kapila wrote:

I have a feeling that this might also show some regression at higher
loads (like scale_factor = 5000, shared_buffers = 8GB,
client_count = 128, 256) for reasons similar to the bgreclaimer patch:
although both reduce contention around the spinlock, the contention
just moves somewhere else. I have yet to take data before
concluding anything (I am just waiting for your other patch (wait-free
LW_SHARED) to be committed).

I have a hard time seeing how this could be. In the uncontended case the
number of cachelines touched and the number of atomic operations is
exactly the same. In the contended case the new implementation does far
fewer atomic ops - and doesn't do spinning.

What's your theory?

I have observed that once we reduce the contention in one path, it doesn't
always lead to a performance/scalability gain; the contention rather shifts
to another lock if one exists. This is the reason why we have to work on
reducing contention around both the BufFreeList and buffer mapping table
locks together. I have taken some performance data, and it seems this patch
also exhibits behaviour similar to the bgreclaimer patch; I believe resolving
contention around dynahash can improve the situation (Robert's chash patch
can be helpful).

Performance Data
------------------------------
Configuration and Db Details
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
max_connections = 300
shared_buffers = 8GB
checkpoint_segments=30
checkpoint_timeout =15min
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
Test mode - pgbench readonly (-M prepared)

Data below is the median of 3 runs; for individual run data, check the
document attached to this mail.

Scale_Factor = 1000
Patch_ver/Client_Count        128        256
HEAD                       265502     201689
Patch                      283448     224888

Scale_Factor = 5000
Patch_ver/Client_Count        128        256
HEAD                       190435     177477
Patch                      171485     167794

The above data indicates that there is a performance gain at scale factor
1000, but a regression at scale factor 5000.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

perf_lockless_strategy_getbuf.ods (application/vnd.oasis.opendocument.spreadsheet)
#8 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#6)
Re: Lockless StrategyGetBuffer() clock sweep

On 2014-10-30 07:55:08 -0400, Robert Haas wrote:

On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:

But if it is, then how about
adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
and just checking the flag instead of checking bgwriterLatch itself?

Yea, that'd be nicer. I didn't do it because it made the patch slightly
more invasive... The complexity really is only needed because we're not
guaranteed that 64bit reads are atomic. Although we actually can be
sure, because there's no platform with nonatomic intptr_t reads - so
64bit platforms actually *do* have atomic 64bit reads/writes.

So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
don't even need to take a lock to set it. Afaics the worst that can
happen is that several processes set the latch...

OK, that design is fine with me.

There's a slight problem with this, namely restarts of bgwriter. If it
crashes, the reference to the relevant latch will currently be reset via
StrategyNotifyBgWriter(). In reality that's not a problem because
sizeof(void*) writes are always atomic, but currently we don't assume
that. We'd sometimes write to an old latch, but that's harmless because
latches are never deallocated.

So I can see several solutions right now:
1) Redefine our requirements so that aligned sizeof(void*) writes are
always atomic. That doesn't affect any currently supported platform
and won't ever affect any future platform either. E.g. linux has had
that requirement for a long time.
2) Use an explicitly defined latch for the bgwriter instead of using the
PGPROC->procLatch. That way it never has to be reset and all
SetLatch()s will eventually go to the right process.
3) Continue requiring the spinlock when setting the latch.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#9 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#8)
Re: Lockless StrategyGetBuffer() clock sweep

On Mon, Dec 8, 2014 at 2:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-10-30 07:55:08 -0400, Robert Haas wrote:

On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:

But if it is, then how about
adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
and just checking the flag instead of checking bgwriterLatch itself?

Yea, that'd be nicer. I didn't do it because it made the patch slightly
more invasive... The complexity really is only needed because we're not
guaranteed that 64bit reads are atomic. Although we actually can be
sure, because there's no platform with nonatomic intptr_t reads - so
64bit platforms actually *do* have atomic 64bit reads/writes.

So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
don't even need to take a lock to set it. Afaics the worst that can
happen is that several processes set the latch...

OK, that design is fine with me.

There's a slight problem with this, namely restarts of bgwriter. If it
crashes, the reference to the relevant latch will currently be reset via
StrategyNotifyBgWriter(). In reality that's not a problem because
sizeof(void*) writes are always atomic, but currently we don't assume
that. We'd sometimes write to an old latch, but that's harmless because
latches are never deallocated.

So I can see several solutions right now:
1) Redefine our requirements so that aligned sizeof(void*) writes are
always atomic. That doesn't affect any currently supported platform
and won't ever affect any future platform either. E.g. linux has had
that requirement for a long time.
2) Use an explicitly defined latch for the bgwriter instead of using the
PGPROC->procLatch. That way it never has to be reset and all
SetLatch()s will eventually go to the right process.
3) Continue requiring the spinlock when setting the latch.

Maybe you could store the pgprocno instead of the PROC *.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#9)
1 attachment(s)
Re: Lockless StrategyGetBuffer() clock sweep

On 2014-12-10 21:52:17 -0500, Robert Haas wrote:

Maybe you could store the pgprocno instead of the PROC *.

That's a good idea. Here's a patch implementing that and other things.

Changes:
* The handling of wraparound is slightly changed. There's a separate path
for the case where nextVictimBuffer is above NBuffers. That allows a)
avoiding the somewhat expensive modulo operation in the common case and b)
always returning consistent results from StrategySyncStart().
* StrategySyncStart() doesn't have a situation in which it can return
inaccurate information anymore. That could actually trigger an
assertion in bgwriter.
* There was a bug because the local victim variable was signed instead
of unsigned.
* Clock sweep ticks are moved into a separate routine.

Comments?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Lockless-StrategyGetBuffer-clock-sweep-hotpath.patch (text/x-patch; charset=us-ascii)
From b757437498365f1f8f76fe878ec263e25cb8347f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 23 Dec 2014 21:30:14 +0100
Subject: [PATCH] Lockless StrategyGetBuffer clock sweep hotpath.

StrategyGetBuffer() has proven to be a bottleneck in a number of
buffer acquiration heavy workloads. To some degree this has already
been aleviated by 5d7962c6, but it's still can be quite a heavy
bottleneck.  The problem is that in unfortunate usage patterns a
single StrategyGetBuffer() call will have to look at a large number of
buffers - in turn making it likely that the process will be put to
sleep while still holding the spinlock.

Replace most of the buffer_strategy_lock for the clock sweep by a
atomic nextVictimBuffer variable. That variable, modulo NBuffers, is
the current hand of the clock sweep. The buffer clocksweep only needs
to acquire the spinlock after a wraparound. And even then, only in the
process wrapping around. That aleviates nearly all the contention in
the relevant spinlock.

Reviewed-By: Robert Haas and Amit Kapila
---
 src/backend/postmaster/bgwriter.c     |   4 +-
 src/backend/storage/buffer/freelist.c | 253 ++++++++++++++++++++++++----------
 src/include/storage/buf_internals.h   |   2 +-
 3 files changed, 180 insertions(+), 79 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..101854a 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -373,13 +373,13 @@ BackgroundWriterMain(void)
 		if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
 		{
 			/* Ask for notification at next buffer allocation */
-			StrategyNotifyBgWriter(&MyProc->procLatch);
+			StrategyNotifyBgWriter(MyProc->pgprocno);
 			/* Sleep ... */
 			rc = WaitLatch(&MyProc->procLatch,
 						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
 						   BgWriterDelay * HIBERNATE_FACTOR);
 			/* Reset the notification request in case we timed out */
-			StrategyNotifyBgWriter(NULL);
+			StrategyNotifyBgWriter(-1);
 		}
 
 		/*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 5966beb..eb01811 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,8 +15,12 @@
  */
 #include "postgres.h"
 
+#include "port/atomics.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/proc.h"
+
+#define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)&(var))))
 
 
 /*
@@ -27,8 +31,12 @@ typedef struct
 	/* Spinlock: protects the values below */
 	slock_t		buffer_strategy_lock;
 
-	/* Clock sweep hand: index of next buffer to consider grabbing */
-	int			nextVictimBuffer;
+	/*
+	 * Clock sweep hand: index of next buffer to consider grabbing. Note that
+	 * this isn't a concrete buffer - we only ever increase the value. So, to
+	 * get an actual buffer, it needs to be used modulo NBuffers.
+	 */
+	pg_atomic_uint32 nextVictimBuffer;
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
@@ -42,13 +50,14 @@ typedef struct
 	 * Statistics.  These counters should be wide enough that they can't
 	 * overflow during a single bgwriter cycle.
 	 */
-	uint32		completePasses; /* Complete cycles of the clock sweep */
-	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
+	uint32		 completePasses; /* Complete cycles of the clock sweep */
+	pg_atomic_uint32 numBufferAllocs;	/* Buffers allocated since last reset */
 
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * Bgworker process to be notified upon activity or -1 if none. See
+	 * StrategyNotifyBgWriter.
 	 */
-	Latch	   *bgwriterLatch;
+	int			bgwprocno;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
@@ -93,6 +102,70 @@ static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
 static void AddBufferToRing(BufferAccessStrategy strategy,
 				volatile BufferDesc *buf);
 
+/*
+ * ClockSweepTick - Helper routine for StrategyGetBuffer()
+ *
+ * Move the clock hand one buffer ahead of its current position and return the
+ * id of the buffer now under the hand.
+ */
+static inline uint32
+ClockSweepTick(void)
+{
+	uint32 victim;
+
+	/*
+	 * Atomically move hand ahead one buffer - if there's several processes
+	 * doing this, this can lead to buffers being returned slightly out of
+	 * apparent order.
+	 */
+	victim =
+		pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+
+	if (victim >= NBuffers)
+	{
+		uint32 originalVictim = victim;
+
+		/* always wrap what we look up in BufferDescriptors */
+		victim = victim % NBuffers;
+
+		/*
+		 * If we're the one that just caused a wraparound, force
+		 * completePasses to be incremented while holding the spinlock. We
+		 * need the spinlock so StrategySyncStart() can return a consistent
+		 * value consisting of nextVictimBuffer and completePasses.
+		 */
+		if (victim == 0)
+		{
+			uint32 expected;
+			uint32 wrapped;
+			bool success = false;
+
+			expected = originalVictim + 1;
+
+			while (!success)
+			{
+				/*
+				 * Acquire the spinlock while increasing completePasses. That
+				 * allows other readers to read nextVictimBuffer and
+				 * completePasses in a consistent manner which is required for
+				 * StrategySyncStart().  In theory delaying the increment
+				 * could lead to a overflow of nextVictimBuffers, but that's
+				 * highly unlikely and wouldn't be particularly harmful.
+				 */
+				SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+				wrapped = expected % NBuffers;
+
+				success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
+														 &expected, wrapped);
+				if (success)
+					StrategyControl->completePasses++;
+				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+			}
+		}
+	}
+	return victim;
+}
 
 /*
  * StrategyGetBuffer
@@ -110,7 +183,7 @@ volatile BufferDesc *
 StrategyGetBuffer(BufferAccessStrategy strategy)
 {
 	volatile BufferDesc *buf;
-	Latch	   *bgwriterLatch;
+	int			bgwprocno;
 	int			trycounter;
 
 	/*
@@ -124,86 +197,107 @@ StrategyGetBuffer(BufferAccessStrategy strategy)
 			return buf;
 	}
 
-	/* Nope, so lock the freelist */
-	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+	/*
+	 * If asked, we need to waken the bgwriter. Since we don't want to rely on
+	 * a spinlock for this we force a read from shared memory once, and then
+	 * set the latch based on that value. We need to go through that length
+	 * because otherwise bgprocno might be reset while/after we check because
+	 * the compiler might just reread from memory.
+	 *
+	 * This can possibly set the latch of the wrong process if the bgwriter
+	 * dies in the wrong moment. But since PGPROC->procLatch is never
+	 * deallocated the worst consequence of that is that we set the latch of
+	 * some arbitrary process.
+	 */
+	bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
+	if (bgwprocno != -1)
+	{
+		/* reset bgwprocno first, before setting the latch */
+		StrategyControl->bgwprocno = -1;
+		pg_write_barrier();
+
+		/*
+		 * Not acquiring ProcArrayLock here which is slightly icky. It's
+		 * actually fine because procLatch isn't ever freed, so we just can
+		 * potentially set the wrong process' (or no process') latch.
+		 */
+		SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
+	}
 
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
 	 * the rate of buffer consumption.  Note that buffers recycled by a
 	 * strategy object are intentionally not counted here.
 	 */
-	StrategyControl->numBufferAllocs++;
+	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding buffer_strategy_lock; so release and re-grab.
-	 * This is annoyingly tedious, but it happens at most once per bgwriter
-	 * cycle, so the performance hit is minimal.
+	 * First check, without acquiring the lock, wether there's buffers in the
+	 * freelist. Since we otherwise don't require the spinlock in every
+	 * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
+	 * uselessly in most cases. That obviously leaves a race where a buffer is
+	 * put on the freelist but we don't see the store yet - but that's pretty
+	 * harmless, it'll just get used during the next buffer acquiration.
+	 *
+	 * If there's buffers on the freelist, acquire the spinlock to pop one
+	 * buffer of the freelist. Then check whether that buffer is usable and
+	 * repeat if not.
+	 *
+	 * Note that the freeNext fields are considered to be protected by the
+	 * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
+	 * manipulate them without holding the spinlock.
 	 */
-	bgwriterLatch = StrategyControl->bgwriterLatch;
-	if (bgwriterLatch)
+	if (StrategyControl->firstFreeBuffer >= 0)
 	{
-		StrategyControl->bgwriterLatch = NULL;
-		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-		SetLatch(bgwriterLatch);
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-	}
+		while (true)
+		{
+			/* Acquire the spinlock to remove element from the freelist */
+			SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the buffer_strategy_lock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
-	{
-		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
-		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+			if (StrategyControl->firstFreeBuffer < 0)
+			{
+				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+				break;
+			}
 
-		/* Unconditionally remove buffer from freelist */
-		StrategyControl->firstFreeBuffer = buf->freeNext;
-		buf->freeNext = FREENEXT_NOT_IN_LIST;
+			buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
-		/*
-		 * Release the lock so someone else can access the freelist (or run
-		 * the clocksweep) while we check out this buffer.
-		 */
-		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+			/* Unconditionally remove buffer from freelist */
+			StrategyControl->firstFreeBuffer = buf->freeNext;
+			buf->freeNext = FREENEXT_NOT_IN_LIST;
 
-		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
-		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
-		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
-		}
-		UnlockBufHdr(buf);
+			/*
+			 * Release the lock so someone else can access the freelist while
+			 * we check out this buffer.
+			 */
+			SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 
-		/* Reacquire the lock and go around for another pass. */
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot
+			 * use it; discard it and retry.  (This can only happen if VACUUM
+			 * put a valid buffer in the freelist and then someone else used
+			 * it before we got to it.  It's probably impossible altogether as
+			 * of 8.3, but we'd better check anyway.)
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+		}
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
 	for (;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
-
-		if (++StrategyControl->nextVictimBuffer >= NBuffers)
-		{
-			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
-		}
 
-		/* Release the lock before manipulating the candidate buffer. */
-		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+		buf = &BufferDescriptors[ClockSweepTick()];
 
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -238,9 +332,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy)
 			elog(ERROR, "no unpinned buffers available");
 		}
 		UnlockBufHdr(buf);
-
-		/* Reacquire the lock and get a new candidate buffer. */
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	}
 }
 
@@ -281,16 +372,26 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 int
 StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
+	uint32		nextVictimBuffer;
 	int			result;
 
 	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-	result = StrategyControl->nextVictimBuffer;
+	nextVictimBuffer = pg_atomic_read_u32(&StrategyControl->nextVictimBuffer);
+	result = nextVictimBuffer % NBuffers;
+
 	if (complete_passes)
+	{
 		*complete_passes = StrategyControl->completePasses;
+		/*
+		 * Additionally add the number of wraparounds that happened before
+		 * completePasses could be incremented. C.f. ClockSweepTick().
+		 */
+		*complete_passes += nextVictimBuffer / NBuffers;
+	}
+
 	if (num_buf_alloc)
 	{
-		*num_buf_alloc = StrategyControl->numBufferAllocs;
-		StrategyControl->numBufferAllocs = 0;
+		*num_buf_alloc = pg_atomic_exchange_u32(&StrategyControl->numBufferAllocs, 0);
 	}
 	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 	return result;
@@ -305,7 +406,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
  * from hibernation, and is not meant for anybody else to use.
  */
 void
-StrategyNotifyBgWriter(Latch *bgwriterLatch)
+StrategyNotifyBgWriter(int bgwprocno)
 {
 	/*
 	 * We acquire buffer_strategy_lock just to ensure that the store appears
@@ -313,7 +414,7 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
 	 * infrequently, so there's no performance penalty from being safe.
 	 */
 	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-	StrategyControl->bgwriterLatch = bgwriterLatch;
+	StrategyControl->bgwprocno = bgwprocno;
 	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
 
@@ -389,14 +490,14 @@ StrategyInitialize(bool init)
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
 
 		/* Initialize the clock sweep pointer */
-		StrategyControl->nextVictimBuffer = 0;
+		pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
 
 		/* Clear statistics */
 		StrategyControl->completePasses = 0;
-		StrategyControl->numBufferAllocs = 0;
+		pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
 
 		/* No pending notification */
-		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->bgwprocno = -1;
 	}
 	else
 		Assert(!init);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0e69b63..7cf37b6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -191,7 +191,7 @@ extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
-extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
-- 
2.2.0.rc0.18.ga1ad247

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#10)
Re: Lockless StrategyGetBuffer() clock sweep

On Tue, Dec 23, 2014 at 3:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-12-10 21:52:17 -0500, Robert Haas wrote:

Maybe you could store the pgprocno instead of the PROC *.

That's a good idea. Here's a patch implementing that and other things.

Thanks. Cool.

Changes:
* The handling of wraparound is slightly changed. There's a separate path
for the case where nextVictimBuffer is above NBuffers. That allows a)
avoiding the somewhat expensive modulo operation in the common case and b)
always returning consistent results from StrategySyncStart().
* StrategySyncStart() doesn't have a situation in which it can return
inaccurate information anymore. That could actually trigger an
assertion in bgwriter.
* There was a bug because the local victim variable was signed instead
of unsigned.
* Clock sweep ticks are moved into a separate routine.

Comments?

You need to spell-check your commit message. My spell-checker
complains about acquiration, aleviated, aleviates, and clocksweep.
The first is not a word at all; perhaps you meant acquisition. The
next two, I believe, need a double-l rather than a single-l.
"acquiration" also appears in the text of the patch, along with
"wether" (should be "whether"). And the last one should be two words.

I don't think I have anything to say about the substance of the patch.
If fetch-and-add is faster than a spinlock cycle, then it is. And
it's good to be fast.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#12 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#11)
Re: Lockless StrategyGetBuffer() clock sweep

On 2014-12-23 16:42:41 -0500, Robert Haas wrote:

I don't think I have anything to say about the substance of the patch.
If fetch-and-add is faster than a spinlock cycle, then it is. And
it's good to be fast.

I don't think the primary advantage is that it's fast (even though it
should be as fast as a single TAS on x86). It's that you can never sleep
while holding the spinlock, because there's no such spinlock anymore, and
that every time you transfer the cacheline from another CPU, you'll also
make progress...
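
To make that concrete, a minimal standalone sketch (C11 atomics, not the patch
itself; clock_tick_sketch is a made-up name):

#include <stdatomic.h>

static atomic_uint nextVictimBuffer;	/* only ever increases */

/* one clock-sweep tick: a single atomic RMW, nothing to get preempted under */
static unsigned
clock_tick_sketch(unsigned nbuffers)
{
	unsigned	victim = atomic_fetch_add(&nextVictimBuffer, 1);

	return victim % nbuffers;	/* the buffer actually under the hand */
}

Every backend that wins the cacheline completes its fetch-and-add, so the hand
always advances; there's no window where the holder of a lock can be scheduled
out and block everyone else.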

Will fix the other stuff.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
