Scaling shared buffer eviction

Started by Amit Kapilaover 11 years ago132 messages

amit.kapila16@gmail.com

over 11 years ago

1 attachment(s)

As mentioned previously about my interest in improving shared
buffer eviction especially by reducing contention around
BufFreelistLock, I would like to share my progress about the
same.

The test used for this work is mainly the case when all the
data doesn't fit in shared buffers, but does fit in memory.
It is mainly based on previous comparison done by Robert
for similar workload:
http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html

To start with, I have taken LWLOCK_STATS report to confirm
the contention around BufFreelistLock and the data for HEAD
is as follows:

M/c details
IBM POWER-7 16 cores, 64 hardware threads
RAM - 64GB
Test
scale factor = 3000
shared_buffers = 8GB
number_of_threads = 64
duration = 5mins
./pgbench -c 64 -j 64 -T 300 -S postgres

LWLOCK_STATS data for BufFreeListLock
PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023

Here the high *blk* count for scale factor 3000 clearly shows
that to find a usable buffer when data doesn't fit in shared buffers
it has to wait.

To solve this issue, I have implemented a patch which makes
sure that there are always enough buffers on freelist such that
the need for backend to run clock-sweep is minimal, the
implementation idea is more or less same as discussed
previously in below thread, so I will explain it at end of mail.
/messages/by-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com

LWLOCK_STATS data after Patch (test used is same as
used for HEAD):

BufFreeListLock
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

Here the low *exacq* and *blk* count shows that the need to
run clock sweep for backend has reduced significantly.

Performance Data
-------------------------------
shared_buffers= 8GB
number of threads - 64
sc - scale factor

sc tps
Head 3000 45569
Patch 3000 46457
Head 1000 93037
Patch 1000 92711

Above data shows that there is no significant change in
performance or scalability even after the contention is
reduced significantly around BufFreelistLock.

I have analyzed the patch both with perf record and
LWLOCK_STATS, both indicates that there is a high
contention around BufMappingLocks.

Data With perf record -a -g
-----------------------------------------

+  10.14%          swapper  [kernel.kallsyms]      [k]
.pseries_dedicated_idle_sleep
+   7.77%         postgres  [kernel.kallsyms]      [k] ._raw_spin_lock
+   6.88%         postgres  [kernel.kallsyms]      [k]
.function_trace_call
+   4.15%          pgbench  [kernel.kallsyms]      [k] .try_to_wake_up
+   3.20%          swapper  [kernel.kallsyms]      [k]
.function_trace_call
+   2.99%          pgbench  [kernel.kallsyms]      [k]
.function_trace_call
+   2.41%         postgres  postgres               [.] AllocSetAlloc
+   2.38%         postgres  [kernel.kallsyms]      [k] .try_to_wake_up
+   2.27%          pgbench  [kernel.kallsyms]      [k] ._raw_spin_lock
+   1.49%         postgres  [kernel.kallsyms]      [k]
._raw_spin_lock_irq
+   1.36%         postgres  postgres               [.]
AllocSetFreeIndex
+   1.09%          swapper  [kernel.kallsyms]      [k] ._raw_spin_lock
+   0.91%         postgres  postgres               [.] GetSnapshotData
+   0.90%         postgres  postgres               [.]
MemoryContextAllocZeroAligned

Expanded graph
------------------------------

- 10.14% swapper [kernel.kallsyms] [k]
.pseries_dedicated_idle_sleep
- .pseries_dedicated_idle_sleep
- 10.13% .pseries_dedicated_idle_sleep
- 10.13% .cpu_idle
- 10.00% .start_secondary
.start_secondary_prolog
- 7.77% postgres [kernel.kallsyms] [k] ._raw_spin_lock
- ._raw_spin_lock
- 6.63% ._raw_spin_lock
- 5.95% .double_rq_lock
- .load_balance
- 5.95% .__schedule
- .schedule
- 3.27% .SyS_semtimedop
.SyS_ipc
syscall_exit
semop
PGSemaphoreLock
LWLockAcquireCommon
- LWLockAcquire
- 3.27% BufferAlloc
ReadBuffer_common
- ReadBufferExtended
- 3.27% ReadBuffer
- 2.73% ReleaseAndReadBuffer
- 1.70% _bt_relandgetbuf
_bt_search
_bt_first
btgettuple

It shows BufferAlloc->LWLOCK as top contributor and we use
BufMappingLocks in BufferAlloc, I have checked other expanded
calls as well, StrategyGetBuffer is not present in top contributors.

Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after patch, there is no contention
for BufFreeListLock, rather there is a huge contention around
BufMappingLocks. I have checked that HEAD also has contention
around BufMappingLocks.

As per my analysis till now, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability, we need
to work on reducing contention around BufMappingLocks as well.

Details of patch
------------------------
1. Changed bgwriter to move buffers (having usage_count as zero)
on free list based on threshold (high_watermark) and decrement the
usage count if usage_count is greater than zero.
2. StrategyGetBuffer() will wakeup bgwriter when the number of
buffers in freelist drop under low_watermark.
Currently I am using hard-coded values, we can choose to make
them as configurable later on if required.
3. Work done to get a buffer from freelist is done under spin lock
and clock sweep still runs under BufFreelistLock.

This is still a WIP patch and some of the changes are just kind
of prototype to check the idea, like I have hacked bgwriter code
such that it continuously fills the freelist till it is able to put
enough buffers on freelist such that it reaches high_watermark
and commented some part of previous code.

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v1.patchapplication/octet-stream; name=scalable_buffer_eviction_v1.patchDownload

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..f2804f1 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -252,12 +252,19 @@ BackgroundWriterMain(void)
 	prev_hibernate = false;
 
 	/*
+	 * Initialize the freelist latch. ToDo, this needs to be done under
+	 * spinlock which will be used to protect freelist.
+	 */
+
+	StrategyInitFreeListLatch(&MyProc->procLatch);
+
+	/*
 	 * Loop forever
 	 */
 	for (;;)
 	{
-		bool		can_hibernate;
-		int			rc;
+		bool		can_hibernate = 0;
+		int			rc = 0;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(&MyProc->procLatch);
@@ -281,7 +288,7 @@ BackgroundWriterMain(void)
 		/*
 		 * Do one cycle of dirty-buffer writing.
 		 */
-		can_hibernate = BgBufferSync();
+		/*can_hibernate = BgBufferSync();		*/
 
 		/*
 		 * Send off activity statistics to the stats collector
@@ -339,6 +346,14 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Sleep untill signalled by backend.
+		 */
+		WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
+
+		BgBufferSyncAndMoveBuffersToFreelist();
+
+
+		/*
 		 * Sleep until we are signaled or BgWriterDelay has elapsed.
 		 *
 		 * Note: the feedback control loop in BgBufferSync() expects that we
@@ -348,9 +363,9 @@ BackgroundWriterMain(void)
 		 * down with latch events that are likely to happen frequently during
 		 * normal operation.
 		 */
-		rc = WaitLatch(&MyProc->procLatch,
+		/*rc = WaitLatch(&MyProc->procLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-					   BgWriterDelay /* ms */ );
+					   BgWriterDelay  ms  );*/
 
 		/*
 		 * If no latch event and BgBufferSync says nothing's happening, extend
@@ -370,17 +385,17 @@ BackgroundWriterMain(void)
 		 * for two consecutive cycles.  Also, we mitigate any possible
 		 * consequences of a missed wakeup by not hibernating forever.
 		 */
-		if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
-		{
+		/*if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
+		{*/
 			/* Ask for notification at next buffer allocation */
-			StrategyNotifyBgWriter(&MyProc->procLatch);
+			/*StrategyNotifyBgWriter(&MyProc->procLatch);*/
 			/* Sleep ... */
-			rc = WaitLatch(&MyProc->procLatch,
+			/*rc = WaitLatch(&MyProc->procLatch,
 						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-						   BgWriterDelay * HIBERNATE_FACTOR);
+						   BgWriterDelay * HIBERNATE_FACTOR);*/
 			/* Reset the notification request in case we timed out */
-			StrategyNotifyBgWriter(NULL);
-		}
+			/*StrategyNotifyBgWriter(NULL);
+		}*/
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..7d4efed 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1635,6 +1635,41 @@ BgBufferSync(void)
 	return (bufs_to_lap == 0 && recent_alloc == 0);
 }
 
+void
+BgBufferSyncAndMoveBuffersToFreelist(void)
+{
+	uint32	next_to_clean;
+	uint32	num_to_free;
+	int			num_written;
+	volatile BufferDesc *bufHdr;
+
+	StrategySyncStartAndEnd(&next_to_clean, &num_to_free);
+
+	/* Make sure we can handle the pin inside SyncOneBuffer */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+	num_written = 0;
+
+	/* Execute the LRU scan */
+	while (num_to_free > 0)
+	{
+		int			buffer_state = SyncOneBuffer(next_to_clean, true);
+
+		bufHdr = &BufferDescriptors[next_to_clean];
+		if (++next_to_clean >= NBuffers)
+			next_to_clean = 0;
+		if (buffer_state & BUF_WRITTEN)
+			++num_written;
+		if (buffer_state & BUF_REUSABLE)
+		{
+			if (StrategyMoveBufferToFreeListEnd (bufHdr))
+				num_to_free--;
+		}
+	}
+
+	BgWriterStats.m_buf_written_clean += num_written;
+}
+
 /*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
@@ -1673,6 +1708,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 	else if (skip_recently_used)
 	{
 		/* Caller told us not to write recently-used buffers */
+		if (bufHdr->refcount == 0 && bufHdr->usage_count > 0)
+			bufHdr->usage_count--;
 		UnlockBufHdr(bufHdr);
 		return result;
 	}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..90e3f40 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -42,6 +43,10 @@ typedef struct
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
+	Latch	   *freelistLatch;  /* Latch to wake bgwriter */
+	/* protects freelist variables (firstFreeBuffer, lastFreeBuffer, numFreeListBuffers, BufferDesc->freeNext)*/
+	slock_t	     freelist_lck;
+
 	/*
 	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
 	 */
@@ -112,7 +117,6 @@ volatile BufferDesc *
 StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 {
 	volatile BufferDesc *buf;
-	Latch	   *bgwriterLatch;
 	int			trycounter;
 
 	/*
@@ -129,31 +133,16 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		}
 	}
 
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
-	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
-	 */
-	StrategyControl->numBufferAllocs++;
+	*lock_held = false;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * ideally numFreeListBuffers should get called under freelist
+	 * spinlock, however here we need this number for estimating
+	 * approximate number of free buffers required on freelist,
+	 * so it would be okay, even if numFreeListBuffers is not exact.
 	 */
-	bgwriterLatch = StrategyControl->bgwriterLatch;
-	if (bgwriterLatch)
-	{
-		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
+	if (StrategyControl->numFreeListBuffers < 200)
+		SetLatch(StrategyControl->freelistLatch);
 
 	/*
 	 * Try to get a buffer from the freelist.  Note that the freeNext fields
@@ -161,34 +150,51 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 	 * individual buffer spinlocks, so it's OK to manipulate them without
 	 * holding the spinlock.
 	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	for(;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
-		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 
-		/* Unconditionally remove buffer from freelist */
-		StrategyControl->firstFreeBuffer = buf->freeNext;
-		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		if (StrategyControl->firstFreeBuffer >= 0)
+		{
+			buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
-		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
-		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+			/* Unconditionally remove buffer from freelist */
+			StrategyControl->firstFreeBuffer = buf->freeNext;
+			buf->freeNext = FREENEXT_NOT_IN_LIST;
+			--StrategyControl->numFreeListBuffers;
+
+			SpinLockRelease(&StrategyControl->freelist_lck);
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.  (This can only happen if VACUUM put a
+			 * valid buffer in the freelist and then someone else used it before
+			 * we got to it.  It's probably impossible altogether as of 8.3, but
+			 * we'd better check anyway.)
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+		}
+		else
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			break;
 		}
-		UnlockBufHdr(buf);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
+	*lock_held = true;
+	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
 	for (;;)
 	{
 		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
@@ -196,7 +202,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		if (++StrategyControl->nextVictimBuffer >= NBuffers)
 		{
 			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
+			/*StrategyControl->completePasses++;*/
 		}
 
 		/*
@@ -241,7 +247,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,11 +259,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
 }
 
+
 /*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
@@ -287,6 +332,31 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 	return result;
 }
 
+void
+StrategySyncStartAndEnd(uint32 *start, uint32 *end)
+{
+	int			curfreebuffers;
+	int			reqfreebuffers;
+
+	/*
+	 * ideally numFreeListBuffers should get called under
+	 * freelist spin lock, however here we need this number for
+	 * estimating approximate number of free buffers required
+	 * on freelist, so it would be okay, even if numFreeListBuffers is not exact.
+	 */
+
+	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	*start = StrategyControl->nextVictimBuffer;
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	reqfreebuffers = 2000;
+	if (reqfreebuffers > curfreebuffers)
+		*end = reqfreebuffers - curfreebuffers;
+	else
+		*end = 0;
+	LWLockRelease(BufFreelistLock);
+	return;
+}
+
 /*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
@@ -309,6 +379,19 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
 }
 
 
+void
+StrategyInitFreeListLatch(Latch *bgwriterLatch)
+{
+	/*
+	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
+	 * infrequently, so there's no performance penalty from being safe.
+	 */
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	StrategyControl->freelistLatch= bgwriterLatch;
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
 /*
  * StrategyShmemSize
  *
@@ -376,6 +459,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -386,6 +470,8 @@ StrategyInitialize(bool init)
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->freelistLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
 	}
 	else
 		Assert(!init);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..05ff723 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -188,11 +188,14 @@ extern BufferDesc *LocalBufferDescriptors;
 extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 				  bool *lock_held);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void     StrategySyncStartAndEnd(uint32 *start, uint32 *end);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitFreeListLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..b0e5598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgBufferSyncAndMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#1)

1 attachment(s)

Re: Scaling shared buffer eviction

On Thu, May 15, 2014 at 11:11 AM, Amit Kapila <amit.kapila16@gmail.com>wrote:

Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that after patch, there is no contention
for BufFreeListLock, rather there is a huge contention around
BufMappingLocks. I have checked that HEAD also has contention
around BufMappingLocks.

As per my analysis till now, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability, we need
to work on reducing contention around BufMappingLocks as well.

To reduce the contention around BufMappingLocks, I have tried the patch
by just increasing the Number of Buffer Partitions, and it actually shows
a really significant increase in scalability both due to reduced contention
around BufFreeListLock and BufMappingLocks. The real effect of reducing
contention around BufFreeListLock was hidden because the whole contention
was shifted to BufMappingLocks. I have taken performance data for both
HEAD+increase_buf_part and Patch+increase_buf_part to clearly see the
benefit of reducing contention around BufFreeListLock. This data has been
taken using pgbench read only load (Select).

Performance Data
-------------------------------
HEAD + 64 = HEAD + (NUM_BUFFER_PARTITONS(64) +
LOG2_NUM_LOCK_PARTITIONS(6))
V1 + 64 = PATCH + (NUM_BUFFER_PARTITONS(64) +
LOG2_NUM_LOCK_PARTITIONS(6))
Similarly 128 means 128 buffer partitions

shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

Thrds (64) Thrds (128) HEAD 45562 17128 HEAD + 64 57904 32810 V1 + 64
105557 81011 HEAD + 128 58383 32997 V1 + 128 110705 114544

shared_buffers= 8GB
scale factor = 1000
RAM - 64GB

Thrds (64) Thrds (128) HEAD 92142 31050 HEAD + 64 108120 86367 V1 + 64
117454 123429 HEAD + 128 107762 86902 V1 + 128 123641 124822

Observations
-------------------------
1. There is increase of upto 5 times in performance for data that can fit
in memory but not in shared buffers
2. Though there is a increase in performance by just increasing number
of buffer partitions, but it doesn't scale well (especially see the case
when partitions have increased to 128 from 64).

I have verified that contention has reduced around BufMappingLocks
by running the patch with LWLOCKS

BufFreeListLock
PID 17894 lwlock main 0: shacq 0 exacq 171 blk 27 spindelay 1

BufMappingLocks

PID 17902 lwlock main 38: shacq 12770 exacq 10104 blk 282 spindelay 0
PID 17924 lwlock main 39: shacq 11409 exacq 10257 blk 243 spindelay 0
PID 17929 lwlock main 40: shacq 13120 exacq 10739 blk 239 spindelay 0
PID 17940 lwlock main 41: shacq 11865 exacq 10373 blk 262 spindelay 0
..
..
PID 17831 lwlock main 162: shacq 12706 exacq 10267 blk 199 spindelay 0
PID 17826 lwlock main 163: shacq 11081 exacq 10256 blk 168 spindelay 0
PID 17903 lwlock main 164: shacq 11494 exacq 10375 blk 176 spindelay 0
PID 17899 lwlock main 165: shacq 12043 exacq 10485 blk 216 spindelay 0

We can clearly notice that the number for *blk* has reduced significantly
which shows that contention has reduced.

The patch is still in a shape to prove the merit of idea and I have just
changed the number of partitions so that if someone wants to verify
the performance for similar load, it can be done by just applying
the patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v2.patchapplication/octet-stream; name=scalable_buffer_eviction_v2.patchDownload

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..f2804f1 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -252,12 +252,19 @@ BackgroundWriterMain(void)
 	prev_hibernate = false;
 
 	/*
+	 * Initialize the freelist latch. ToDo, this needs to be done under
+	 * spinlock which will be used to protect freelist.
+	 */
+
+	StrategyInitFreeListLatch(&MyProc->procLatch);
+
+	/*
 	 * Loop forever
 	 */
 	for (;;)
 	{
-		bool		can_hibernate;
-		int			rc;
+		bool		can_hibernate = 0;
+		int			rc = 0;
 
 		/* Clear any already-pending wakeups */
 		ResetLatch(&MyProc->procLatch);
@@ -281,7 +288,7 @@ BackgroundWriterMain(void)
 		/*
 		 * Do one cycle of dirty-buffer writing.
 		 */
-		can_hibernate = BgBufferSync();
+		/*can_hibernate = BgBufferSync();		*/
 
 		/*
 		 * Send off activity statistics to the stats collector
@@ -339,6 +346,14 @@ BackgroundWriterMain(void)
 		}
 
 		/*
+		 * Sleep untill signalled by backend.
+		 */
+		WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
+
+		BgBufferSyncAndMoveBuffersToFreelist();
+
+
+		/*
 		 * Sleep until we are signaled or BgWriterDelay has elapsed.
 		 *
 		 * Note: the feedback control loop in BgBufferSync() expects that we
@@ -348,9 +363,9 @@ BackgroundWriterMain(void)
 		 * down with latch events that are likely to happen frequently during
 		 * normal operation.
 		 */
-		rc = WaitLatch(&MyProc->procLatch,
+		/*rc = WaitLatch(&MyProc->procLatch,
 					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-					   BgWriterDelay /* ms */ );
+					   BgWriterDelay  ms  );*/
 
 		/*
 		 * If no latch event and BgBufferSync says nothing's happening, extend
@@ -370,17 +385,17 @@ BackgroundWriterMain(void)
 		 * for two consecutive cycles.  Also, we mitigate any possible
 		 * consequences of a missed wakeup by not hibernating forever.
 		 */
-		if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
-		{
+		/*if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
+		{*/
 			/* Ask for notification at next buffer allocation */
-			StrategyNotifyBgWriter(&MyProc->procLatch);
+			/*StrategyNotifyBgWriter(&MyProc->procLatch);*/
 			/* Sleep ... */
-			rc = WaitLatch(&MyProc->procLatch,
+			/*rc = WaitLatch(&MyProc->procLatch,
 						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-						   BgWriterDelay * HIBERNATE_FACTOR);
+						   BgWriterDelay * HIBERNATE_FACTOR);*/
 			/* Reset the notification request in case we timed out */
-			StrategyNotifyBgWriter(NULL);
-		}
+			/*StrategyNotifyBgWriter(NULL);
+		}*/
 
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..7d4efed 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1635,6 +1635,41 @@ BgBufferSync(void)
 	return (bufs_to_lap == 0 && recent_alloc == 0);
 }
 
+void
+BgBufferSyncAndMoveBuffersToFreelist(void)
+{
+	uint32	next_to_clean;
+	uint32	num_to_free;
+	int			num_written;
+	volatile BufferDesc *bufHdr;
+
+	StrategySyncStartAndEnd(&next_to_clean, &num_to_free);
+
+	/* Make sure we can handle the pin inside SyncOneBuffer */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+	num_written = 0;
+
+	/* Execute the LRU scan */
+	while (num_to_free > 0)
+	{
+		int			buffer_state = SyncOneBuffer(next_to_clean, true);
+
+		bufHdr = &BufferDescriptors[next_to_clean];
+		if (++next_to_clean >= NBuffers)
+			next_to_clean = 0;
+		if (buffer_state & BUF_WRITTEN)
+			++num_written;
+		if (buffer_state & BUF_REUSABLE)
+		{
+			if (StrategyMoveBufferToFreeListEnd (bufHdr))
+				num_to_free--;
+		}
+	}
+
+	BgWriterStats.m_buf_written_clean += num_written;
+}
+
 /*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
@@ -1673,6 +1708,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 	else if (skip_recently_used)
 	{
 		/* Caller told us not to write recently-used buffers */
+		if (bufHdr->refcount == 0 && bufHdr->usage_count > 0)
+			bufHdr->usage_count--;
 		UnlockBufHdr(bufHdr);
 		return result;
 	}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..90e3f40 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -42,6 +43,10 @@ typedef struct
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
+	Latch	   *freelistLatch;  /* Latch to wake bgwriter */
+	/* protects freelist variables (firstFreeBuffer, lastFreeBuffer, numFreeListBuffers, BufferDesc->freeNext)*/
+	slock_t	     freelist_lck;
+
 	/*
 	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
 	 */
@@ -112,7 +117,6 @@ volatile BufferDesc *
 StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 {
 	volatile BufferDesc *buf;
-	Latch	   *bgwriterLatch;
 	int			trycounter;
 
 	/*
@@ -129,31 +133,16 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		}
 	}
 
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
-	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
-	 */
-	StrategyControl->numBufferAllocs++;
+	*lock_held = false;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * ideally numFreeListBuffers should get called under freelist
+	 * spinlock, however here we need this number for estimating
+	 * approximate number of free buffers required on freelist,
+	 * so it would be okay, even if numFreeListBuffers is not exact.
 	 */
-	bgwriterLatch = StrategyControl->bgwriterLatch;
-	if (bgwriterLatch)
-	{
-		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
+	if (StrategyControl->numFreeListBuffers < 200)
+		SetLatch(StrategyControl->freelistLatch);
 
 	/*
 	 * Try to get a buffer from the freelist.  Note that the freeNext fields
@@ -161,34 +150,51 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 	 * individual buffer spinlocks, so it's OK to manipulate them without
 	 * holding the spinlock.
 	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	for(;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
-		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 
-		/* Unconditionally remove buffer from freelist */
-		StrategyControl->firstFreeBuffer = buf->freeNext;
-		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		if (StrategyControl->firstFreeBuffer >= 0)
+		{
+			buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
-		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
-		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+			/* Unconditionally remove buffer from freelist */
+			StrategyControl->firstFreeBuffer = buf->freeNext;
+			buf->freeNext = FREENEXT_NOT_IN_LIST;
+			--StrategyControl->numFreeListBuffers;
+
+			SpinLockRelease(&StrategyControl->freelist_lck);
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.  (This can only happen if VACUUM put a
+			 * valid buffer in the freelist and then someone else used it before
+			 * we got to it.  It's probably impossible altogether as of 8.3, but
+			 * we'd better check anyway.)
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+		}
+		else
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			break;
 		}
-		UnlockBufHdr(buf);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
+	*lock_held = true;
+	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
 	for (;;)
 	{
 		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
@@ -196,7 +202,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		if (++StrategyControl->nextVictimBuffer >= NBuffers)
 		{
 			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
+			/*StrategyControl->completePasses++;*/
 		}
 
 		/*
@@ -241,7 +247,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,11 +259,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
 }
 
+
 /*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
@@ -287,6 +332,31 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 	return result;
 }
 
+void
+StrategySyncStartAndEnd(uint32 *start, uint32 *end)
+{
+	int			curfreebuffers;
+	int			reqfreebuffers;
+
+	/*
+	 * ideally numFreeListBuffers should get called under
+	 * freelist spin lock, however here we need this number for
+	 * estimating approximate number of free buffers required
+	 * on freelist, so it would be okay, even if numFreeListBuffers is not exact.
+	 */
+
+	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	*start = StrategyControl->nextVictimBuffer;
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	reqfreebuffers = 2000;
+	if (reqfreebuffers > curfreebuffers)
+		*end = reqfreebuffers - curfreebuffers;
+	else
+		*end = 0;
+	LWLockRelease(BufFreelistLock);
+	return;
+}
+
 /*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
@@ -309,6 +379,19 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
 }
 
 
+void
+StrategyInitFreeListLatch(Latch *bgwriterLatch)
+{
+	/*
+	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
+	 * infrequently, so there's no performance penalty from being safe.
+	 */
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	StrategyControl->freelistLatch= bgwriterLatch;
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
 /*
  * StrategyShmemSize
  *
@@ -376,6 +459,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -386,6 +470,8 @@ StrategyInitialize(bool init)
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->freelistLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
 	}
 	else
 		Assert(!init);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..05ff723 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -188,11 +188,14 @@ extern BufferDesc *LocalBufferDescriptors;
 extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 				  bool *lock_held);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void     StrategySyncStartAndEnd(uint32 *start, uint32 *end);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitFreeListLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..b0e5598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgBufferSyncAndMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 175fae3..fe86e07 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -136,10 +136,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS  4
+#define LOG2_NUM_LOCK_PARTITIONS  7
 #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
 
 /* Number of partitions the shared predicate lock tables are divided into */

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#2)

Re: Scaling shared buffer eviction

On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com>wrote:

Thrds (64) Thrds (128) HEAD 45562 17128 HEAD + 64 57904 32810 V1 + 64
105557 81011 HEAD + 128 58383 32997 V1 + 128 110705 114544

I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on it's own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.

In terms of ameliorating contention on the buffer mapping locks, I think it
would be better to replace the whole buffer mapping table with something
different. I started working on that almost 2 years ago, building a
hash-table that can be read without requiring any locks and written with,
well, less locking than what we have right now:

http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash

I never got quite as far as trying to hook that up to the buffer mapping
machinery, but maybe that would be worth doing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Peter Geoghegan

pg@heroku.com

over 11 years ago

In reply to: Amit Kapila (#2)

Re: Scaling shared buffer eviction

On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com>wrote:

shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

Thrds (64) Thrds (128) HEAD 45562 17128 HEAD + 64 57904 32810 V1 + 64
105557 81011 HEAD + 128 58383 32997 V1 + 128 110705 114544

shared_buffers= 8GB
scale factor = 1000
RAM - 64GB

Thrds (64) Thrds (128) HEAD 92142 31050 HEAD + 64 108120 86367 V1 + 64
117454 123429 HEAD + 128 107762 86902 V1 + 128 123641 124822

I'm having a little trouble following this. These figure are transactions
per second for a 300 second pgbench tpc-b run? What does "Thrds" denote?

--
Peter Geoghegan

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Peter Geoghegan (#4)

Re: Scaling shared buffer eviction

On Sat, May 17, 2014 at 6:29 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

I'm having a little trouble following this. These figure are transactions

per second for a 300 second pgbench tpc-b run?

Yes, the figures are tps for a 300 second run.
It is for select-only transactions.

What does "Thrds" denote?
It denotes number of threads (-j in pgbench run)

I have used below statements to take data
./pgbench -c 64 -j 64 -T 300 -S postgres
./pgbench -c 128 -j 128 -T 300 -S postgres

The reason for posting the numbers for 64/128 threads is because we have
mainly concurrency bottleneck when the number of connections are higher
than CPU cores and I am using 16 cores, 64 hardware threads m/c.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#3)

Re: Scaling shared buffer eviction

On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I haven't actually reviewed the code, but this sort of thing seems like

good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on it's own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.

In terms of ameliorating contention on the buffer mapping locks, I think

it would be better to replace the whole buffer mapping table with something
different.

Is there anything bad except for may be increase in LWLocks with scaling
hash partitions w.r.t to shared buffers either by auto tuning or by having a
configuration knob. I understand that it would be bit difficult for users
to
estimate the correct value of such a parameter, we can provide info about
its usage in docs such that if user increases shared buffers by 'X' (20
times)
of default value (128M), then consider increasing such partitions and it
should
be always power of 2 or does something similar to above internally in code.

I agree that may be even by having a reasonably good estimate of number of
partitions w.r.t shared buffers, we might not be able to eliminate the
contention
around BufMappingLocks, but I think the scalability we get by doing that is
not
bad either.

I started working on that almost 2 years ago, building a hash-table that

can be read without requiring any locks and written with, well, less
locking than what we have right now:

I have still not read the complete code, but by just going through initial
file
header, it seems to me that it will be much better than current
implementation in terms of concurrency, by the way does such an
implementation can extend to reducing scalability for hash indexes as well?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#3)

3 attachment(s)

Re: Scaling shared buffer eviction

On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Thrds (64) Thrds (128) HEAD 45562 17128 HEAD + 64 57904 32810 V1 + 64
105557 81011 HEAD + 128 58383 32997 V1 + 128 110705 114544

I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on it's own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffer's in free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakesup and begin

adding buffer's to freelist until it reaches high threshold and then

again goes back to sleep.

b. New stats for number of buffers on freelist has been added, some
old one's like maxwritten_clean can be removed as new logic for
syncing buffers and moving them to free list doesn't use them.
However I think it's better to remove them once the new logic is
accepted. Added some new logs for info related to free list under
BGW_DEBUG

c. Used the already existing bgwriterLatch in BufferStrategyControl to
wake bgwriter when number of buffer's in freelist drops below
threshold.

d. Autotune the low and high threshold for freelist for various
configurations. Generally if keep small number (200~2000) of buffers
always available on freelist, then even for high shared buffers
like 15GB, it appears to be sufficient. However when the value
of shared buffer's is less, then we need much smaller number. I
think we can provide these as config knobs for user as well, but for
now based on LWLOCK_STATS result, I have chosen some hard
coded values for low and high threshold values for freelist.
Values for low and high threshold have been decided based on total
number of shared buffers, basically I have divided them into 5
categories (16~100, 100~1000, 1000~10000, 10000~100000,
100000 and above) and then ran tests(read-only pgbench) for various
configurations falling under these categories. The reason for keeping
lesser categories for larger shared buffers is that if there are small
number (200~2000) of buffers available on free list, then it seems to
be sufficient for quite high loads, however as the total number of
shared
buffer's decreases we need to be more careful as if we keep the number
as
too low then it will lead to more clock sweep by backends (which means
freelist lock contention) and if we keep number higher bgwriter will
evict
many useful buffers. Results based on LWLOCK_STATS is at end of mail.

e. One reason why I think number of buf-partitions is hard-coded to 16 is
that
minimum number of shared buffers allowed are 16 (128kb). However,
there
is handling in code (in function init_htab()) which ensure that even
if number
of partitions are more that shared buffers, it handles it safely.

I have checked the bgwriter CPU usage with and without patch
for various configurations and the observation is that for most of the
loads bgwriter's CPU usage after patch is between 8~20% and in
HEAD it is 0~2%. It shows that with patch when shared buffers
are under use by backends, bgwriter is constantly doing work to
ease the work of backends. Detailed data is provided later in the
mail.

Performance Data:
-------------------------------

Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads

RAM = 64GB

Database Locale =C

checkpoint_segments=256

checkpoint_timeout =15min

shared_buffers=8GB

scale factor = 3000

Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)

Duration of each individual run = 5mins

Client Count/patch_ver (tps) 8 16 32 64 128 Head 26220 48686 70779 45232
17310 Patch 26402 50726 75574 111468 114521

Data is taken by using script (pert_buff_mgmt.sh) attached with mail.
This data is read-only pgbench data with different number of client
connections. All the numbers are in tps. This data is median of 3
5 min pgbench read-only runs. Please find the detailed data for 3 runs
in attached open office document (perf_read_scalability_data_v3.ods)

This data clearly shows that patch has improved improved performance
upto 5~6 times.

Results of BGwriter CPU usage:
--------------------------------------------------

Here sc is scale factor and sb is shared buffers and the data is
for read-only pgbench runs.

./pgbench -c 64 - j 64 -S -T 300 postgres
sc - 3000, sb - 8GB
HEAD
CPU usage - 0~2.3%
Patch v_3
CPU usage - 8.6%

sc - 100, sb - 128MB
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
CPU Usage - 1~2%
tps- 36199.047132
Patch v_3
CPU usage - 12~13%
tps = 109182.681827

sc - 50, sb - 75MB
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
CPU Usage - 0.7~2%
tps- 37760.575128
Patch v_3
CPU usage - 20~22%
tps = 106310.744198

./pgbench -c 16 - j 16 -S -T 300 postgres
sc - 100, sb - 128kb
--need to change pgbench for this.
HEAD
CPU Usage - 0~0.3%
tps- 40979.529254
Patch v_3
CPU usage - 35~40%
tps = 42956.785618

Results of LWLOCK_STATS based on low-high threshold values of freelist:
--------------------------------------------------------------------------------------------------------------

In the results, values of exacq and blk shows the contention on freelist
lock.
sc is scale factor and sb is number of shared_buffers. Below results shows
that for all except one (1MB) of configuration the contention around
buffreelist
lock is reduced significantly. For 1MB case also, it has reduced exacq
count
which shows that it has performed clock sweep lesser number of times.

sc - 3000, sb - 15GB --(sb > 100000)
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
PID 4406 lwlock main 0: shacq 0 exacq 84482 blk 5139 spindelay 62
Patch v_3
PID 4864 lwlock main 0: shacq 0 exacq 34 blk 1 spindelay 0

sc - 3000, sb - 8GB --(sb > 100000)
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
PID 24124 lwlock main 0: shacq 0 exacq 285155 blk 33910 spindelay 548
Patch v_3
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

sc - 100, sb - 768MB --(sb > 10000)
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
PID 9144 lwlock main 0: shacq 0 exacq 284636 blk 34091 spindelay 555
Patch v-3 (lw=100,hg=1000)
PID 9428 lwlock main 0: shacq 0 exacq 306 blk 59 spindelay 0

sc - 100, sb - 128MB --(sb > 10000)
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
PID 5405 lwlock main 0: shacq 0 exacq 285449 blk 32345 spindelay 714
Patch v-3
PID 8625 lwlock main 0: shacq 0 exacq 740 blk 178 spindelay 0

sc - 50, sb - 75MB --(sb > 1000)
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
PID 12681 lwlock main 0: shacq 0 exacq 289347 blk 34064 spindelay 773
Patch v3
PID 12800 lwlock main 0: shacq 0 exacq 76287 blk 15183 spindelay 28

sc - 50, sb - 10MB --(sb > 1000)
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
PID 10283 lwlock main 0: shacq 0 exacq 287500 blk 32177 spindelay 864
Patch v3 (for > 1000, lw = 50 hg =200)
PID 11629 lwlock main 0: shacq 0 exacq 60139 blk 12978 spindelay 40

sc - 1, sb - 7MB --(sb > 100)
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
PID 47127 lwlock main 0: shacq 0 exacq 289462 blk 37057 spindelay 119
Patch v3
PID 47283 lwlock main 0: shacq 0 exacq 9507 blk 1656 spindelay 0

sc - 1, sb - 1MB --(sb > 100)
./pgbench -c 64 - j 64 -S -T 300 postgres
HEAD
PID 43215 lwlock main 0: shacq 0 exacq 301384 blk 36740 spindelay 902
Patch v3
PID 46542 lwlock main 0: shacq 0 exacq 197231 blk 37532 spindelay 294

sc - 100, sb - 128kb focus(sb > 16)
./pgbench -c 16 - j 16 -S -T 300 postgres (for this, I need to reduce value
of naccounts to 2500, else it was always giving no unpinned buffers
available)
HEAD
PID 49751 lwlock main 0: shacq 0 exacq 1821276 blk 130119 spindelay 7
Patch v3
PID 50768 lwlock main 0: shacq 0 exacq 382610 blk 46543 spindelay 1

More Datapoints and work:
a. I have yet to take data by merging it with scalable lwlock patch of
Andres (https://commitfest.postgresql.org/action/patch_view?id=1313).
There are many conflicts in the patch, so waiting for an updated patch.
b. Read-only data for more configurations.
c. Data for Write work load (tpc-b of pgbench, Bulk insert (Copy))
d. Update docs and Remove unused code.

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v3.patchapplication/octet-stream; name=scalable_buffer_eviction_v3.patchDownload

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..ae4237d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -67,12 +67,6 @@
 int			BgWriterDelay = 200;
 
 /*
- * Multiplier to apply to BgWriterDelay when we decide to hibernate.
- * (Perhaps this needs to be configurable?)
- */
-#define HIBERNATE_FACTOR			50
-
-/*
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
@@ -111,7 +105,6 @@ BackgroundWriterMain(void)
 {
 	sigjmp_buf	local_sigjmp_buf;
 	MemoryContext bgwriter_context;
-	bool		prev_hibernate;
 
 	/*
 	 * If possible, make this process a group leader, so that the postmaster
@@ -246,19 +239,15 @@ BackgroundWriterMain(void)
 	 */
 	PG_SETMASK(&UnBlockSig);
 
-	/*
-	 * Reset hibernation state after any error.
-	 */
-	prev_hibernate = false;
+	/* Initialize the freelist latch. */
+	StrategyInitBgWriterLatch(&MyProc->procLatch);
 
 	/*
 	 * Loop forever
 	 */
 	for (;;)
 	{
-		bool		can_hibernate;
 		int			rc;
-
 		/* Clear any already-pending wakeups */
 		ResetLatch(&MyProc->procLatch);
 
@@ -279,9 +268,25 @@ BackgroundWriterMain(void)
 		}
 
 		/*
-		 * Do one cycle of dirty-buffer writing.
+		 * Sleep untill signalled by backend or LOG_SNAPSHOT_INTERVAL_MS has
+		 * elapsed.
+		 *
+		 * Backend will signal bgwriter when the number of buffers in
+		 * freelist fall below than low threshhold of freelist.  We need
+		 * to wake bgwriter after LOG_SNAPSHOT_INTERVAL_MS to ensure that
+		 * it can log information about xl_running_xacts.
 		 */
-		can_hibernate = BgBufferSync();
+		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+			rc = WaitLatch(&MyProc->procLatch,
+						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						   LOG_SNAPSHOT_INTERVAL_MS);
+		else
+			rc = WaitLatch(&MyProc->procLatch,
+						   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+						   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgBufferSyncAndMoveBuffersToFreelist();
 
 		/*
 		 * Send off activity statistics to the stats collector
@@ -318,7 +323,9 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if ((rc & WL_TIMEOUT || rc & WL_LATCH_SET) &&
+			XLogStandbyInfoActive() &&
+			!RecoveryInProgress())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
@@ -339,57 +346,11 @@ BackgroundWriterMain(void)
 		}
 
 		/*
-		 * Sleep until we are signaled or BgWriterDelay has elapsed.
-		 *
-		 * Note: the feedback control loop in BgBufferSync() expects that we
-		 * will call it every BgWriterDelay msec.  While it's not critical for
-		 * correctness that that be exact, the feedback loop might misbehave
-		 * if we stray too far from that.  Hence, avoid loading this process
-		 * down with latch events that are likely to happen frequently during
-		 * normal operation.
-		 */
-		rc = WaitLatch(&MyProc->procLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-					   BgWriterDelay /* ms */ );
-
-		/*
-		 * If no latch event and BgBufferSync says nothing's happening, extend
-		 * the sleep in "hibernation" mode, where we sleep for much longer
-		 * than bgwriter_delay says.  Fewer wakeups save electricity.  When a
-		 * backend starts using buffers again, it will wake us up by setting
-		 * our latch.  Because the extra sleep will persist only as long as no
-		 * buffer allocations happen, this should not distort the behavior of
-		 * BgBufferSync's control loop too badly; essentially, it will think
-		 * that the system-wide idle interval didn't exist.
-		 *
-		 * There is a race condition here, in that a backend might allocate a
-		 * buffer between the time BgBufferSync saw the alloc count as zero
-		 * and the time we call StrategyNotifyBgWriter.  While it's not
-		 * critical that we not hibernate anyway, we try to reduce the odds of
-		 * that by only hibernating when BgBufferSync says nothing's happening
-		 * for two consecutive cycles.  Also, we mitigate any possible
-		 * consequences of a missed wakeup by not hibernating forever.
-		 */
-		if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
-		{
-			/* Ask for notification at next buffer allocation */
-			StrategyNotifyBgWriter(&MyProc->procLatch);
-			/* Sleep ... */
-			rc = WaitLatch(&MyProc->procLatch,
-						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-						   BgWriterDelay * HIBERNATE_FACTOR);
-			/* Reset the notification request in case we timed out */
-			StrategyNotifyBgWriter(NULL);
-		}
-
-		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
 		 * necessity for manual cleanup of all postmaster children.
 		 */
 		if (rc & WL_POSTMASTER_DEATH)
 			exit(1);
-
-		prev_hibernate = can_hibernate;
 	}
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f864816..396ac4d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5028,6 +5028,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.buf_freelist += msg->m_buf_freelist;
 }
 
 /* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..c052914 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1636,6 +1636,65 @@ BgBufferSync(void)
 }
 
 /*
+ * Write out some dirty buffers in the pool and maintain enough
+ * number of buffers in freelist (equal to high threshold for
+ * freelsit), so that backend's don't need to perform clock sweep
+ * often.
+ *
+ * This is called by the background writer process when the number
+ * of buffers in freelist fall below low threshold of freelist.
+ */
+void
+BgBufferSyncAndMoveBuffersToFreelist(void)
+{
+	uint32	next_to_clean;
+	uint32	num_to_free;
+	uint32	tmp_num_to_free;
+	uint32	recent_alloc;
+	int		num_written;
+	int		num_freelist;
+	volatile BufferDesc *bufHdr;
+
+	num_freelist = StrategySyncStartAndEnd(&next_to_clean,
+										   &num_to_free,
+										   &recent_alloc);
+
+	/* Report buffer alloc and buffer freelist counts to pgstat */
+	BgWriterStats.m_buf_alloc += recent_alloc;
+	BgWriterStats.m_buf_freelist += num_freelist;
+
+	/* Make sure we can handle the pin inside SyncOneBuffer */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+	num_written = 0;
+	tmp_num_to_free = num_to_free;
+
+	/* Execute the LRU scan */
+	while (tmp_num_to_free > 0)
+	{
+		int			buffer_state = SyncOneBuffer(next_to_clean, true);
+
+		bufHdr = &BufferDescriptors[next_to_clean];
+		if (++next_to_clean >= NBuffers)
+			next_to_clean = 0;
+		if (buffer_state & BUF_WRITTEN)
+			++num_written;
+		if (buffer_state & BUF_REUSABLE)
+		{
+			if (StrategyMoveBufferToFreeListEnd (bufHdr))
+				tmp_num_to_free--;
+		}
+	}
+
+	BgWriterStats.m_buf_written_clean += num_written;
+
+#ifdef BGW_DEBUG
+	elog(LOG, "bgwriter: recent_alloc=%u num_freelist=%u wrote=%d num_freed=%u",
+		 recent_alloc, num_freelist, num_written, num_to_free);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
@@ -1672,7 +1731,13 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 		result |= BUF_REUSABLE;
 	else if (skip_recently_used)
 	{
-		/* Caller told us not to write recently-used buffers */
+		/*
+		 * Caller told us not to write recently-used buffers and
+		 * reduce usage count, so that it can find the reusable
+		 * buffers in consecutive cycles.
+		 */
+		if (bufHdr->refcount == 0 && bufHdr->usage_count > 0)
+			bufHdr->usage_count--;
 		UnlockBufHdr(bufHdr);
 		return result;
 	}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..3edc0e9 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -43,7 +44,13 @@ typedef struct
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * protects freelist variables (firstFreeBuffer, lastFreeBuffer,
+	 * numFreeListBuffers, BufferDesc->freeNext).
+	 */
+	slock_t	     freelist_lck;
+
+	/*
+	 * Latch to wake bgwriter.
 	 */
 	Latch	   *bgwriterLatch;
 } BufferStrategyControl;
@@ -112,7 +119,6 @@ volatile BufferDesc *
 StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 {
 	volatile BufferDesc *buf;
-	Latch	   *bgwriterLatch;
 	int			trycounter;
 
 	/*
@@ -129,66 +135,78 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		}
 	}
 
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
 	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
+	 * We count buffer allocation requests so that the bgwriter can know
+	 * the rate of buffer consumption and report it as stats.  Note that
+	 * buffers recycled by a strategy object are intentionally not counted
+	 * here.
 	 */
 	StrategyControl->numBufferAllocs++;
+	*lock_held = false;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * Ideally numFreeListBuffers should get called under freelist spinlock,
+	 * however here we need this number for estimating approximate number of
+	 * free buffers required on freelist, so it should not be a problem, even
+	 * if numFreeListBuffers is not exact.  bgwriterLatch is initialized in
+	 * early phase of BgWriter startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
 	 */
-	bgwriterLatch = StrategyControl->bgwriterLatch;
-	if (bgwriterLatch)
-	{
-		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
+	if (StrategyControl->numFreeListBuffers < freelistLowThreshold &&
+		StrategyControl->bgwriterLatch)
+		SetLatch(StrategyControl->bgwriterLatch);
 
 	/*
 	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
+	 * are considered to be protected by the freelist_lck not the
 	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
+	 * holding the buffer spinlock.
 	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	for(;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
-		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 
-		/* Unconditionally remove buffer from freelist */
-		StrategyControl->firstFreeBuffer = buf->freeNext;
-		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		if (StrategyControl->firstFreeBuffer >= 0)
+		{
+			buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
-		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
-		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+			/* Unconditionally remove buffer from freelist */
+			StrategyControl->firstFreeBuffer = buf->freeNext;
+			buf->freeNext = FREENEXT_NOT_IN_LIST;
+			--StrategyControl->numFreeListBuffers;
+
+			SpinLockRelease(&StrategyControl->freelist_lck);
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.  (This can only happen if VACUUM put a
+			 * valid buffer in the freelist and then someone else used it before
+			 * we got to it.  It's probably impossible altogether as of 8.3, but
+			 * we'd better check anyway.)
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+		}
+		else
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			break;
 		}
-		UnlockBufHdr(buf);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
+	*lock_held = true;
+	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
 	for (;;)
 	{
 		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
@@ -196,7 +214,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		if (++StrategyControl->nextVictimBuffer >= NBuffers)
 		{
 			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
+			/*StrategyControl->completePasses++;*/
 		}
 
 		/*
@@ -241,7 +259,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,12 +271,51 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
 /*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
+}
+
+
+/*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
  * The result is the buffer index of the best buffer to sync first.
@@ -288,6 +345,46 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 }
 
 /*
+ * StrategySyncStartAndEnd -- tell BgWriter where to start looking
+ * for unused buffers.
+ *
+ * The result is the buffer index of the best buffer to start looking for
+ * unused buffers, number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs.
+ *
+ * In addition, we return the number of of buffers on freelist.
+ */
+int
+StrategySyncStartAndEnd(uint32 *start, uint32 *end, uint32 *num_buf_alloc)
+{
+	int			curfreebuffers;
+
+	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	*start = StrategyControl->nextVictimBuffer;
+	LWLockRelease(BufFreelistLock);
+
+	/*
+	 * Ideally numFreeListBuffers should get called under freelist spinlock,
+	 * however here we need this number for estimating approximate number of
+	 * free buffers required on freelist, so it should not be a problem, even
+	 * if numFreeListBuffers is not exact.
+	 */
+
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighThreshold)
+		*end = freelistHighThreshold - curfreebuffers;
+	else
+		*end = 0;
+	if (num_buf_alloc)
+	{
+		*num_buf_alloc = StrategyControl->numBufferAllocs;
+		StrategyControl->numBufferAllocs = 0;
+	}
+
+	return curfreebuffers;
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -309,6 +406,12 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
 }
 
 
+void
+StrategyInitBgWriterLatch(Latch *bgwriterLatch)
+{
+	StrategyControl->bgwriterLatch = bgwriterLatch;
+}
+
 /*
  * StrategyShmemSize
  *
@@ -376,6 +479,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -386,9 +490,42 @@ StrategyInitialize(bool init)
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high threshold number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.
+	 */
+	if (NBuffers > 100000)
+	{
+		freelistLowThreshold = 200;
+		freelistHighThreshold = 2000;
+	}
+	else if (NBuffers > 10000)
+	{
+		freelistLowThreshold = 100;
+		freelistHighThreshold = 1000;
+	}
+	else if (NBuffers > 1000)
+	{
+		freelistLowThreshold = 50;
+		freelistHighThreshold = 200;
+	}
+	else if (NBuffers > 100)
+	{
+		freelistLowThreshold = 30;
+		freelistHighThreshold = 75;
+	}
+	else
+	{
+		freelistLowThreshold = 5;
+		freelistHighThreshold = 15;
+	}
 }
 
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d9de09f..a87954a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -389,6 +389,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_backend;
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
+	PgStat_Counter m_buf_freelist;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -537,7 +538,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9D
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -662,6 +663,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	PgStat_Counter buf_freelist;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..9eb7be6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -161,6 +161,16 @@ typedef struct sbufdesc
 #define FREENEXT_NOT_IN_LIST	(-2)
 
 /*
+ * Threshold indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low threshold, the
+ * allocating backend sets the latch and bgwriter wakesup and begin
+ * adding buffer's to freelist until it reaches high threshold and then
+ * again goes back to sleep.
+ */
+int freelistLowThreshold;
+int freelistHighThreshold;
+
+/*
  * Macros for acquiring/releasing a shared buffer header's spinlock.
  * Do not apply these to local buffers!
  *
@@ -188,11 +198,15 @@ extern BufferDesc *LocalBufferDescriptors;
 extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 				  bool *lock_held);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern int  StrategySyncStartAndEnd(uint32 *start, uint32 *end,
+									uint32 *num_buf_alloc);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgWriterLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..b0e5598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgBufferSyncAndMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 175fae3..fe86e07 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -136,10 +136,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS  4
+#define LOG2_NUM_LOCK_PARTITIONS  7
 #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
 
 /* Number of partitions the shared predicate lock tables are divided into */

perf_buff_mgmt.shapplication/x-sh; name=perf_buff_mgmt.shDownload

perf_read_scalability_data_v3.odsapplication/vnd.oasis.opendocument.spreadsheet; name=perf_read_scalability_data_v3.odsDownload

Kevin Grittner

kgrittn@ymail.com

over 11 years ago

In reply to: Amit Kapila (#7)

Re: Scaling shared buffer eviction

Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
    removed the hibernate logic as bgwriter will now work only when
    there is scarcity of buffer's in free list. Basic idea is when the
    number of buffers on freelist drops below the low threshold, the
    allocating backend sets the latch and bgwriter wakesup and begin
    adding buffer's to freelist until it reaches high threshold and then
    again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me. My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls. By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page
was always push out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Kevin Grittner (#8)

Re: Scaling shared buffer eviction

On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffer's in free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakesup and begin
adding buffer's to freelist until it reaches high threshold and then
again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me. My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls. By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.

Essentially I was able to tune the bgwriter so that a dirty page
was always push out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter.

I think it would have been better if bgwriter does writes based on
the amount of buffer's that get dirtied to achieve the balance of
writes.

Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

I agree that for some cases as explained by you, the current bgwriter
logic does satisfy the need, however there are other cases as well
where actually it doesn't help much, one of such cases I am trying to
improve (ease backend buffer allocations) and other may be when
there is constant write activity for which I am not sure how much it
really helps. Part of the reason for trying to make bgwriter respond
mainly to ease backend allocations is the previous discussion for
the same, refer below link:
/messages/by-id/CA+TgmoZ7dvhC4h-ffJmZCff6VWyNfOEAPZ021VxW61uH46R3QA@mail.gmail.com

However if we want to retain current property of bgwriter, we can do
the same by one of below ways:
a. Have separate processes for writing dirty buffers and moving buffers
to freelist.
b. In the current bgwriter, separate the two works based on the need.
The need can be decided based on whether bgwriter has been waked
due to shortage of buffers on free list or if it has been waked due to
BgWriterDelay.

Now as populating freelist and balance writes by writing dirty buffers
are two separate responsibilities, so not sure if doing that by one
process is a good idea.

I am planing to take some more performance data, part of which will
be write load as well, but I am now sure if that can anyway show the
need as mentioned by you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#10

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Kevin Grittner (#8)

Re: Scaling shared buffer eviction

On Sun, Jun 8, 2014 at 9:51 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffer's in free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakesup and begin
adding buffer's to freelist until it reaches high threshold and then
again goes back to sleep.

The numbers from your benchmarks are very exciting, but the above
concerns me. My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist,

Just to be clear, prior to this patch, the bgwriter has never been in
the business of putting pages on the freelist in the first place, so
it wouldn't have been possible for you to tune for that.

Essentially I was able to tune the bgwriter so that a dirty page
was always push out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

I think, as Amit says downthread, that the crucial design question
here is whether we need two processes, one to populate the freelist so
that regular backends don't need to run the clock sweep, and a second
to flush dirty buffers, or whether a single process can serve both
needs. In favor of a single process, many people have commented that
the background writer doesn't seem to do much right now. If the
process is mostly sitting around idle, then giving it more
responsibilities might be OK. In favor of having a second process,
I'm a little concerned that if the background writer gets busy writing
a page, it might then be unavailable to populate the freelist until it
finishes, which might be a very long time relative to the buffer
allocation needs of other backends. I'm not sure what the right
answer is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#9)

1 attachment(s)

Re: Scaling shared buffer eviction

On Mon, Jun 9, 2014 at 9:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.

I am planing to take some more performance data, part of which will
be write load as well, but I am now sure if that can anyway show the
need as mentioned by you.

After taking the performance data for write load using tpc-b with the
patch, I found that there is a regression in it. So I went ahead and
tried to figure out the reason for same and found that after patch,
Bgwriter started flushing buffers which were required by backends
and reason was that *nextVictimBuffer* was not getting updated
properly while we are running clock sweep kind of logic (decrement
the usage count when number of buffers on freelist fall below low
threshhold value) in Bgwriter. In HEAD, I noticed that at default
settings, BGwriter was not at all flushing any buffers which is at least
better than what my patch was doing (flushing buffers required by
backend).

So I tried to fix the issue by updating *nextVictimBuffer* in new
BGWriter logic and results are positive.

sbe - scalable buffer eviction

Select only Data
Client count/TPS64128Un-patched4523217310sbe_v3111468114521sbe_v4153137
160752
TPC-B

Client count/TPS
64128Un-patched825784sbe_v4814845

For Select Data, I am quite confident that it will improve if we introduce
nextVictimBuffer increments in BGwriter and rather it scales much better
with that change, however for TPC-B, I am getting fluctuation in data,
so not sure it has eliminated the problem. The main difference is that in
HEAD, BGwriter never increments nextVictimBuffer during syncing the
buffers, it just notes down the current setting before start and then
proceeds sequentially.

I think it will be good if we can have a new process for moving buffers to
free list due to below reasons:

a. while trying to move buffers to freelist, it should not block due
to in between write activity.
b. The writer should not increment nextVictimBuffer and maintain
the current logic.

One significant change in this version of patch is to use a separate
spin lock to protect nextVictimBuffer rather than using BufFreelistLock.

Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v4.patchapplication/octet-stream; name=scalable_buffer_eviction_v4.patchDownload

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..ae4237d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -67,12 +67,6 @@
 int			BgWriterDelay = 200;
 
 /*
- * Multiplier to apply to BgWriterDelay when we decide to hibernate.
- * (Perhaps this needs to be configurable?)
- */
-#define HIBERNATE_FACTOR			50
-
-/*
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
@@ -111,7 +105,6 @@ BackgroundWriterMain(void)
 {
 	sigjmp_buf	local_sigjmp_buf;
 	MemoryContext bgwriter_context;
-	bool		prev_hibernate;
 
 	/*
 	 * If possible, make this process a group leader, so that the postmaster
@@ -246,19 +239,15 @@ BackgroundWriterMain(void)
 	 */
 	PG_SETMASK(&UnBlockSig);
 
-	/*
-	 * Reset hibernation state after any error.
-	 */
-	prev_hibernate = false;
+	/* Initialize the freelist latch. */
+	StrategyInitBgWriterLatch(&MyProc->procLatch);
 
 	/*
 	 * Loop forever
 	 */
 	for (;;)
 	{
-		bool		can_hibernate;
 		int			rc;
-
 		/* Clear any already-pending wakeups */
 		ResetLatch(&MyProc->procLatch);
 
@@ -279,9 +268,25 @@ BackgroundWriterMain(void)
 		}
 
 		/*
-		 * Do one cycle of dirty-buffer writing.
+		 * Sleep untill signalled by backend or LOG_SNAPSHOT_INTERVAL_MS has
+		 * elapsed.
+		 *
+		 * Backend will signal bgwriter when the number of buffers in
+		 * freelist fall below than low threshhold of freelist.  We need
+		 * to wake bgwriter after LOG_SNAPSHOT_INTERVAL_MS to ensure that
+		 * it can log information about xl_running_xacts.
 		 */
-		can_hibernate = BgBufferSync();
+		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+			rc = WaitLatch(&MyProc->procLatch,
+						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						   LOG_SNAPSHOT_INTERVAL_MS);
+		else
+			rc = WaitLatch(&MyProc->procLatch,
+						   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+						   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgBufferSyncAndMoveBuffersToFreelist();
 
 		/*
 		 * Send off activity statistics to the stats collector
@@ -318,7 +323,9 @@ BackgroundWriterMain(void)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+		if ((rc & WL_TIMEOUT || rc & WL_LATCH_SET) &&
+			XLogStandbyInfoActive() &&
+			!RecoveryInProgress())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
@@ -339,57 +346,11 @@ BackgroundWriterMain(void)
 		}
 
 		/*
-		 * Sleep until we are signaled or BgWriterDelay has elapsed.
-		 *
-		 * Note: the feedback control loop in BgBufferSync() expects that we
-		 * will call it every BgWriterDelay msec.  While it's not critical for
-		 * correctness that that be exact, the feedback loop might misbehave
-		 * if we stray too far from that.  Hence, avoid loading this process
-		 * down with latch events that are likely to happen frequently during
-		 * normal operation.
-		 */
-		rc = WaitLatch(&MyProc->procLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-					   BgWriterDelay /* ms */ );
-
-		/*
-		 * If no latch event and BgBufferSync says nothing's happening, extend
-		 * the sleep in "hibernation" mode, where we sleep for much longer
-		 * than bgwriter_delay says.  Fewer wakeups save electricity.  When a
-		 * backend starts using buffers again, it will wake us up by setting
-		 * our latch.  Because the extra sleep will persist only as long as no
-		 * buffer allocations happen, this should not distort the behavior of
-		 * BgBufferSync's control loop too badly; essentially, it will think
-		 * that the system-wide idle interval didn't exist.
-		 *
-		 * There is a race condition here, in that a backend might allocate a
-		 * buffer between the time BgBufferSync saw the alloc count as zero
-		 * and the time we call StrategyNotifyBgWriter.  While it's not
-		 * critical that we not hibernate anyway, we try to reduce the odds of
-		 * that by only hibernating when BgBufferSync says nothing's happening
-		 * for two consecutive cycles.  Also, we mitigate any possible
-		 * consequences of a missed wakeup by not hibernating forever.
-		 */
-		if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
-		{
-			/* Ask for notification at next buffer allocation */
-			StrategyNotifyBgWriter(&MyProc->procLatch);
-			/* Sleep ... */
-			rc = WaitLatch(&MyProc->procLatch,
-						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
-						   BgWriterDelay * HIBERNATE_FACTOR);
-			/* Reset the notification request in case we timed out */
-			StrategyNotifyBgWriter(NULL);
-		}
-
-		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
 		 * necessity for manual cleanup of all postmaster children.
 		 */
 		if (rc & WL_POSTMASTER_DEATH)
 			exit(1);
-
-		prev_hibernate = can_hibernate;
 	}
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3ab1428..d82667b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5020,6 +5020,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.buf_freelist += msg->m_buf_freelist;
 }
 
 /* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..5b8975b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1637,10 +1637,75 @@ BgBufferSync(void)
 }
 
 /*
+ * Write out some dirty buffers in the pool and maintain enough
+ * number of buffers in freelist (equal to high threshold for
+ * freelsit), so that backend's don't need to perform clock sweep
+ * often.
+ *
+ * This is called by the background writer process when the number
+ * of buffers in freelist fall below low threshold of freelist.
+ */
+void
+BgBufferSyncAndMoveBuffersToFreelist(void)
+{
+	volatile uint32	next_to_clean;
+	uint32	num_to_free;
+	uint32	tmp_num_to_free;
+	uint32  save_next_to_clean;
+	uint32	recent_alloc;
+	int		num_written;
+	int		num_freelist;
+	volatile BufferDesc *bufHdr;
+
+	num_freelist = StrategySyncStartAndEnd(&save_next_to_clean,
+										   &num_to_free,
+										   &recent_alloc);
+
+	/* Report buffer alloc and buffer freelist counts to pgstat */
+	BgWriterStats.m_buf_alloc += recent_alloc;
+	BgWriterStats.m_buf_freelist += num_freelist;
+
+	/* Make sure we can handle the pin inside SyncOneBuffer */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+	num_written = 0;
+	tmp_num_to_free = num_to_free;
+	next_to_clean = save_next_to_clean;
+
+	/* Execute the LRU scan */
+	while (tmp_num_to_free > 0)
+	{
+		int			buffer_state = SyncOneBuffer(next_to_clean, true);
+
+		bufHdr = &BufferDescriptors[next_to_clean];
+
+		/* choose next victim buffer to clean. */
+		StrategySyncNextVictimBuffer(&next_to_clean);
+		if (buffer_state & BUF_WRITTEN)
+			++num_written;
+		if (buffer_state & BUF_REUSABLE)
+		{
+			if (StrategyMoveBufferToFreeListEnd (bufHdr))
+				tmp_num_to_free--;
+		}
+	}
+
+	BgWriterStats.m_buf_written_clean += num_written;
+
+#ifdef BGW_DEBUG
+	elog(DEBUG1, "bgwriter: recent_alloc=%u num_freelist=%u  next_to_clean=%d wrote=%d num_freed=%u",
+		 recent_alloc, num_freelist, save_next_to_clean, num_written,
+		 num_to_free);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
- * If skip_recently_used is true, we don't write currently-pinned buffers, nor
- * buffers marked recently used, as these are not replacement candidates.
+ * If skip_recently_used is true, we decrement the usage count, so that
+ * we can find reusable buffers in consecutive cycles, also we don't write
+ * currently-pinned buffers, nor buffers marked recently used, as these are
+ * not replacement candidates.
  *
  * Returns a bitmask containing the following flag bits:
  *	BUF_WRITTEN: we wrote the buffer.
@@ -1673,7 +1738,13 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 		result |= BUF_REUSABLE;
 	else if (skip_recently_used)
 	{
-		/* Caller told us not to write recently-used buffers */
+		/*
+		 * Caller told us not to write recently-used buffers and
+		 * reduce usage count, so that it can find the reusable
+		 * buffers in consecutive cycles.
+		 */
+		if (bufHdr->refcount == 0 && bufHdr->usage_count > 0)
+			bufHdr->usage_count--;
 		UnlockBufHdr(bufHdr);
 		return result;
 	}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..358f35c 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -43,7 +44,21 @@ typedef struct
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * protects freelist variables (firstFreeBuffer, lastFreeBuffer,
+	 * numFreeListBuffers, BufferDesc->freeNext).
+	 */
+	slock_t	     freelist_lck;
+
+	/*
+	 * Protects nextVictimBuffer. We need separate lock to protect
+	 * victim buffer so that clock sweep of one backend doesn't
+	 * contend with another backend which is evicting buffer from
+	 * freelist.
+	 */
+	slock_t	     victimbuf_lck;
+
+	/*
+	 * Latch to wake bgwriter.
 	 */
 	Latch	   *bgwriterLatch;
 } BufferStrategyControl;
@@ -112,7 +127,6 @@ volatile BufferDesc *
 StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 {
 	volatile BufferDesc *buf;
-	Latch	   *bgwriterLatch;
 	int			trycounter;
 
 	/*
@@ -129,76 +143,92 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		}
 	}
 
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
 	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
+	 * We count buffer allocation requests so that the bgwriter can know
+	 * the rate of buffer consumption and report it as stats.  Note that
+	 * buffers recycled by a strategy object are intentionally not counted
+	 * here.
 	 */
 	StrategyControl->numBufferAllocs++;
+	*lock_held = false;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * Ideally numFreeListBuffers should get called under freelist spinlock,
+	 * however here we need this number for estimating approximate number of
+	 * free buffers required on freelist, so it should not be a problem, even
+	 * if numFreeListBuffers is not exact.  bgwriterLatch is initialized in
+	 * early phase of BgWriter startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
 	 */
-	bgwriterLatch = StrategyControl->bgwriterLatch;
-	if (bgwriterLatch)
-	{
-		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
+	if (StrategyControl->numFreeListBuffers < freelistLowThreshold &&
+		StrategyControl->bgwriterLatch)
+		SetLatch(StrategyControl->bgwriterLatch);
 
 	/*
 	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
+	 * are considered to be protected by the freelist_lck not the
 	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
+	 * holding the buffer spinlock.
 	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	for(;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
-		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 
-		/* Unconditionally remove buffer from freelist */
-		StrategyControl->firstFreeBuffer = buf->freeNext;
-		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		if (StrategyControl->firstFreeBuffer >= 0)
+		{
+			buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
 
-		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
-		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+			/* Unconditionally remove buffer from freelist */
+			StrategyControl->firstFreeBuffer = buf->freeNext;
+			buf->freeNext = FREENEXT_NOT_IN_LIST;
+			--StrategyControl->numFreeListBuffers;
+
+			SpinLockRelease(&StrategyControl->freelist_lck);
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.  (This can only happen if VACUUM put a
+			 * valid buffer in the freelist and then someone else used it before
+			 * we got to it.  It's probably impossible altogether as of 8.3, but
+			 * we'd better check anyway.)
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+		}
+		else
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			break;
 		}
-		UnlockBufHdr(buf);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
+	/**lock_held = true;
+	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);*/
+
 	for (;;)
 	{
+		SpinLockAcquire(&StrategyControl->victimbuf_lck);
+
 		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
 
 		if (++StrategyControl->nextVictimBuffer >= NBuffers)
 		{
 			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
+			/*StrategyControl->completePasses++;*/
 		}
 
+		SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; decrement the usage_count (unless pinned) and keep scanning.
@@ -241,7 +271,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,11 +283,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
 }
 
+
 /*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
@@ -274,8 +343,10 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
 	result = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
 	if (num_buf_alloc)
@@ -283,11 +354,69 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
 	}
-	LWLockRelease(BufFreelistLock);
 	return result;
 }
 
 /*
+ * StrategySyncStartAndEnd -- tell BgWriter where to start looking
+ * for unused buffers.
+ *
+ * The result is the buffer index of the best buffer to start looking for
+ * unused buffers, number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs.
+ *
+ * In addition, we return the number of of buffers on freelist.
+ */
+int
+StrategySyncStartAndEnd(uint32 *start, uint32 *end, uint32 *num_buf_alloc)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	*start = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
+	/*
+	 * Ideally numFreeListBuffers should get called under freelist spinlock,
+	 * however here we need this number for estimating approximate number of
+	 * free buffers required on freelist, so it should not be a problem, even
+	 * if numFreeListBuffers is not exact.
+	 */
+
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighThreshold)
+		*end = freelistHighThreshold - curfreebuffers;
+	else
+		*end = 0;
+
+	/*
+	 * We need numBufferAllocs just for statistics purpose, so getting
+	 * the number with lock.
+	 */
+	if (num_buf_alloc)
+	{
+		*num_buf_alloc = StrategyControl->numBufferAllocs;
+		StrategyControl->numBufferAllocs = 0;
+	}
+
+	return curfreebuffers;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell BgWriter which next unused
+ * buffer to look for syncing.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	if (++StrategyControl->nextVictimBuffer >= NBuffers)
+		StrategyControl->nextVictimBuffer = 0;
+	*next_victim_buffer = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -309,6 +438,12 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
 }
 
 
+void
+StrategyInitBgWriterLatch(Latch *bgwriterLatch)
+{
+	StrategyControl->bgwriterLatch = bgwriterLatch;
+}
+
 /*
  * StrategyShmemSize
  *
@@ -376,6 +511,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -386,9 +522,43 @@ StrategyInitialize(bool init)
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
+		SpinLockInit(&StrategyControl->victimbuf_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high threshold number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.
+	 */
+	if (NBuffers > 100000)
+	{
+		freelistLowThreshold = 200;
+		freelistHighThreshold = 2000;
+	}
+	else if (NBuffers > 10000)
+	{
+		freelistLowThreshold = 100;
+		freelistHighThreshold = 1000;
+	}
+	else if (NBuffers > 1000)
+	{
+		freelistLowThreshold = 50;
+		freelistHighThreshold = 200;
+	}
+	else if (NBuffers > 100)
+	{
+		freelistLowThreshold = 30;
+		freelistHighThreshold = 75;
+	}
+	else
+	{
+		freelistLowThreshold = 5;
+		freelistHighThreshold = 15;
+	}
 }
 
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0892533..2b55bca 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -397,6 +397,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_backend;
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
+	PgStat_Counter m_buf_freelist;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -545,7 +546,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9D
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -670,6 +671,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	PgStat_Counter buf_freelist;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..54a8b8f 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -161,6 +161,16 @@ typedef struct sbufdesc
 #define FREENEXT_NOT_IN_LIST	(-2)
 
 /*
+ * Threshold indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low threshold, the
+ * allocating backend sets the latch and bgwriter wakesup and begin
+ * adding buffer's to freelist until it reaches high threshold and then
+ * again goes back to sleep.
+ */
+int freelistLowThreshold;
+int freelistHighThreshold;
+
+/*
  * Macros for acquiring/releasing a shared buffer header's spinlock.
  * Do not apply these to local buffers!
  *
@@ -188,11 +198,16 @@ extern BufferDesc *LocalBufferDescriptors;
 extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 				  bool *lock_held);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern int  StrategySyncStartAndEnd(uint32 *start, uint32 *end,
+									uint32 *num_buf_alloc);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgWriterLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..b0e5598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgBufferSyncAndMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d588b14..cd26ff0 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -136,7 +136,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4

#12

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#7)

Re: Scaling shared buffer eviction

On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffer's in free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakesup and begin

adding buffer's to freelist until it reaches high threshold and then

again goes back to sleep.

This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters removed, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist()
reveals that you've removed at least one behavior that some people (at
least, me) will care about, which is the guarantee that the background
writer will scan the entire buffer pool at least every couple of minutes.
This is important because it guarantees that dirty data doesn't sit in
memory forever. When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk. This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.

b. New stats for number of buffers on freelist has been added, some

old one's like maxwritten_clean can be removed as new logic for
syncing buffers and moving them to free list doesn't use them.
However I think it's better to remove them once the new logic is
accepted. Added some new logs for info related to free list under
BGW_DEBUG

If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense. I think what you should be counting is
the number of allocations that are being satisfied from the free-list.
Then, by comparing the rate at which that value is incrementing to the rate
at which buffers_alloc is incrementing, somebody can figure out what
percentage of allocations are requiring a clock-sweep run. Actually, I
think it's better to flip it around: count the number of allocations that
require an individual backend to run the clock sweep (vs. being satisfied
from the free-list); call it, say, buffers_backend_clocksweep. We can then
try to tune the patch to make that number as small as possible under
varying workloads.

c. Used the already existing bgwriterLatch in BufferStrategyControl to

wake bgwriter when number of buffer's in freelist drops below
threshold.

Seems like a good idea.

d. Autotune the low and high threshold for freelist for various
configurations. Generally if keep small number (200~2000) of buffers
always available on freelist, then even for high shared buffers
like 15GB, it appears to be sufficient. However when the value
of shared buffer's is less, then we need much smaller number. I
think we can provide these as config knobs for user as well, but for
now based on LWLOCK_STATS result, I have chosen some hard
coded values for low and high threshold values for freelist.
Values for low and high threshold have been decided based on total
number of shared buffers, basically I have divided them into 5
categories (16~100, 100~1000, 1000~10000, 10000~100000,
100000 and above) and then ran tests(read-only pgbench) for various
configurations falling under these categories. The reason for keeping
lesser categories for larger shared buffers is that if there are small
number (200~2000) of buffers available on free list, then it seems to
be sufficient for quite high loads, however as the total number of
shared
buffer's decreases we need to be more careful as if we keep the
number as
too low then it will lead to more clock sweep by backends (which means
freelist lock contention) and if we keep number higher bgwriter will
evict
many useful buffers. Results based on LWLOCK_STATS is at end of mail.

I think we need to come up with some kind of formula here rather than just
a list of hard-coded constants. And it definitely needs some comments
explaining the logic behind the choices.

Aside from those specific remarks, I think the elephant in the room is the
question of whether it really makes sense to have one process which is
responsible both for populating the free list and for writing buffers to
disk. One problem, which I alluded to above under point (1), is that we
might sometimes want to ensure that dirty buffers are written out to disk
without decrementing usage counts or adding anything to the free list.
This is a potentially solvable problem, though, because we can figure out
the number of buffers that we need to scan for freelist population and the
number that we need to scan for minimum buffer pool cleaning (one cycle
every 2 minutes). Once we've met the first goal, any further buffers we
run into under the second goal get cleaned if appropriate but their usage
counts don't get pushed down nor do they get added to the freelist. Once
we meet the second goal, we can go back to sleep.

But the other problem, which I think is likely unsolvable, is that writing
a dirty page can take a long time on a busy system (multiple seconds) and
the freelist can be emptied much, much quicker than that (milliseconds).
Although your benchmark results show great speed-ups on read-only
workloads, we're not really going to get the benefit consistently on
read-write workloads -- unless of course the background writer fails to
actually write anything, which should be viewed as a bug, not a feature --
because the freelist will often be empty while the background writer is
blocked on I/O.

I'm wondering if it would be a whole lot simpler and better to introduce a
new background process, maybe with a name like bgreclaim. That process
wouldn't write dirty buffers. Instead, it would just run the clock sweep
(i.e. the last loop inside StrategyGetBuffer) and put the buffers onto the
free list. Then, we could leave the bgwriter logic more or less intact.
It certainly needs improvement, but that could be another patch.

Incidentally, while I generally think your changes to the locking regimen
in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK. I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect everything the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the
LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save
the values of StrategyControl->bgwriterLatch; pop a buffer off the freelist
if there is one, saving its identity. Release the spinlock. Then, set the
bgwriterLatch if needed. In the first loop, first check whether the buffer
we previously popped from the freelist is pinned or has a non-zero usage
count and return it if not, holding the buffer header lock. Otherwise,
reacquire the spinlock just long enough to pop a new potential victim and
then loop around.

Under this locking strategy, StrategyNotifyBgWriter would use
freelist_lck. Right now, the patch removes the only caller, and should
therefore remove the function as well, but if we go with the new-process
idea listed above that part would get reverted, and then you'd need to make
it use the correct spinlock. You should also go through this patch and
remove all the commented-out bits and pieces that you haven't cleaned up;
those are distracting and unhelpful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#12)

Re: Scaling shared buffer eviction

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

This essentially removes BgWriterDelay, but it's still mentioned in

BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters removed, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

I have kept it just for the reason that if the basic approach is
sounds reasonable/accepted, then I will clean it up. Sorry for
the inconvenience, I didn't realized that it can be annoying for
reviewer, I will remove all such code from patch in next version.

A comparison of BgBufferSync() with

BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least
one behavior that some people (at least, me) will care about, which is the
guarantee that the background writer will scan the entire buffer pool at
least every couple of minutes.

Okay, I will take care of this based on the conclusion of
the other points in this mail.

This is important because it guarantees that dirty data doesn't sit in

memory forever. When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk. This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.

b. New stats for number of buffers on freelist has been added, some
old one's like maxwritten_clean can be removed as new logic for
syncing buffers and moving them to free list doesn't use them.
However I think it's better to remove them once the new logic is
accepted. Added some new logs for info related to free list under
BGW_DEBUG

If I'm reading this right, the new statistic is an incrementing counter

where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense.

I think using 'number of buffers currently on the freelist' and
'number of recently allocated buffers' for consecutive cycles,
we can figure out approximately how many buffer allocations
needs clock sweep assuming low and high threshold water
marks are fixed. However there can be cases where it is not
easy to estimate that number.

I think what you should be counting is the number of allocations that are

being satisfied from the free-list. Then, by comparing the rate at which
that value is incrementing to the rate at which buffers_alloc is
incrementing, somebody can figure out what percentage of allocations are
requiring a clock-sweep run. Actually, I think it's better to flip it
around: count the number of allocations that require an individual backend
to run the clock sweep (vs. being satisfied from the free-list); call it,
say, buffers_backend_clocksweep. We can then try to tune the patch to make
that number as small as possible under varying workloads.

This can give us clear idea to tune the patch, however we need
to maintain 3 counters for it in code(recent_alloc (needed for
current bgwriter logic) and other 2 suggested by you). Do you
want to retain such counters in code or it's for kind of debug info
for patch?

d. Autotune the low and high threshold for freelist for various
configurations.

I think we need to come up with some kind of formula here rather than

just a list of hard-coded constants.

That was my initial intention as well and I have tried based
on number of shared buffers like keeping threshold values as
percentage of shared buffers but nothing could satisfy different
kind of workloads. The current values I have choosen are based
on experiments for various workloads at different thresholds. I have
shown the lwlock_stats data for various loads based on current
thresholds upthread. Another way could be to make them as config
knobs and use the values as given by user incase it is provided by
user else go with fixed values.

There are other instances in code as well (one of them I remember
offhand is in pglz_compress) where we use fixed values based on
different sizes.

And it definitely needs some comments explaining the logic behind the

choices.

Agreed, I shall improve them in next version of patch.

Aside from those specific remarks, I think the elephant in the room is

the question of whether it really makes sense to have one process which is
responsible both for populating the free list and for writing buffers to
disk. One problem, which I alluded to above under point (1), is that we
might sometimes want to ensure that dirty buffers are written out to disk
without decrementing usage counts or adding anything to the free list.
This is a potentially solvable problem, though, because we can figure out
the number of buffers that we need to scan for freelist population and the
number that we need to scan for minimum buffer pool cleaning (one cycle
every 2 minutes). Once we've met the first goal, any further buffers we
run into under the second goal get cleaned if appropriate but their usage
counts don't get pushed down nor do they get added to the freelist. Once
we meet the second goal, we can go back to sleep.

But the other problem, which I think is likely unsolvable, is that

writing a dirty page can take a long time on a busy system (multiple
seconds) and the freelist can be emptied much, much quicker than that
(milliseconds). Although your benchmark results show great speed-ups on
read-only workloads, we're not really going to get the benefit consistently
on read-write workloads -- unless of course the background writer fails to
actually write anything, which should be viewed as a bug, not a feature --
because the freelist will often be empty while the background writer is
blocked on I/O.

I'm wondering if it would be a whole lot simpler and better to introduce

a new background process, maybe with a name like bgreclaim.

That will certainly help in retaining the current behaviour of
bgwriter and make the idea cleaner. I will modify the patch
to have a new background process unless somebody thinks
otherwise.

That process wouldn't write dirty buffers.

If we go with this approach, one thing which we need to decide
is what to do incase buf which has usage_count as zero is *dirty*,
as I don't think it is good idea to put it in freelist. Few options to
handle such a case are:

a. Skip such a buffer; the downside is if we have to skip lot
of buffers due to this reason then having separate process
such as bgreclaim will be less advantageous.
b. Skip the buffer and notify bgwriter to flush buffers, now this
notification can be sent either as soon as we encounter one
such buffer or after few such buffers (incase of few, we need to decide
some useful number). In this option, there is a chance that bgwriter
decide not to flush buffer/'s which ideally should not happen because
I think bgwriter considers the number of recent allocations for
performing scan to flush dirt buffers.
c. Have some mechanism where bgreclaim can notify bgwriter
to flush some specific buffers. I think if we have such a mechanism
that can be later even used by backends if required.
d. Keep the logic as per current patch and improve such that it can
retain the behaviour of one cycle per two minutes as suggested above
by you on the basis that in anycase it is better than the current code.

I don't think option (d) is best way to handle this scenario,however I
kept it incase nothing else sounds reasonable. Option (c) might have
lot of work which I am not sure is justifiable to handle the current
scenario,
though it can be useful for some other things. Option (a) should be okay
for most cases, but I think option (b) would be better.

Instead, it would just run the clock sweep (i.e. the last loop inside

StrategyGetBuffer) and put the buffers onto the free list.

Don't we need to do more than just last loop inside StrategyGetBuffer(),
as clock sweep in strategy get buffer is responsible for getting one
buffer with usage_count = 0 where as we need to run the loop till it
finds and moves enough such buffers so that it can populate freelist
with number of buffers equal to high water mark of freelist.

Then, we could leave the bgwriter logic more or less intact. It certainly

needs improvement, but that could be another patch.

Incidentally, while I generally think your changes to the locking regimen

I have kept them outside spinlock because as per patch the only
callsite for setting StrategyControl->bgwriterLatch is StrategyGetBuffer()
and StrategyControl->numBufferAllocs is used just for statistics purpose
(which I thought might be okay even if it is not accurate) whereas without
patch it is used by bgwriter for purpose other than stats as well.
However it certainly needs to be protected for separate bgreclaim process
idea or for retaining current bgwriter behaviour.

I think you should get rid of BufFreelistLock completely and just decide

that freelist_lck will protect everything the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where

the LWLock is acquired today. Increment StrategyControl->numBufferAllocs;
save the values of StrategyControl->bgwriterLatch; pop a buffer off the
freelist if there is one, saving its identity. Release the spinlock.
Then, set the bgwriterLatch if needed. In the first loop, first check
whether the buffer we previously popped from the freelist is pinned or has
a non-zero usage count and return it if not, holding the buffer header
lock. Otherwise, reacquire the spinlock just long enough to pop a new
potential victim and then loop around.

I shall take care of doing this way in next version of patch.

Under this locking strategy, StrategyNotifyBgWriter would use

freelist_lck. Right now, the patch removes the only caller, and should
therefore remove the function as well, but if we go with the new-process
idea listed above that part would get reverted, and then you'd need to make
it use the correct spinlock. You should also go through this patch and
remove all the commented-out bits and pieces that you haven't cleaned up;
those are distracting and unhelpful.

Sure.

Thank you for review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#13)

Re: Scaling shared buffer eviction

On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense.

I think using 'number of buffers currently on the freelist' and
'number of recently allocated buffers' for consecutive cycles,
we can figure out approximately how many buffer allocations
needs clock sweep assuming low and high threshold water
marks are fixed. However there can be cases where it is not
easy to estimate that number.

Counters should be design in such a way that you can read it, and then
read it again later, and make sense of it - you should not need to
read the counter on *consecutive* cycles to interpret it.

I think what you should be counting is the number of allocations that are
being satisfied from the free-list. Then, by comparing the rate at which
that value is incrementing to the rate at which buffers_alloc is
incrementing, somebody can figure out what percentage of allocations are
requiring a clock-sweep run. Actually, I think it's better to flip it
around: count the number of allocations that require an individual backend
to run the clock sweep (vs. being satisfied from the free-list); call it,
say, buffers_backend_clocksweep. We can then try to tune the patch to make
that number as small as possible under varying workloads.

This can give us clear idea to tune the patch, however we need
to maintain 3 counters for it in code(recent_alloc (needed for
current bgwriter logic) and other 2 suggested by you). Do you
want to retain such counters in code or it's for kind of debug info
for patch?

I only mean to propose one new counter, and I'd imagine including that
in the final patch. We already have a counter of total buffer
allocations; that's buffers_alloc. I'm proposing to add an additional
counter for the number of those allocations not satisfied from the
free list, with a name like buffers_alloc_clocksweep (I said
buffers_backend_clocksweep above, but that's probably not best, as the
existing buffers_backend counts buffer *writes*, not allocations). I
think we would definitely want to retain this counter in the final
patch, as an additional column in pg_stat_bgwriter.

d. Autotune the low and high threshold for freelist for various
configurations.

I think we need to come up with some kind of formula here rather than just
a list of hard-coded constants.

That was my initial intention as well and I have tried based
on number of shared buffers like keeping threshold values as
percentage of shared buffers but nothing could satisfy different
kind of workloads. The current values I have choosen are based
on experiments for various workloads at different thresholds. I have
shown the lwlock_stats data for various loads based on current
thresholds upthread. Another way could be to make them as config
knobs and use the values as given by user incase it is provided by
user else go with fixed values.

How did you go about determining the optimal value for a particular workload?

When the list is kept short, it's less likely that a value on the list
will be referenced or dirtied again before the page is actually
recycled. That's clearly good. But when the list is long, it's less
likely to become completely empty and thereby force individual
backends to run the clock-sweep. My suspicion is that, when the
number of buffers is small, the impact of the list being too short
isn't likely to be very significant, because running the clock-sweep
isn't all that expensive anyway - even if you have to scan through the
entire buffer pool multiple times, there aren't that many buffers.
But when the number of buffers is large, those repeated scans can
cause a major performance hit, so having an adequate pool of free
buffers becomes much more important.

I think your list of high-watermarks is far too generous for low
buffer counts. With more than 100k shared buffers, you've got a
high-watermark of 2k buffers, which means that 2% or less of the
buffers will be on the freelist, which seems a little on the high side
to me, but probably in the ballpark of what we should be aiming for.
But at 10001 shared buffers, you can have 1000 of them on the
freelist, which is 10% of the buffer pool; that seems high. At 101
shared buffers, 75% of the buffers in the system can be on the
freelist; that seems ridiculous. The chances of a buffer still being
unused by the time it reaches the head of the freelist seem very
small.

Based on your existing list of thresholds, and taking the above into
account, I'd suggest something like this: let the high-watermark for
the freelist be 0.5% of the total number of buffers, with a maximum of
2000 and a minimum of 5. Let the low-watermark be 20% of the
high-watermark. That might not be best, but I think some kind of
formula like that can likely be made to work. I would suggest
focusing your testing on configurations with *large* settings for
shared_buffers, say 1-64GB, rather than small configurations. Anyone
who cares greatly about performance isn't going to be running with
only 8MB of shared_buffers anyway. Arguably we shouldn't even run the
reclaim process on very small configurations; I think there should
probably a GUC (PGC_SIGHUP) to control whether it gets launched.

I think it would be a good idea to analyze how frequently the reclaim
process gets woken up. In the worst case, this happens once per (high
watermark - low watermark) allocations; that is, the system reaches
the low watermark and then does no further allocations until the
reclaim process brings the freelist back up to the high watermark.
But if more allocations occur between the time the reclaim process is
woken and the time it reaches the high watermark, then it should run
for longer, until the high watermark is reached. At least for
debugging purposes, I think it would be useful to have a counter of
reclaim wakeups. I'm not sure whether that's worth including in the
final patch, but it might be.

That will certainly help in retaining the current behaviour of
bgwriter and make the idea cleaner. I will modify the patch
to have a new background process unless somebody thinks
otherwise.

If we go with this approach, one thing which we need to decide
is what to do incase buf which has usage_count as zero is *dirty*,
as I don't think it is good idea to put it in freelist.

I thought a bit about this yesterday. I think the problem is that we
might be in a situation where buffers are being dirtied faster than
they can be cleaned. In that case, if we only put clean buffers on the
freelist, then every backend in the system will be fighting over the
ever-dwindling supply of clean buffers until, in the worst case,
there's maybe only 1 clean buffer which is getting evicted repeatedly
at top speed - or maybe even no clean buffers, and the reclaim process
just spins in an infinite loop looking for clean buffers that aren't
there.

To put that another way, the rate at which buffers are being dirtied
can't exceed the rate at which they are being cleaned forever.
Eventually, somebody is going to have to wait. Having the backends
wait by being forced to write some dirty buffers does not seem like a
bad way to accomplish that. So I favor just putting the buffers on
freelist without regard to whether they are clean or dirty. If this
turns out not to work well we can look at other options (probably some
variant of (b) from your list).

Instead, it would just run the clock sweep (i.e. the last loop inside
StrategyGetBuffer) and put the buffers onto the free list.

Don't we need to do more than just last loop inside StrategyGetBuffer(),
as clock sweep in strategy get buffer is responsible for getting one
buffer with usage_count = 0 where as we need to run the loop till it
finds and moves enough such buffers so that it can populate freelist
with number of buffers equal to high water mark of freelist.

Yeah, that's what I meant. Of course, it should add each buffer to
the freelist individually, not batch them up and add them all at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Amit Kapila (#13)

Re: Scaling shared buffer eviction

On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

This essentially removes BgWriterDelay, but it's still mentioned in

BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters removed, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.

FWIW, I found this email amost unreadable because it misses quoting
signs after linebreaks in quoted content.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Andres Freund (#15)

Re: Scaling shared buffer eviction

On Wed, Aug 13, 2014 at 2:32 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

This essentially removes BgWriterDelay, but it's still mentioned in

BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer

called

from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what

you've

changed. I realize you probably left it that way for testing purposes,

but

you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters removed, you need to rip them out,

that the scope of the changes you've made is clear to reviewers.

FWIW, I found this email amost unreadable because it misses quoting
signs after linebreaks in quoted content.

I think I have done something wrong while replying to Robert's
mail, the main point in that mail was trying to see if there is any
major problem incase we have separate process (bgreclaim) to
populate freelist. One thing which I thought could be problematic
is to put a buf in freelist which has usage_count as zero and is *dirty*.
Please do let me know if you want clarification for something in
particular.

Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.

There are other things also which I need to take care as per
feedback like some change in locking strategy and code.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#17

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Amit Kapila (#16)

Re: Scaling shared buffer eviction

On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:

Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.

I'm not convinced that 3) is the right way to go to be honest. Seems
like a huge bandaid to me.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Andres Freund (#17)

Re: Scaling shared buffer eviction

On Wed, Aug 13, 2014 at 4:25 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:

Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.

I'm not convinced that 3) is the right way to go to be honest. Seems
like a huge bandaid to me.

Doing both (populating freelist and flushing dirty buffers) via bgwriter
isn't the best way either because it might not be able to perform
both the jobs as per need.
One example is it could take much longer time to flush a dirty buffer
than to move it into free list, so if there are few buffers which we need
to flush, then I think task of maintaining buffers in freelist will get hit
even though there are buffers in list which can be moved to
free list(non-dirty buffers).
Another is maintaining the current behaviour of bgwriter which is to scan
the entire buffer pool every few mins (assuming default configuration).
We can attempt to solve this problem as suggested by Robert upthread
but I am not completely sure if that can guarantee that the current
behaviour will be retained as it is.

I am not telling that having a separate process won't have any issues,
but I think we can tackle them without changing or complicating current
bgwriter behaviour.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#12)

Re: Scaling shared buffer eviction

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Incidentally, while I generally think your changes to the locking regimen

in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK. I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect everything the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Then, in StrategyGetBuffer, acquire the freelist_lck at the point where

Today, while working on updating the patch to improve locking
I found that as now we are going to have a new process, we need
a separate latch in StrategyControl to wakeup that process.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

I thought it is better to mention about above points so that if you have
any different thoughts about it, then it is better to discuss them now
rather than after I take performance data with this locking protocol.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#20

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Amit Kapila (#19)

Re: Scaling shared buffer eviction

Amit Kapila <amit.kapila16@gmail.com> writes:

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect everything the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.

A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently. For a patch whose sole
excuse for existence is to improve performance, that should be a very
scary concern.

(And yes, I realize these issues already affect the freelist. Perhaps
that's part of the reason we have performance issues with it.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Tom Lane (#20)

Re: Scaling shared buffer eviction

On Tue, Aug 26, 2014 at 8:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.

In some cases, it could be beneficial especially when a,b,c are
going to be more frequently accessed as compare to x,y,z.

A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently.

I think patch will reduce the contention for some of such variables
(which are accessed during clock sweep) as it will minimize the need
to perform clock sweep by backends.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#22

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Tom Lane (#20)

Re: Scaling shared buffer eviction

On Tue, Aug 26, 2014 at 11:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect everything the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.

Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.

A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).

But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently. For a patch whose sole
excuse for existence is to improve performance, that should be a very
scary concern.

(And yes, I realize these issues already affect the freelist. Perhaps
that's part of the reason we have performance issues with it.)

False sharing is certainly a concern that has crossed my mine while
looking at Amit's work, but the performance numbers he's posted
upthread are stellar. Maybe we can squeeze some additional
performance out of this by padding out the cache lines, but it's
probably minor compared to the gains he's already seeing. I think we
should focus on trying to lock in those gains, and then we can
consider what further things we may want to do after that. If it
turns out that structure-padding is among those things, that's easy
enough to do as a separate patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#19)

Re: Scaling shared buffer eviction

On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Today, while working on updating the patch to improve locking
I found that as now we are going to have a new process, we need
a separate latch in StrategyControl to wakeup that process.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

Sounds reasonable. I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer. I don't see much point
in fine-tuning the locking regimen until that's done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#23)

2 attachment(s)

Re: Scaling shared buffer eviction

On Wed, Aug 27, 2014 at 8:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Today, while working on updating the patch to improve locking
I found that as now we are going to have a new process, we need
a separate latch in StrategyControl to wakeup that process.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.

Sounds reasonable. I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer. I don't see much point
in fine-tuning the locking regimen until that's done.

I have updated the patch to address the feedback. Main changes are:

1. For populating freelist, have a separate process (bgreclaimer)
instead of doing it by bgwriter.
2. Autotune the low and high threshold values for buffers
in freelist. I have used the formula as suggested by you upthread.
3. Cleanup of locking regimen as discussed upthread (completely
eliminated BufFreelist Lock).
4. Improved comments and general code cleanup.

I have not yet added statistics (buffers_backend_clocksweep) as
for that we need to add one more variable in BufferStrategyControl
structure where I have already added few variables for this patch.
I think it is important to have such a stat available via
pg_stat_bgwriter, but not sure if it is worth to make the structure
bit more bulky.

Another minor point is about changes in lwlock.h
lwlock.h
* if you remove a lock, consider leaving a gap in the numbering
* sequence for the benefit of DTrace and other external debugging
* scripts.

As I have removed BufFreelist lock, I have adjusted the numbering
as well in lwlock.h. There is a meesage on top of lock definitions
which suggest to leave gap if we remove any lock, however I was not
sure whether this case (removing the first element) can effect anything,
so for now, I have adjusted the numbering.

I have yet to collect data under varying loads, however I have
collected performance data for 8GB shared buffers which shows
reasonably good performance and scalability.

I think the main part left for this patch is more data for various loads
which I will share in next few days, however I think patch is ready for
next round of review, so I will mark it as Needs Review.

Performance Data:
-------------------------------

Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads

RAM = 64GB

Database Locale =C

checkpoint_segments=256

checkpoint_timeout =15min

shared_buffers=8GB

scale factor = 3000

Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)

Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client Count/Patch_ver 8 16 32 64 128 HEAD 58614 107370 140717 104357
65010 Patch 60849 118701 165631 209226 213029

Note -
a. The numbers are slightly different than previously reported
numbers as earlier I was using debug mode of binaries to take
data and it seems some kind of trace was enabled on m/c.
However the improve in performance and scalability is almost
similar to previous.
b. Above data is median of 3 runs, for detailed data refer attached
document (perf_read_scalability_data_v5.ods)

CPU Usage
------------------
I have observed that CPU usage for new process (reclaimer) is
between 5~9%.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v5.patchapplication/octet-stream; name=scalable_buffer_eviction_v5.patchDownload

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..38698b0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "postmaster/bgreclaimer.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -179,7 +180,8 @@ static IndexList *ILHead = NULL;
  *	 AuxiliaryProcessMain
  *
  *	 The main entry point for auxiliary processes, such as the bgwriter,
- *	 walwriter, walreceiver, bootstrapper and the shared memory checker code.
+ *	 walwriter, walreceiver, bgreclaimer, bootstrapper and the shared
+ *	 memory checker code.
  *
  *	 This code is here just because of historical reasons.
  */
@@ -323,6 +325,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case WalReceiverProcess:
 				statmsg = "wal receiver process";
 				break;
+			case BgReclaimerProcess:
+				statmsg = "reclaimer process";
+				break;
 			default:
 				statmsg = "??? process";
 				break;
@@ -437,6 +442,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			WalReceiverMain();
 			proc_exit(1);		/* should never return */
 
+		case BgReclaimerProcess:
+			/* don't set signals, bgreclaimer has its own agenda */
+			BackgroundReclaimerMain();
+			proc_exit(1);		/* should never return */
+
 		default:
 			elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
 			proc_exit(1);
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..168d0d8 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o bgreclaimer.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..55dc157
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c
@@ -0,0 +1,302 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they would only do when they don't find a usable shared buffer from
+ * freelist to read in another page).  In the best scenario all requests
+ * for shared buffers will be fulfilled from freelist as the background
+ * reclaimer process always tries to maintain buffers on freelist.  However,
+ * regular backends are still empowered to run clock sweep to find a usable
+ * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
+ *
+ * The bgwriter is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGTERM, which instructs the bgreclaimer to exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the bgreclaimer will
+ * simply abort and exit on SIGQUIT.
+ *
+ * If the bgreclaimer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/bgreclaimer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgreclaimer.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+
+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for bgreclaim process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+BackgroundReclaimerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext bgreclaim_context;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.  (bgreclaim probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us.
+	 *
+	 * bgreclaim doesn't participate in ProcSignal signalling, but a SIGUSR1
+	 * handler is still needed for latch wakeups.
+	 */
+	pqsignal(SIGHUP, BgreclaimSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, ReqShutdownHandler);		/* shutdown */
+	pqsignal(SIGQUIT, bgreclaim_quickdie);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, bgreclaim_sigusr1_handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  As of now, the memory allocation can be done
+	 * only during processing of SIGHUP signal.
+	 */
+	bgreclaim_context = AllocSetContextCreate(TopMemoryContext,
+											 "Background Reclaim",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(bgreclaim_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().  We don't have very many resources to worry
+		 * about in bgreclaim, but we do have buffers and file descriptors.
+		 */
+		UnlockBuffers();
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(bgreclaim_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	StrategyInitBgReclaimerLatch(&MyProc->procLatch);
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		int			rc;
+
+		/* Clear any already-pending wakeups */
+		ResetLatch(&MyProc->procLatch);
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Normal exit from the bgwriter is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Backend will signal bgreclaimer when the number of buffers in
+		 * freelist fall below than low threshhold of freelist.
+		 */
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+					   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgMoveBuffersToFreelist();
+
+		/*
+		 * Send off activity statistics to the stats collector
+		 */
+		pgstat_send_bgwriter();
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			exit(1);
+	}
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * bgreclaim_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+bgreclaim_quickdie(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+BgreclaimSigHupHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_SIGHUP = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGTERM: set flag to shutdown and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	shutdown_requested = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+bgreclaim_sigusr1_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	latch_sigusr1_handler();
+
+	errno = save_errno;
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b190cf5..1a34282 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -143,13 +143,13 @@
  * authorization phase).  This is used mainly to keep track of how many
  * children we have and send them appropriate signals when necessary.
  *
- * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list.  Autovacuum worker and walsender are in it.
- * Also, "dead_end" children are in it: these are children launched just for
- * the purpose of sending a friendly rejection message to a would-be client.
- * We must track them because they are attached to shared memory, but we know
- * they will never become live backends.  dead_end children are not assigned a
- * PMChildSlot.
+ * "Special" children such as the startup, bgwriter, bgreclaimer and
+ * autovacuum launcher tasks are not in this list.  Autovacuum worker and
+ * walsender are in it.  Also, "dead_end" children are in it: these are
+ * children launched just for the purpose of sending a friendly rejection
+ * message to a would-be client.  We must track them because they are attached
+ * to shared memory, but we know they will never become live backends.
+ * dead_end children are not assigned a PMChildSlot.
  *
  * Background workers that request shared memory access during registration are
  * in this list, too.
@@ -243,7 +243,8 @@ static pid_t StartupPID = 0,
 			AutoVacPID = 0,
 			PgArchPID = 0,
 			PgStatPID = 0,
-			SysLoggerPID = 0;
+			SysLoggerPID = 0,
+			BgReclaimerPID = 0;
 
 /* Startup/shutdown state */
 #define			NoShutdown		0
@@ -269,13 +270,13 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * hot standby during archive recovery.
  *
  * When the startup process is ready to start archive recovery, it signals the
- * postmaster, and we switch to PM_RECOVERY state. The background writer and
- * checkpointer are launched, while the startup process continues applying WAL.
- * If Hot Standby is enabled, then, after reaching a consistent point in WAL
- * redo, startup process signals us again, and we switch to PM_HOT_STANDBY
- * state and begin accepting connections to perform read-only queries.  When
- * archive recovery is finished, the startup process exits with exit code 0
- * and we switch to PM_RUN state.
+ * postmaster, and we switch to PM_RECOVERY state. The background writer,
+ * background reclaimer and checkpointer are launched, while the startup
+ * process continues applying WAL.  If Hot Standby is enabled, then, after
+ * reaching a consistent point in WAL redo, startup process signals us again,
+ * and we switch to PM_HOT_STANDBY state and begin accepting connections to
+ * perform read-only queries.  When archive recovery is finished, the startup
+ * process exits with exit code 0 and we switch to PM_RUN state.
  *
  * Normal child backends can only be launched when we are in PM_RUN or
  * PM_HOT_STANDBY state.  (We also allow launch of normal
@@ -505,6 +506,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
+#define StartBackgroundReclaimer() StartChildProcess(BgReclaimerProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -568,8 +570,8 @@ PostmasterMain(int argc, char *argv[])
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-	 * postmaster/syslogger.c, postmaster/bgworker.c and
-	 * postmaster/checkpointer.c.
+	 * postmaster/syslogger.c, postmaster/bgworker.c, postmaster/bgreclaimer.c
+	 * and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1583,7 +1585,8 @@ ServerLoop(void)
 		/*
 		 * If no background writer process is running, and we are not in a
 		 * state that prevents it, start one.  It doesn't matter if this
-		 * fails, we'll just try again later.  Likewise for the checkpointer.
+		 * fails, we'll just try again later.  Likewise for the checkpointer
+		 * and bgreclaimer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 			pmState == PM_HOT_STANDBY)
@@ -1592,6 +1595,8 @@ ServerLoop(void)
 				CheckpointerPID = StartCheckpointer();
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 		}
 
 		/*
@@ -2330,6 +2335,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(SysLoggerPID, SIGHUP);
 		if (PgStatPID != 0)
 			signal_child(PgStatPID, SIGHUP);
+		if (BgReclaimerPID != 0)
+			signal_child(BgReclaimerPID, SIGHUP);
 
 		/* Reload authentication config files too */
 		if (!load_hba())
@@ -2398,6 +2405,9 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
+				/* and the bgreclaimer too */
+				if (BgReclaimerPID != 0)
+					signal_child(BgReclaimerPID, SIGTERM);
 
 				/*
 				 * If we're in recovery, we can't kill the startup process
@@ -2440,14 +2450,16 @@ pmdie(SIGNAL_ARGS)
 				signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
+			if (BgReclaimerPID != 0)
+				signal_child(BgReclaimerPID, SIGTERM);
 			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
 				/*
-				 * Only startup, bgwriter, walreceiver, unconnected bgworkers,
-				 * and/or checkpointer should be active in this state; we just
-				 * signaled the first four, and we don't want to kill
-				 * checkpointer yet.
+				 * Only startup, bgwriter, walreceiver, bgreclaimer,
+				 * unconnected bgworkers, and/or checkpointer should be
+				 * active in this state; we just signaled the first five,
+				 * and we don't want to kill checkpointer yet.
 				 */
 				pmState = PM_WAIT_BACKENDS;
 			}
@@ -2600,6 +2612,8 @@ reaper(SIGNAL_ARGS)
 				BgWriterPID = StartBackgroundWriter();
 			if (WalWriterPID == 0)
 				WalWriterPID = StartWalWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2625,7 +2639,8 @@ reaper(SIGNAL_ARGS)
 		/*
 		 * Was it the bgwriter?  Normal exit can be ignored; we'll start a new
 		 * one at the next iteration of the postmaster's main loop, if
-		 * necessary.  Any other exit condition is treated as a crash.
+		 * necessary.  Any other exit condition is treated as a crash.  Likewise
+		 * for bgreclaimer.
 		 */
 		if (pid == BgWriterPID)
 		{
@@ -2636,6 +2651,17 @@ reaper(SIGNAL_ARGS)
 			continue;
 		}
 
+		if (pid == BgReclaimerPID)
+		{
+			BgReclaimerPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background reclaimer process"));
+			continue;
+		}
+
+
+
 		/*
 		 * Was it the checkpointer?
 		 */
@@ -2997,7 +3023,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, bgreclaimer or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3201,6 +3227,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the bgreclaimer too */
+	if (pid == BgReclaimerPID)
+		BgReclaimerPID = 0;
+	else if (BgReclaimerPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) BgReclaimerPID)));
+		signal_child(BgReclaimerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/*
 	 * Force a power-cycle of the pgarch process too.  (This isn't absolutely
 	 * necessary, but it seems like a good idea for robustness, and it
@@ -3371,14 +3409,14 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
 		 * (including autovac workers), no bgworkers (including unconnected
-		 * ones), and no walwriter, autovac launcher or bgwriter.  If we are
-		 * doing crash recovery or an immediate shutdown then we expect the
-		 * checkpointer to exit as well, otherwise not. The archiver, stats,
-		 * and syslogger processes are disregarded since they are not
-		 * connected to shared memory; we also disregard dead_end children
-		 * here. Walsenders are also disregarded, they will be terminated
-		 * later after writing the checkpoint record, like the archiver
-		 * process.
+		 * ones), and no walwriter, autovac launcher, bgwriter or bgreclaimer.
+		 * If we are doing crash recovery or an immediate shutdown then we
+		 * expect the checkpointer to exit as well, otherwise not. The
+		 * archiver, stats, and syslogger processes are disregarded since they
+		 * are not connected to shared memory; we also disregard dead_end
+		 * children here. Walsenders are also disregarded, they will be
+		 * terminated later after writing the checkpoint record, like the
+		 * archiver process.
 		 */
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
 			CountUnconnectedWorkers() == 0 &&
@@ -3388,7 +3426,8 @@ PostmasterStateMachine(void)
 			(CheckpointerPID == 0 ||
 			 (!FatalError && Shutdown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
-			AutoVacPID == 0)
+			AutoVacPID == 0 &&
+			BgReclaimerPID == 0)
 		{
 			if (Shutdown >= ImmediateShutdown || FatalError)
 			{
@@ -3486,6 +3525,7 @@ PostmasterStateMachine(void)
 			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
+			Assert(BgReclaimerPID == 0);
 			/* syslogger is not considered here */
 			pmState = PM_NO_CHILDREN;
 		}
@@ -3698,6 +3738,8 @@ TerminateChildren(int signal)
 		signal_child(WalReceiverPID, signal);
 	if (AutoVacPID != 0)
 		signal_child(AutoVacPID, signal);
+	if (BgReclaimerPID != 0)
+		signal_child(BgReclaimerPID, signal);
 	if (PgArchPID != 0)
 		signal_child(PgArchPID, signal);
 	if (PgStatPID != 0)
@@ -4779,6 +4821,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		CheckpointerPID = StartCheckpointer();
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(BgReclaimerPID == 0);
+		BgReclaimerPID = StartBackgroundReclaimer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -5123,6 +5167,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 						(errmsg("could not fork WAL receiver process: %m")));
 				break;
+			case BgReclaimerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork background writer process: %m")));
+				break;
 			default:
 				ereport(LOG,
 						(errmsg("could not fork process: %m")));
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..9b47eb2 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,12 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
+* BufferStrategyControl contains a spinlock freelist_lck that provides mutual
 exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+buffers for replacement.  Earlier to protect freelist, we use LWLOCK as that
+is needed to perform clock sweep which is a longer operation, however now we
+are using two spinklocks freelist_lck and victimbuf_lck to perform operations
+on freelist and run clock sweep respectively.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -160,16 +158,18 @@ Normal Buffer Replacement Strategy
 
 There is a "free list" of buffers that are prime candidates for replacement.
 In particular, buffers that are completely free (contain no valid page) are
-always in this list.  We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that.  The list is singly-linked using fields in the
+always in this list.  We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by background process
+reclaimer.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the freelist_lck, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations.  It works like
-this:
+need to take system-wide locks during common operations.  The background
+reclaimer attempts to keep regular backends from having to run clock sweep
+by maintaining buffers on freelist, however backends are also empowered
+to run clock sweep. Clock sweep works like this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned.  (This requires only the
@@ -178,25 +178,28 @@ buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+victimbuf_lck spinlock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain spinlock freelist_lck.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer and release
+the freelist_lck.  Now set the bgwriter or bgreclaimer latch if required.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
+3. If we get the buffer, check if it is neither pinned nor
+has nonzero usage count, pin the buffer, and return the buffer.
+Otherwise again try to get the buffer from freelist and return
+to the start of step 3.
+
+4. Otherwise, select the buffer pointed to by nextVictimBuffer, and
 circularly advance nextVictimBuffer for next time.
 
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
+5. If the selected buffer is pinned or has a nonzero usage count, it cannot
+be used.  Decrement its usage count (if nonzero) and return to step 4 to
 examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+6. Pin the selected buffer, and return the buffer.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +262,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take the spinlock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
@@ -281,3 +284,19 @@ As of 8.4, background writer starts during recovery mode when there is
 some form of potentially extended recovery to perform. It performs an
 identical service to normal processing, except that checkpoints it
 writes are technically restartpoints.
+
+
+Background Reclaimer's Processing
+---------------------------------
+
+The background reclaimer is designed to move buffers to freelist that are
+likely to be recycled soon, thereby offloading the need to perform
+clock sweep work from active backends.  To do this, it runs the clock sweep
+and move the the unpinned and zero usage count buffers to freelist.  It
+keep on doing this until the number of buffers in freelist become equal
+high threshold of freelist.
+
+Two threshold indicators are used to maintain sufficient number of buffers
+on freelist.  Low threshold indicator is used by backends to wake bgreclaimer
+when the number of buffers in freelist fall below it.  High threshold
+indicator is used by bgreclaimer to move buffers to freelist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 938c554..8df0eee 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -605,15 +605,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		bool		lock_held;
-
 		/*
 		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!  Also (in most cases) the BufFreelistLock is
-		 * still held, since it would be bad to hold the spinlock while
-		 * possibly waking up other processes.
+		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &lock_held);
+		buf = StrategyGetBuffer(strategy);
 
 		Assert(buf->refcount == 0);
 
@@ -623,10 +619,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		/* Pin the buffer and then release the buffer spinlock */
 		PinBuffer_Locked(buf);
 
-		/* Now it's safe to release the freelist lock */
-		if (lock_held)
-			LWLockRelease(BufFreelistLock);
-
 		/*
 		 * If the buffer was dirty, try to write it out.  There is a race
 		 * condition here, in that someone might dirty it after we released it
@@ -1637,6 +1629,74 @@ BgBufferSync(void)
 }
 
 /*
+ * Move buffers with reference and usage_count as zero to freelist.
+ * By maintaining enough number of buffers on freelist (equal to
+ * high threshold of freelsit), we drasticaly reduce the odds for
+ * backend's to perform clock sweep.
+ *
+ * This is called by the background reclaim process when the number
+ * of buffers in freelist fall below low threshold of freelist.
+ */
+void
+BgMoveBuffersToFreelist(void)
+{
+	volatile uint32	next_to_clean;
+	uint32	num_to_free;
+	uint32	tmp_num_to_free;
+	uint32  save_next_to_clean;
+	uint32	recent_alloc;
+	volatile BufferDesc *bufHdr;
+
+	StrategySyncStartAndEnd(&save_next_to_clean,
+							&num_to_free,
+							&recent_alloc);
+
+	/* Report buffer alloc counts to pgstat */
+	BgWriterStats.m_buf_alloc += recent_alloc;
+
+	tmp_num_to_free = num_to_free;
+	next_to_clean = save_next_to_clean;
+
+	/* Execute the LRU scan */
+	while (tmp_num_to_free > 0)
+	{
+		bufHdr = &BufferDescriptors[next_to_clean];
+
+		LockBufHdr(bufHdr);
+
+		if (bufHdr->refcount == 0)
+		{
+			if (bufHdr->usage_count > 0)
+			{
+				/*
+				 * Reduce usage count so that we can find the reusable
+				 * buffers in consecutive cycles.
+				 */
+				bufHdr->usage_count--;
+				UnlockBufHdr(bufHdr);
+			}
+			else
+			{
+				UnlockBufHdr(bufHdr);
+				if (StrategyMoveBufferToFreeListEnd (bufHdr))
+					tmp_num_to_free--;
+			}
+		}
+		else
+			UnlockBufHdr(bufHdr);
+
+		/* choose next victim buffer to clean. */
+		StrategySyncNextVictimBuffer(&next_to_clean);
+	}
+
+
+#ifdef BGW_DEBUG
+	elog(DEBUG1, "bgreclaimer: recent_alloc=%u next_to_clean=%d num_freed=%u",
+		 recent_alloc, save_next_to_clean, num_to_free);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..9594f92 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -43,9 +44,27 @@ typedef struct
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * protects freelist variables (firstFreeBuffer, lastFreeBuffer,
+	 * numFreeListBuffers, BufferDesc->freeNext).
+	 */
+	slock_t	     freelist_lck;
+
+	/*
+	 * Protects nextVictimBuffer and completePasses. We need separate
+	 * lock to protect victim buffer and completePasses so that
+	 * clock sweep of one backend doesn't contend with another backend
+	 * which is evicting buffer from freelist.
+	 */
+	slock_t	     victimbuf_lck;
+
+	/*
+	 * Latch to wake bgwriter.
 	 */
 	Latch	   *bgwriterLatch;
+	/*
+	 * Latch to wake bgreclaimer.
+	 */
+	Latch	   *bgreclaimerLatch;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
@@ -84,6 +103,19 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[1];		/* VARIABLE SIZE ARRAY */
 }	BufferAccessStrategyData;
 
+/*
+ * Threshold indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low threshold, the
+ * allocating backend sets the latch and bgreclaimer wakesup and begin
+ * adding buffer's to freelist until it reaches high threshold and then
+ * again goes back to sleep.
+ */
+int freelistLowThreshold;
+int freelistHighThreshold;
+
+/* Percentage indicators for maintaining buffers on freelist */
+#define HIGH_THRESHOLD_FREELIST_BUFFERS_PERCENT	0.005
+#define LOW_THRESHOLD_FREELIST_BUFFERS_PERCENT	0.2
 
 /* Prototypes for internal functions */
 static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
@@ -101,67 +133,51 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
  *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
  *
  *	To ensure that no one else can pin the buffer before we do, we must
- *	return the buffer with the buffer header spinlock still held.  If
- *	*lock_held is set on exit, we have returned with the BufFreelistLock
- *	still held, as well; the caller must release that lock once the spinlock
- *	is dropped.  We do it that way because releasing the BufFreelistLock
- *	might awaken other processes, and it would be bad to do the associated
- *	kernel calls while holding the buffer header spinlock.
+ *	return the buffer with the buffer header spinlock still held.
  */
 volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
 {
-	volatile BufferDesc *buf;
+	volatile BufferDesc *buf = NULL;
 	Latch	   *bgwriterLatch;
+	Latch	   *bgreclaimerLatch;
+	int			numFreeListBuffers;
 	int			trycounter;
 
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need the BufFreelistLock.
+	 * assume strategy objects don't need the freelist_lck.
 	 */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy);
 		if (buf != NULL)
-		{
-			*lock_held = false;
 			return buf;
-		}
 	}
 
 	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
+	 * We count buffer allocation requests so that the bgwriter or bgreclaimer
+	 * can know the rate of buffer consumption and report it as stats.  Note
+	 * that buffers recycled by a strategy object are intentionally not counted
+	 * here.
 	 */
 	StrategyControl->numBufferAllocs++;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * Remember the values of bgwriter and bgreclaimer latch so that they can
+	 * be set outside spin lock and try to get a buffer from the freelist.
 	 */
+	bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
 	bgwriterLatch = StrategyControl->bgwriterLatch;
 	if (bgwriterLatch)
-	{
 		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
 
-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+	if (StrategyControl->firstFreeBuffer >= 0)
 	{
 		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
 		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
@@ -169,28 +185,86 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		/* Unconditionally remove buffer from freelist */
 		StrategyControl->firstFreeBuffer = buf->freeNext;
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		--StrategyControl->numFreeListBuffers;
+	}
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	/*
+	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+	 * not do so while holding freelist_lck; so set it after releasing the
+	 * freelist_lck.  This is annoyingly tedious, but it happens at most once
+	 * per bgwriter cycle, so the performance hit is minimal.
+	 */
+	if (bgwriterLatch)
+		SetLatch(bgwriterLatch);
 
+	/*
+	 * Ideally numFreeListBuffers should get called under freelist spinlock,
+	 * however here we need this number for estimating approximate number of
+	 * free buffers required on freelist, so it should not be a problem, even
+	 * if numFreeListBuffers is not exact.  bgreclaimerLatch is initialized in
+	 * early phase of BgReclaimer startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
+	 */
+	if (numFreeListBuffers < freelistLowThreshold  && bgreclaimerLatch)
+		SetLatch(StrategyControl->bgreclaimerLatch);
+
+	if (buf != NULL)
+	{
 		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
+		 * Try to get a buffer from the freelist.  Note that the freeNext fields
+		 * are considered to be protected by the freelist_lck not the
+		 * individual buffer spinlocks, so it's OK to manipulate them without
+		 * holding the buffer spinlock.
 		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+		for(;;)
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.  (This can only happen if VACUUM put a
+			 * valid buffer in the freelist and then someone else used it before
+			 * we got to it.  It's probably impossible altogether as of 8.3, but
+			 * we'd better check anyway.)
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+			SpinLockAcquire(&StrategyControl->freelist_lck);
+
+			if (StrategyControl->firstFreeBuffer >= 0)
+			{
+				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+				Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+				/* Unconditionally remove buffer from freelist */
+				StrategyControl->firstFreeBuffer = buf->freeNext;
+				buf->freeNext = FREENEXT_NOT_IN_LIST;
+				--StrategyControl->numFreeListBuffers;
+
+				SpinLockRelease(&StrategyControl->freelist_lck);
+			}
+			else
+			{
+				SpinLockRelease(&StrategyControl->freelist_lck);
+				break;
+			}
 		}
-		UnlockBufHdr(buf);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
 	for (;;)
 	{
+		SpinLockAcquire(&StrategyControl->victimbuf_lck);
+
 		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
 
 		if (++StrategyControl->nextVictimBuffer >= NBuffers)
@@ -199,6 +273,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 			StrategyControl->completePasses++;
 		}
 
+		SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; decrement the usage_count (unless pinned) and keep scanning.
@@ -241,7 +317,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,12 +329,51 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
 /*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
+}
+
+
+/*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
  * The result is the buffer index of the best buffer to sync first.
@@ -274,20 +389,76 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
 	result = StrategyControl->nextVictimBuffer;
+
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 	if (num_buf_alloc)
 	{
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
+		SpinLockRelease(&StrategyControl->freelist_lck);
 	}
-	LWLockRelease(BufFreelistLock);
 	return result;
 }
 
 /*
+ * StrategySyncStartAndEnd -- tell Bgreclaimer where to start looking
+ * for unused buffers.
+ *
+ * The result is the buffer index of the best buffer to start looking for
+ * unused buffers, number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs.
+ */
+void
+StrategySyncStartAndEnd(uint32 *start, uint32 *end, uint32 *num_buf_alloc)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	*start = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighThreshold)
+		*end = freelistHighThreshold - curfreebuffers;
+	else
+		*end = 0;
+
+	/*
+	 * We need numBufferAllocs just for statistics purpose, so getting
+	 * the number with lock.
+	 */
+	if (num_buf_alloc)
+	{
+		*num_buf_alloc = StrategyControl->numBufferAllocs;
+		StrategyControl->numBufferAllocs = 0;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell Bgreclaimer where to start looking
+ * for next unused buffer.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	if (++StrategyControl->nextVictimBuffer >= NBuffers)
+		StrategyControl->nextVictimBuffer = 0;
+	*next_victim_buffer = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -299,15 +470,27 @@ void
 StrategyNotifyBgWriter(Latch *bgwriterLatch)
 {
 	/*
-	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * We acquire the freelist_lck just to ensure that the store appears
 	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
 	 * infrequently, so there's no performance penalty from being safe.
 	 */
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 	StrategyControl->bgwriterLatch = bgwriterLatch;
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
+/*
+ * StrategyInitBgReclaimerLatch -- Initialize bgreclaimer latch.
+ * This will be used by Bgreclaimer to wake itself when backend
+ * sets this latch.
+ */
+void
+StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch)
+{
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	StrategyControl->bgreclaimerLatch = bgreclaimerLatch;
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
 
 /*
  * StrategyShmemSize
@@ -376,6 +559,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -386,9 +570,31 @@ StrategyInitialize(bool init)
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->bgreclaimerLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
+		SpinLockInit(&StrategyControl->victimbuf_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high threshold number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.  We need to maintain enough buffers so that
+	 * requests can be satisfied from freelist, if based on threshold
+	 * calculation count of buffers on freelist goes beyond 2000 or
+	 * lesser than 5, then we set it to hard coded values.  These numbers
+	 * are based on results of benchmarks at various workloads.
+	 */
+	freelistHighThreshold = HIGH_THRESHOLD_FREELIST_BUFFERS_PERCENT * NBuffers;
+	if (freelistHighThreshold < 5)
+		freelistHighThreshold = 5;
+	else if (freelistHighThreshold > 2000)
+		freelistHighThreshold = 2000;
+
+	freelistLowThreshold = LOW_THRESHOLD_FREELIST_BUFFERS_PERCENT *
+						   freelistHighThreshold;
 }
 
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..826af06 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -366,6 +366,7 @@ typedef enum
 	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
+	BgReclaimerProcess,
 
 	NUM_AUXPROCTYPES			/* Must be last! */
 } AuxProcType;
diff --git a/src/include/postmaster/bgreclaimer.h b/src/include/postmaster/bgreclaimer.h
new file mode 100644
index 0000000..bbd6943
--- /dev/null
+++ b/src/include/postmaster/bgreclaimer.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.h
+ *	  POSTGRES buffer reclaimer definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgreclaimer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGRECLAIMER_H
+#define _BGRECLAIMER_H
+
+extern void BackgroundReclaimerMain(void) __attribute__((noreturn));
+
+
+#endif   /* _BGRECLAIMER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..f7a1631 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -115,9 +115,8 @@ typedef struct buftag
  * Note: buf_hdr_lock must be held to examine or change the tag, flags,
  * usage_count, refcount, or wait_backend_pid fields.  buf_id field never
  * changes after initialization, so does not need locking.  freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock.  The LWLocks can take
- * care of themselves.  The buf_hdr_lock is *not* used to control access to
- * the data in the buffer!
+ * protected by the freelist_lck not buf_hdr_lock.  The buf_hdr_lock is
+ * *not* used to control access to the data in the buffer!
  *
  * An exception is that if we have the buffer pinned, its tag can't change
  * underneath us, so we can examine the tag without locking the spinlock.
@@ -185,14 +184,18 @@ extern BufferDesc *LocalBufferDescriptors;
  */
 
 /* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-				  bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void  StrategySyncStartAndEnd(uint32 *start, uint32 *end,
+									uint32 *num_buf_alloc);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgReclaimerLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..edb9c52 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..46f6aeb 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -89,45 +89,44 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  * if you remove a lock, consider leaving a gap in the numbering sequence for
  * the benefit of DTrace and other external debugging scripts.
  */
-#define BufFreelistLock				(&MainLWLockArray[0].lock)
-#define ShmemIndexLock				(&MainLWLockArray[1].lock)
-#define OidGenLock					(&MainLWLockArray[2].lock)
-#define XidGenLock					(&MainLWLockArray[3].lock)
-#define ProcArrayLock				(&MainLWLockArray[4].lock)
-#define SInvalReadLock				(&MainLWLockArray[5].lock)
-#define SInvalWriteLock				(&MainLWLockArray[6].lock)
-#define WALBufMappingLock			(&MainLWLockArray[7].lock)
-#define WALWriteLock				(&MainLWLockArray[8].lock)
-#define ControlFileLock				(&MainLWLockArray[9].lock)
-#define CheckpointLock				(&MainLWLockArray[10].lock)
-#define CLogControlLock				(&MainLWLockArray[11].lock)
-#define SubtransControlLock			(&MainLWLockArray[12].lock)
-#define MultiXactGenLock			(&MainLWLockArray[13].lock)
-#define MultiXactOffsetControlLock	(&MainLWLockArray[14].lock)
-#define MultiXactMemberControlLock	(&MainLWLockArray[15].lock)
-#define RelCacheInitLock			(&MainLWLockArray[16].lock)
-#define CheckpointerCommLock		(&MainLWLockArray[17].lock)
-#define TwoPhaseStateLock			(&MainLWLockArray[18].lock)
-#define TablespaceCreateLock		(&MainLWLockArray[19].lock)
-#define BtreeVacuumLock				(&MainLWLockArray[20].lock)
-#define AddinShmemInitLock			(&MainLWLockArray[21].lock)
-#define AutovacuumLock				(&MainLWLockArray[22].lock)
-#define AutovacuumScheduleLock		(&MainLWLockArray[23].lock)
-#define SyncScanLock				(&MainLWLockArray[24].lock)
-#define RelationMappingLock			(&MainLWLockArray[25].lock)
-#define AsyncCtlLock				(&MainLWLockArray[26].lock)
-#define AsyncQueueLock				(&MainLWLockArray[27].lock)
-#define SerializableXactHashLock	(&MainLWLockArray[28].lock)
-#define SerializableFinishedListLock		(&MainLWLockArray[29].lock)
-#define SerializablePredicateLockListLock	(&MainLWLockArray[30].lock)
-#define OldSerXidLock				(&MainLWLockArray[31].lock)
-#define SyncRepLock					(&MainLWLockArray[32].lock)
-#define BackgroundWorkerLock		(&MainLWLockArray[33].lock)
-#define DynamicSharedMemoryControlLock		(&MainLWLockArray[34].lock)
-#define AutoFileLock				(&MainLWLockArray[35].lock)
-#define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
-#define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS		38
+#define ShmemIndexLock				(&MainLWLockArray[0].lock)
+#define OidGenLock					(&MainLWLockArray[1].lock)
+#define XidGenLock					(&MainLWLockArray[2].lock)
+#define ProcArrayLock				(&MainLWLockArray[3].lock)
+#define SInvalReadLock				(&MainLWLockArray[4].lock)
+#define SInvalWriteLock				(&MainLWLockArray[5].lock)
+#define WALBufMappingLock			(&MainLWLockArray[6].lock)
+#define WALWriteLock				(&MainLWLockArray[7].lock)
+#define ControlFileLock				(&MainLWLockArray[8].lock)
+#define CheckpointLock				(&MainLWLockArray[9].lock)
+#define CLogControlLock				(&MainLWLockArray[10].lock)
+#define SubtransControlLock			(&MainLWLockArray[11].lock)
+#define MultiXactGenLock			(&MainLWLockArray[12].lock)
+#define MultiXactOffsetControlLock	(&MainLWLockArray[13].lock)
+#define MultiXactMemberControlLock	(&MainLWLockArray[14].lock)
+#define RelCacheInitLock			(&MainLWLockArray[15].lock)
+#define CheckpointerCommLock		(&MainLWLockArray[16].lock)
+#define TwoPhaseStateLock			(&MainLWLockArray[17].lock)
+#define TablespaceCreateLock		(&MainLWLockArray[18].lock)
+#define BtreeVacuumLock				(&MainLWLockArray[19].lock)
+#define AddinShmemInitLock			(&MainLWLockArray[20].lock)
+#define AutovacuumLock				(&MainLWLockArray[21].lock)
+#define AutovacuumScheduleLock		(&MainLWLockArray[22].lock)
+#define SyncScanLock				(&MainLWLockArray[23].lock)
+#define RelationMappingLock			(&MainLWLockArray[24].lock)
+#define AsyncCtlLock				(&MainLWLockArray[25].lock)
+#define AsyncQueueLock				(&MainLWLockArray[26].lock)
+#define SerializableXactHashLock	(&MainLWLockArray[27].lock)
+#define SerializableFinishedListLock		(&MainLWLockArray[28].lock)
+#define SerializablePredicateLockListLock	(&MainLWLockArray[29].lock)
+#define OldSerXidLock				(&MainLWLockArray[30].lock)
+#define SyncRepLock					(&MainLWLockArray[31].lock)
+#define BackgroundWorkerLock		(&MainLWLockArray[32].lock)
+#define DynamicSharedMemoryControlLock		(&MainLWLockArray[33].lock)
+#define AutoFileLock				(&MainLWLockArray[34].lock)
+#define ReplicationSlotAllocationLock	(&MainLWLockArray[35].lock)
+#define ReplicationSlotControlLock		(&MainLWLockArray[36].lock)
+#define NUM_INDIVIDUAL_LWLOCKS		37
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
@@ -136,7 +135,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..b0688a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, Background reclaimer, checkpointer and WAL writer run
+ * during normal operation.  Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */

perf_read_scalability_data_v5.odsapplication/vnd.oasis.opendocument.spreadsheet; name=perf_read_scalability_data_v5.odsDownload

PK�XE�l9�..mimetypeapplication/vnd.oasis.opendocument.spreadsheetPK�XEU�zRRmeta.xml<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:grddl="http://www.w3.org/2003/g/data-view#" office:version="1.2"><office:meta><meta:creation-date>2014-03-31T15:29:55</meta:creation-date><dc:date>2014-08-28T16:37:14</dc:date><meta:editing-duration>P1DT5H51M49S</meta:editing-duration><meta:editing-cycles>23</meta:editing-cycles><meta:generator>LibreOffice/3.5$MacOSX_x86 LibreOffice_project/7e68ba2-a744ebf-1f241b7-c506db1-7d53735</meta:generator><meta:document-statistic meta:table-count="1" meta:cell-count="53" meta:object-count="0"/></office:meta></office:document-meta>PK�XEsettings.xml�Zms�8�~���wBpH��	t�^��4�I��7a/����H2����0d���%s��'���]����������� $E���'�V
��>����0���a}��v��	�������.A)�E��q.�x�mE�;H$�'HGy��7�����JX���Q���fJ�N��X,Ng'(�����ec����!����������E�9+�f����g��Vr�4����a��;Wk�K�*�mj��F���E:s
��Yi�v�<RI���Ck����^�\Y��y����(�[��4����_��fi�����Ea���Y��M��n�_HX���g���`����aby����	5�:��	��]��j@z����>Ew�� &(�=��D���	��%U:��2����!�g��oP��an���}Hzb��@��P��
=���Bef����w=E��B>�0��|�o���f�KA�./��C�0H��|����F)5�
�#aQ5����6 S04�S������M2���VG���Sn_Y�+�?4�V��q�$I�0	y�y�����_��������Sd��0�y���W���#��;TUA�K�}���E��I��}qf��9���V`�"��Q��������!�@i��} �DJwy�.�12�B��-E��I8�!�
��.#�b��\�)�I������5��d�$|j9^���>�S��~I����
���L����L �?����u���A�:|?�1>f"q��jzA����C����� d�}�p�k#$��d>&.Z=��XZ��f�(�����gDO.:9��SD^7M��x�v##?����IA	OR�5����{0��V�n�?����{����S������w�S_������F
��yQ5�@~\�^�j�����[�C���.��c����g(�+h�,��w����;t
<��I��7����l� A|4��
�3���F�#����y�AgR^~�)�z�N���Cus��
��A(@�)}6X�����Z�1�[�i��o����N��h�_7d6��W�/hd����/PKl&@'�!PK�XEcontent.xml�[���8}�� <�`XYW��'�A����$���D�������4�08_�EJ�)���/!�A���EU��u�,u����Y���qZ���;vF��q��|����G+}��g�jEc2O���H.���|G`��y�t1�X>/0�|�������EI��j�k���z���t��R������R�c���=+e�:ax7�X�BPu�U1�x�SkU@��z�b����b�����n���q���;��l��8>��K�V�$%����k��x(>��C��lI���`���j�X�L�a�6����g�v}&����y�����'�S�Ot�������x�>n^��eC}I�N�bF�����u��(P�AMv�s���eM{wQ}�� LS�/��8�/�����k��E�2�$���g<�~|P�����s��M�!>*�+[4����0�	gW:�)&�Y
/��[��Fd���!���k�$��������-%��:��r>�l�tH\J��%�f9d_F�Jp*��8
��(��u���l?l:�E�:���1��������|f��
h�I;V��U{�����a`g���1����Y]���%���eQ1*��U�hz�?���3kMrX,P'+���JIE�eE�$���?g�=�������r��w��V�����������7=(��e��;U�G/����d�A�o
��;���>�,�8�84��<�,R��q��5(	��K�,����U)�p�*�KF�{kI�0�t�����h"�,o�M�8S�58���s�X�;#:������K�q0���e�'VQ
�������}p{O���Qd��d���M'�pO�'��n������L�U�5�j�k��U*N�-�nJ���6x����	ZKV��(e�X��se���Z_b&�A%\,�N����R�T��R�w�`������"�	��Q_����=����Sa{�����p�9z�.���������av'I	Mu|T���)X�w��AN���[�;�\�q���f��~c�>�d����&�����-�`���4�Er{�PC���!D\=����lJc�v�����Ve�<�&�sg������.Q���]G��-h�����Hj��\�����
�K�Ky7:A�}�������y=��m����X��c.<�^�o+����K���
���"_�u�����]/�5�����W�����a���A����^[S��(�s����`����"�Q��p/�����Yz��i�!/�����1�������7$~_����d�^�-�I�Ts�]��@�I����p'�����H�|�%>�Np�5��<��<'\6�\eL����#+F����o��P\�9	�	�7N��I=&�r&z&{�S�������{Z�@O��7[Lzk��������_�	��eT��$�%�'������@�!��e�Z��O/�Co���(����Oe�r�s(��a��uj��*���z]��%��!����{Q������(:���:v��g74�����}���08����������$xZ�%,�h�&��'��n��G��#�lB�S=U�l8
A8u�(H�,Q'��!P����	;��p�3D��2���"e��q��Tg��M3��L;��,c���5�H��`�e��1n�a�k��;3����l�?���d�e�Qp��s�J�(�iEZ����>��-W�����t�ho���<��MJ���|G�4�/2�m'�Rj|*����K�����B�MZ��	C�q���Mrg��i��d����|W'���B�\ojj�0��D�Ij�������D���:3�����sF��9�p&4�g�y��Z��ix]��7������
Hz���������J6��w;���a����:g�l�3���t�H����GZ�Kb�����kv���;s����J��+v�PK�\j�l6PK�XE� �(�(Thumbnails/thumbnail.png�PNG


IHDR�d��?(�IDATx���XI�'J-A)*��DE��������{C��7,����<{��+*vA����H�JI8���l�%K���{���������;��fw����@ FcV�@L�=�b �����*B�=�b �����*B�=�b �`�bX,�f��/e�V��t3/��h����l���S��o|�R������>�'��������"O#������#	����G7PR������I��Ldy�)�p����"MPlb �U��1>������9��Zs�Y�!k0�
R�����L�B���Eb������Br��Y2-u��@�@������dE�E��5;��$`�bha��n���b ��#y�i���Z=�<�X��os�a�fu��zZ�U�����c������jN[���f��&z,V.��t[0�����6z�����M��^4����Qf�wp��h�������'�'��[���5;J����v�dqbT�Q�'��z?L�=�1[s���&G���[p�����gj��"�������oj
R7N����q��tTY��r:f�����c��a�5�kRWG���V��s���6�F�3"����/����J���`�bd�)���u�2�����I�=�����r��8���^����w����:r��>�=r�J�8|����^[�����.��Q5��u��@A��^�z�'G��o<���`�?�l��s^0����_sg��J�v�\D�oy
��:��c�����(��Nuh�38�7WyK��	��Z��
������H�7wi6���+�����uM�u�h�m�o;7�W�&���O�N^.��e����-PH�
�X\������T"��������l��U���OV�hY�{�������S�]�� 3��T��J\X��QuAbaZ������RI�8�����S��R�$1��4G���rqN_�S(o�-�x'�s���V�B�\�����gK��<��b���b�bG�\��(���.�����F}Z��A���������	��@=���\\�98�?�E��mS��qV^�����Yp~�9}�3�=�y�����n�d]wY��ts����P*c{8�r*��I�
������ST2�~�\��.���9!�����`���N��(o��O�o6�o���9���������y�3�'O�����w�S/�9��/���=��^�;yE|�V���<��.
�x0������~z4U�o_���1�y���J�d'�\�v��a\��5I,���n�_4@ws���a����������q�P�F���:�����4�����j�����w�,�����)lG�GG,u�2+��bu���j�8���'��*�t�gW���/��SIy�����"R\<v@(~���?�����9xL��g�>������WE	���Wh��?��E�}G�J��������[�uu��'>����w����G��S��M�I:�'���[~���:3�q�'bWJ�?�v(y��7}�g^��J0�n����?��g�H�QZ�vk�}�4��b�O9�m��e
�\�,�������?�0�>X������x?{��&�2����l�^���mj��3	���p0u`��2��'^��c�W9��n$x-9��������������<2x�I"�[�]j�����vm0�����|MTj8��|�x�j�n'��;R�����c���+WE�T���������
�~�6G�4���[`�~��3k���a����_E�+�<��k#���M�H�����E-\��3�Orf��wm��4e�����^wb�������g���h����{�L:�����'����O'��mT9�!����%��\���P��i�����2�������j�����]���y�P�>��%�@����.�aZ������V�e�Z.���w"�����f�O2-�-8�E+e���.����{��im�uC��C���~YON�i�.Ar	u��>�l:�9T������^���?�|�����N�P_V#��*�|�vFm�w��~
�8>Sn��h��H�Xf��q���c�G!����>�3�F���SO�u�wKT�_s{e6���T5nh��Y�F�.+l�6�p$
�)F!KQV��+K���g>8�w��S�H�>�\����������OAU#y�T��n�<������?�xw�Tn��l�pTn��)L��(_N��1*�Jaa^�]�%�����WX��y�b�!{R�GFj�|w�)�qg�heyb��^���F-����q�a�5N�mU�[$v���l��G�������o5_�������k�o��`�)q�#��N����s�+���W�����a�����&j�t���q����F�j7���iJy���e�F.�8�����^)3���#�f~����s���d7xi]��1����z��D��:�����E��Jk;�tU��dY��\}���hu^�Ly38��zs���o������=��,��.J�������������@r�]K`�U$b9p��_SV��Qu�����G�E�eC�$��j�������\�C�a�~8��Q(�:�v�r-m�\m���,~��V�g+Mse���"9:�Xu��#>����]�xt+E6c^�����Ko�&����S����jyP�_M mp���&���[i�n_y*���M|Sp^��`��d�����4��+|������H�%��2�
9�}��(�CFx�3���s|��/9�g�D�r���(�9�v�,�H|!�!�e�,�����\k`�M�N�*��G�'�y�%�gJ8�fq�U��7?�HH���/�u����#�x�����psC����E�?���.�Ggg\.w�`w�
�����������,Pa����;r�� �8��5���C��sD�.��.E	,�����f�o�t��M��zY:8j�T�c��^o�>�6���Q*rM:�h��(��nQ���&3���EU�I1L1v���U�S��.d���+��"M�~��)��D�U��:#�QBC���]I
?���`�RB����#�h��P��F�A���q�A^F�_c������9�T�����q�A0�>T���������0�=�Z_��i�bL�����8�:_�\6��N����P�+��<���5��amA��&|�_�}W0Z1�~����%����:C'���?��l���5#LI�UY�@�U�����L@��\C��Q\�M��f|�o�O�8�@T�U�1o��0A�@�PU�g�
�q��<��y-���)�����
�W��s^&��9�C�IF���C��Z+��n�_f2C�`$�\���"���f)�B���w*��fH����B���A�CV�#��0���������'�H0{��b�'�Y�$���"y�
������@>����R29�h`�b�B��
&6��X���b �T�UL�9x���5�^n���)��<��)�!�f(�=:C�C7��gd��S�����8u���N��&�*dl}�,��Q^1�h<����S��=�?"aKg�
C-t�J��t����#��J
��G�c��8���t<�A��������{��&b���U#w���|�n�7,����~�tG�^����h���o���������C��H����&n��b�h���u�

�N�O|��S�t��o�����I����P���sw�G������z9�_w���/���Ml�*DgAF����~�3�J�rI�X^���"���MY�Gzk2y��[�R��h#����Er�O7��B��P�$9RK��TZ�+z[`[�Y����Y����!P�Jo���%����;���J^�U����W�'{�h~J�aHs%9yfU��h�V��^dXxy;�������y�"����������PNK�b6��L ����B� �\K>��+��:��
_�P�=��� U���o����S���m�,�f�������Z+O]��R��}�"O�T��W?6������|���}<�*�Z�m
T�\���7�����}d��[�c��]�w�������n����>�=^%+���D�+qN���E��uIS�]w���X��:��
�2fW`�m��#����|���7��K���O�:)w�<���_\%'����5z��<A=���K!H��[6��Y���4��]��5E��~y>���wg$���>�7��d�f@�ag?�c��Q��c���/�r�xb��Si}�]M.�~��{-�&Q����'L��������E�n�l�
��xW������k'���}���l�_��zE�O�W����bs<����*Ovx�>�g�@�o�������8���vl�q;���G�G��	��H�3K1ssk�|��w����y!7�hlF�na��3v�qc�E�����q���)u��^3��}���|`m]4���C���7����	��5����+�mx�[���j��o8�e�
�������&t�dc��9&.�@��.��/?��:�{�9�n��v����d�Hfmy�������m�_j�Q7�rR���'/�w����;�����pu#"�ZOo�w}�Z.J��������.�1�)[{�D������4�U4����_���>��I���������y2{u�[�	G�����������+���:v���9�G��V��i.���8��|���5�����C
�Y���Vp����������-�]�I.����Q��s��*��t���\���U�e�<~�_^���,5��^�%�,s��8G.����X��K�������*se���\�'���6	�����^�W���4�Ks���	����< ��WP��p�VV
���|���S����k[������+,���XRm ���T�8��W]2L0����������hb�����L�,P���J���"��e"��Y@�����5���M�+��m�+|f)&`l���o�73C+�Goi�T��w�L5��L����4�������v��+�%�1��)��jfO�Fu5��!��������O����zx�:~�k������[\[�~�E�����M��7v�U�����zc��y�S���u�p����� +�8K��W���y�����p�e�Hq�O��v�<H;����y�
]��?���>fyw�����U��>��ma?;z��<���{����h�K��Kl�
&�!\7�����?��92���	����l�{���|�j��c����� /SV����N������|{9:t����U<�l�P��,�3�
��)I�D�36�0%n��=;6����q�[���e�v�����j��e)>�v��i�_"V�^I��I�S������\sFz��=����^�f `sF�Daa��UU�9d�����w�D�Q����#������B�����9�q9o^�+���=�r��*5�p��������B2_y��;�d2���D���l>/���A�$�D�g�����I�yUg-a��E���J��Trn�����K�~���"'+��r�i��F�������;��IbJ��������������EU�kF8f��i�Bo=2�����^fU��PC[��pE�r�j����?v���OTh\l��	��S�
���G
Q�PY��Z���M��qtv ��^��pZ7�3*0����
�,��.�;D[���H���R�T��0K1��{GM=��+���������2n���}����#;\�N�<tb�n�$�M�C��gl�:�=z(��h�w����9��:���mw������4�z����;W>Y0��d;
�������g u�35r��'F{.XBY�2K1�����X�������F����H-<���>[���,������,yu*z=���Z�O*JI���|�c�������<>��(�PY����R�r_U��><9��{s�2u�U��`K�'�@�i�����������e����j�N��)�WS>O/��.@=�Z�T��V����,���{�Z���f�c��NY,���&��p��:`xR/�L_��^nVM����BK77�O���v^.�=;�Za�M�:E�����-Cf)FI5�Z��a�o3c��xrf��1G��	�A���vZ3������o#"��\�Z8�����S9'��\��+|&y�?������:�/cIgU�,sP������?�ytQ8��'[�	qo-8������6��m&\�Qu;���� +�����������P�&����������[�
������`������;�J����f�Ln2U�����������k�|^�����k}#"�����4'x�[!bBg�Y�,�Z!�Z������;|]�������[�8?�������CF�J�fe��������J���+��Hs��\���������������!7SE���|��@��[����&���]����>*_�q�a��z��6��m�3W��k��8R��g-R�5s6��_G/�s0m�/_�jg�/�^YU���M�7���/��\��%��(��Kp�z��Ml��{��X�!L��sFl�5�a����4�������'.��A���g�e�1/�3�<�'��!yU��.�<�d�����*v�p��fS+���Oc����R���Q�E'��]<������>���c}n�a�����~�r@]��j�3v�
�B�������,>:�{p�?��,z� �f������z���1�yIV��O,T�u�/���
����v�(�)p5�}����zV�'3�V>������+~���n/��w�Kg�����@a�r��	������+��1z���������4p������X,s�j��B�g,��f���d�� � qj���y��������CU3����C�b}�;A"�<:�*TSh��n���wOn���:�|V��]�O�,�Z���mA�a
}�����5�(�me���5s�!���v������?o�IpoG`�bT^��aY��I��w=��[-V8���l���/�vR^��g�����"������x�����x����e�&�G�����{�����������@]����W�lpz��z��U�I��8�%x"�C;\�%byE-W/��2I���NS��G�(�����e2�To�r���|������w��	d6!y����z'X(���m	T
G{tE	db����8-�Gl�
�������*Lm��K~��T����$������WD���7�X��3�gU�R+~f)�.pF��L��^��~��p>����Vs�X�Y�q������zr��=a����eV6�����jM�8��mSD��(	r��(�O��e�Ht���@��c��=���=J�6��D!�;8�E<4 n�Tgyi����@��];��(�
����<z�UM%�<*���2������_<f-d+���'h%����.�EE���<:Km�y���`����&8����A������"5�:&�2���BP�����?)Q��=����[�?{�j�_0�����L���T����yr��(*5���O`��P�B�C� M2���:�cy����4M�f��7�K�<� �_z'������M?��t��R�W��~V���G�S��qG��&�c�b �*B�=�b �����*B�=�b �����*B�=�b �����*B�=�b �����*B�=�b �����*B�=�b �����*B�=�b �����*B�=��AO���+�fJ���F6- ���#�&H��*?�!��������
�p�F�l��G/S���w�����1�+AV�` �q�pb����k/����e�����P.e�N1��.�M�[vW��9<K8�0�h��$d���9w���L�����jjYCJ��/~�^�&c�;S/Y��Z��"�w
�T�P1z��d�'��F�����;n� )<:j��H3�� ��1��@��R��A�!����M���Sb������>/��R�d�0������Hv���d���J1�N�: /JL��s�`���A�RX�@_����PhSS���+]Zh��B5x����B�����ct2���PR���6��Z<����'�MMmC�� �+�{^!�_���T	dW��!�,�5+����c�!�����$c���K1�KFqW��$#G�3�^��N}8L��.-Y`��J�w]�'Y�|?��t�
)t}�
�Q�I������Rt" ��c���H�'i�Da�B�|�|K�,����/��|�cC��j��������F��������*B?�����E�r�[7S���!{�N�AW�p+�-�m��Joa\��/"Q����cha���A������K������TM�?
�!~���0
W����.��]C��G���4���V3T��YyM��K$��(>S"��}0���IP�=>����$�,r)���Q��<��~��0� 0��*�a��1��E��c�����A�'A{%�K����*���oM��AL��B��:cG5�h�024��M���b��D�k��3��	b�2�c�\���Fb���`H�_��!�8��I|�D��;���67K1��P�B!s���������|���p��Z���Q���5j#iJ���y�QI�o�����X�q���������$�����&P��T@G���1^%��L�a�������2'\�)=���aL3Dan�"F��e�>b�!��b})x�&Q�`��������7f�,B*v����Uz@�@�=_R`0�@���=_S1f�<_�Me�9z�����^_�� RF��������=�����.x
}Rx��tO����s��[����_Y|`������|MB.d�/a�R����'
���������\4������f��v��/m��L����6k�y, ���� ��� �=_��
]���.��1&0�4����*B����!z���`���o��B�������VL��9�$5��/�/�R?)j���L�����F����/�mm��*E��b0����'�b��[�6%]���"�P���D���T��J��I_��|M��4}�2��	[�
4�KLq��e�V4i����1��-I0����*B����!��TJ �Jz@�@��T�P1z@�@��T�P1z@�@��T�P1z@�@��T�P1z@�@�������8K�]���s����!�w��U��-b�W��Y�/��|�R�?�'L���zG���d�#����k�O�1	09~JHX�Em`��F����n�J���� �n����P�@vh|���1�4��$>~
�������;�0�+e4�:�����QJR|�0cB*�Qy5����d���F�(�H�
������?	�V	�;�"������#R� ��������b�B����D!��hU�����f�o���
+���d�a��7Od'H}.�����e�^#s3!�����0
�=�b ��=�&
����@��!k(|j�M��-�L���r�P*�FA�
�]Kf��)�3Kf�`��[1��0r|2L��4�z%����~��ai��/���?�B�E�������6`�k($Kx:���P�>�#����*m������l-��
���d������?����g��)uo��@���]���
!�����J �r�
f��Y�����Q7�F>$��@�% \�_OX�YEc,QZ����z/cv�>&`Z��,0cf����������L�M��?p�b`0d@�B��|���Xt~~�	�!m�(�](���b�������z����!�a5��B����L#�Y���]C��G1��F��/b��[���"�~�Kt�2�[��h�'hS��U�x���`D��ZP.��CS�b$~0�]Z���?
�j�Q ��;x���Q�2�M&�v�`�B�=J�M�A�����
�b�k�_b �`��b�!�`����#���IP�=>����$�,r)���Q��<��~��0� 0��*�a��I-_�!�J�b�����/�^�c�`�6f(�4��1!�
	S�`�}%2���1�����4<C��"����O�&�At���tsfT��!~I���B�Q��T�Z"�B�|H�4eE0���b0g�?(E���|�92j�:M)^'�Z������b0�2�A	5T
�/c!�0d5M�a�����������)!�_j�WH0�(0�_ix��a#m�r�R���d���5����2r�,� 4<�3����	{�z���K�D`0$�3S^��IEND�B`�PK�XEConfigurations2/images/Bitmaps/PK�XEConfigurations2/popupmenu/PK�XEConfigurations2/toolpanel/PK�XEConfigurations2/statusbar/PK�XEConfigurations2/progressbar/PK�XEConfigurations2/toolbar/PK�XEConfigurations2/menubar/PK�XE'Configurations2/accelerator/current.xmlPKPK�XEConfigurations2/floater/PK�XE
styles.xml�Ymo�6��_a(@��);�������KQd�`$J&J�I�q���H��Q��tK�4"��=|��;97�O���TL��`1��#�<�~A������$,��XDeFs��>r�f`��u5�	J��QL�s�Q���Z4wF�&zm]U#v�������IO56��-y�������$����6�1��Iq��� �uX<q��;��5���a~�����j��v��<�(%��8��S�L��|�6��L�g�MJy�=R9Y�I/���
 �]���j���k�N��}:"s�#rr�Yp;U.���r7m3�w#���w0i��~�+�M�e�-�"������M{!��j��n�.��
W�
��,� ������#�#����D���Iy��f��+���BjO$�~��:KTw:��G��:h*�x
t.1[84h����u����-���R�z���=6s�\�p!�7v�J,��+	��r�����F\mo����g��i��d&�p�$c���i��9���J��DLe��LGp8�D��w��d��v�����)5�wO��(:���C������,^�B�7]`5z��Qi�}
����'��;K�z�j'��&��u��V�)�F�<p��H�JR�P!!��f��TS��UD�b�4�M���:�N���Z���I���<-I
c4��(s-��o���!�������@���}���zU7�v`mS�8}�F����w�����&2�oo�^��V*M��A4��2�#�k��]�R�Hn
����J)��<�Z�W�C����P�M��eI�������M;b����^�=��"#;��e�������ep~�n6��,l[m�7$i�~�(�X������$	�����4�-����Z�J��>���UM�����|�"�o6>�HQ�ZQ�J�3��,�Q��:4��W�3���P�n��}��=U����m�_�P{g�R����&�En=.s����AF	�������%�#�:}�S�{�9(Kw��@n��+����ik��n�����WJ���_o4P����4G
z)S��+r�C�J��V�%����,� �i�M�!4h���g	�d����2���Y+��"I]8�H*E�.��4�Wa�$�t0��Inxocj�6����E�[{�+�E0��2�D�d��M�%�����`w���K���ufn�]�p��i��0�)�p���������G��y�
}w���T�y��	Pl���L���;3�(dl>�Y��V�������~Y@A��#�p.���>�
�����B����@@�Oi/Oz;���������wU=���P�V����5C�s0p�U��7����_�(��������G�����0�b�t��^����������v�n�_c�G�-��,���9i����(��)��~j
�!��b�^V8�4oB��W=!Z�ZC�u�CEG;*��<�.NV��B�5Z^�0��;k��~��0\�oOz(}�������&��YmW�&���r�v<�'���PK`H���2PK�XEMETA-INF/manifest.xml�S�n� ��+"����	5�a����F�	��j�~I����I�����=c��������El��|���������E���M0�: ��������QGC�4������hKd�5_�J�2��Uu����z���%;@�L�C�F������ju�V��������t��$/�]��N��PB�e���jl�6[6r�}$)�8)z8����������A%�o�%|�q��C9e��r��5b����u��2�����l��}_1*8�������������7��7�~PK����PK�XE�l9�..mimetypePK�XEU�zRRTmeta.xmlPK�XEl&@'�!�settings.xmlPK�XE�\j�l6-content.xmlPK�XE� �(�(Thumbnails/thumbnail.pngPK�XE9Configurations2/images/Bitmaps/PK�XEW9Configurations2/popupmenu/PK�XE�9Configurations2/toolpanel/PK�XE�9Configurations2/statusbar/PK�XE�9Configurations2/progressbar/PK�XE9:Configurations2/toolbar/PK�XEo:Configurations2/menubar/PK�XE'�:Configurations2/accelerator/current.xmlPK�XE�:Configurations2/floater/PK�XE`H���2
2;styles.xmlPK�XE����AAMETA-INF/manifest.xmlPK6�B

#25

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#24)

Re: Scaling shared buffer eviction

On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have updated the patch to address the feedback. Main changes are:

1. For populating freelist, have a separate process (bgreclaimer)
instead of doing it by bgwriter.
2. Autotune the low and high threshold values for buffers
in freelist. I have used the formula as suggested by you upthread.
3. Cleanup of locking regimen as discussed upthread (completely
eliminated BufFreelist Lock).
4. Improved comments and general code cleanup.

Overall this looks quite promising to me.

I had thought to call the new process just "bgreclaim" rather than
"bgreclaimer", but perhaps your name is better after all. At least,
it matches what we do elsewhere. But I don't care for the use
"Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
else "bgreclaimer".

This is unclear:

+buffers for replacement.  Earlier to protect freelist, we use LWLOCK as that
+is needed to perform clock sweep which is a longer operation, however now we
+are using two spinklocks freelist_lck and victimbuf_lck to perform operations
+on freelist and run clock sweep respectively.

I would drop the discussion of what was done before and say something
like this: The data structures relating to buffer eviction are
protected by two spinlocks. freelist_lck protects the freelist and
related data structures, while victimbuf_lck protects information
related to the current clock sweep condition.

+always in this list.  We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by background process
+reclaimer.  The list is singly-linked using fields in the

I suggest: Allocating pages from this list is much cheaper than
running the "clock sweep" algorithm, which may encounter many buffers
that are poor candidates for eviction before finding a good candidate.
Therefore, we have a background process called bgreclaimer which works
to keep this list populated.

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section "Background Reclaim".

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with "bgreclaimer".

+and move the the unpinned and zero usage count buffers to freelist.  It
+keep on doing this until the number of buffers in freelist become equal
+high threshold of freelist.

s/keep/keeps/
s/become equal/reaches the/
s/high threshold/high water mark/
s/of freelist//

Please change the other places that say threshold to use the "water
mark" terminology.

+ if (StrategyMoveBufferToFreeListEnd (bufHdr))

Extra space.

+ * buffers in consecutive cycles.

s/consecutive/later/

+ /* Execute the LRU scan */

s/LRU scan/clock sweep/ ?

+ while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop
start and then just decremented. Shouldn't we loop and do the whole
thing over once we reach the high watermark, only stopping when
StrategySyncStartAndEnd() says num_to_free is 0?

+ /* choose next victim buffer to clean. */

This process doesn't clean buffers; it puts them on the freelist.

+ * high threshold of freelsit), we drasticaly reduce the odds for

Two typos.

+ * of buffers in freelist fall below low threshold of freelist.

s/fall/falls/

In freelist.c, it seems like a poor idea to have two spinlocks as
consecutive structure members; they'll be in the same cache line,
leading to false sharing. If we merge them into a single spinlock,
does that hurt performance? If we put them further apart, e.g. by
moving the freelist_lck to the start of the structure, followed by the
latches, and leaving victimbuf_lck where it is, does that help
performance?

+            /*
+             * If the buffer is pinned or has a nonzero usage_count,
we cannot use
+             * it; discard it and retry.  (This can only happen if VACUUM put a
+             * valid buffer in the freelist and then someone else
used it before
+             * we got to it.  It's probably impossible altogether as
of 8.3, but
+             * we'd better check anyway.)
+             */
+

This comment is clearly obsolete.

I have not yet added statistics (buffers_backend_clocksweep) as
for that we need to add one more variable in BufferStrategyControl
structure where I have already added few variables for this patch.
I think it is important to have such a stat available via
pg_stat_bgwriter, but not sure if it is worth to make the structure
bit more bulky.

I think it's worth it.

Another minor point is about changes in lwlock.h
lwlock.h
* if you remove a lock, consider leaving a gap in the numbering
* sequence for the benefit of DTrace and other external debugging
* scripts.

As I have removed BufFreelist lock, I have adjusted the numbering
as well in lwlock.h. There is a meesage on top of lock definitions
which suggest to leave gap if we remove any lock, however I was not
sure whether this case (removing the first element) can effect anything,
so for now, I have adjusted the numbering.

Let's leave slot 0 unused, instead.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#24)

Re: Scaling shared buffer eviction

On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I have yet to collect data under varying loads, however I have
collected performance data for 8GB shared buffers which shows
reasonably good performance and scalability.

I think the main part left for this patch is more data for various loads
which I will share in next few days, however I think patch is ready for
next round of review, so I will mark it as Needs Review.

I have collected more data with the patch. I understand that you
have given more review comments due to which patch require
changes, however I think it will not effect the performance data
to a great extent and I have anyway taken the data, so sharing the
same.

Performance Data:
-------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Common configuration remains same as above.

Shared_Buffers = 500MB
Client Count/Patch_Ver 8 16 32 64 128 HEAD 56248 100112 121341 81128
56552 Patch 59389 112483 157034 185740 166725

Shared_Buffers = 1GB
Client Count/Patch_Ver 8 16 32 64 128 HEAD 56401 102557 121643 81686
57091 Patch 59361 114813 157528 188209 167752

Shared_Buffers = 14GB
Client Count/Patch_Ver 8 16 32 64 128 HEAD 60059 110582 152051 130718
97014 Patch 61462 117377 169767 225803 229083

Shared_Buffers = 15GB
Client Count/Patch_Ver 8 16 32 64 128 HEAD 60005 112928 153650 135203
36343 Patch 61345 115569 168767 226150 36985

Observations
---------------------
1. Performance improvement is upto 2~3 times for higher client
counts (64, 128).
2. For lower client count (8), we can see 2~5 % performance
improvement.
3. Overall, this improves the read scalability.
4. For lower number of shared buffers, we see that there is a minor
dip in tps even after patch (it might be that we can improve it by
tuning higher water mark for the number of buffers on freelist, I will
try this by varying high water mark).
5. For larger shared buffers (15GB), we can see that there is still a
dip at large client count, although situation is not bad as compare to
HEAD. The reason is that at such high shared buffers and client count,
I/O starts happening because all the data can't be contained in RAM.

I will try to take some data for tpc-b load as well. Kindly let me know
if you want to see data for some other configuration.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#27

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#26)

1 attachment(s)

Re: Scaling shared buffer eviction

On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

I have yet to collect data under varying loads, however I have
collected performance data for 8GB shared buffers which shows
reasonably good performance and scalability.

I think the main part left for this patch is more data for various loads
which I will share in next few days, however I think patch is ready for
next round of review, so I will mark it as Needs Review.

I have collected more data with the patch. I understand that you
have given more review comments due to which patch require
changes, however I think it will not effect the performance data
to a great extent and I have anyway taken the data, so sharing the
same.

Performance Data:
-------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Common configuration remains same as above.

Forgot to mention that data is a median of 3 runs and attached
sheet contains data for individual runs.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

perf_read_scalability_data_v5.odsapplication/vnd.oasis.opendocument.spreadsheet; name=perf_read_scalability_data_v5.odsDownload

PK\"#E�l9�..mimetypeapplication/vnd.oasis.opendocument.spreadsheetPK\"#E��DSSmeta.xml<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:grddl="http://www.w3.org/2003/g/data-view#" office:version="1.2"><office:meta><meta:creation-date>2014-03-31T15:29:55</meta:creation-date><dc:date>2014-09-03T09:48:56</dc:date><meta:editing-duration>P1DT5H51M49S</meta:editing-duration><meta:editing-cycles>26</meta:editing-cycles><meta:generator>LibreOffice/3.5$MacOSX_x86 LibreOffice_project/7e68ba2-a744ebf-1f241b7-c506db1-7d53735</meta:generator><meta:document-statistic meta:table-count="1" meta:cell-count="261" meta:object-count="0"/></office:meta></office:document-meta>PK\"#Esettings.xml�Z�r�8}����N���+0�f��l&�Ifv����*��%���W�������v�K:����������*�"�Z���U��G��k=���Xz�]�tJ]�=tC��JPJo�}�K;Z�Z��6I����V����1{w��=yf�?u��R�]�-������Y�qqqQ[�n����tv��h��(D|dD���5��V-�lU6J���i��v�~���F@�R�
|c����Q�ki������jV���3�T�	��2���.�U�)WV��n\�^��}S��]����zj��8o�:�����������F�X��O�*�<���d?����c4���SS*������8��k�4���h�Rt�Hb��'���G9"���(�QR���[��/��t��D�V6�k�rE�0��F������ u�B�i��P����	�}W���G��R������`�����%'n
��D�R�'w2��wD�a

i�HXG���zV#���)z'#�3��'A�>Ad@��S"�d����7G�������5���.q���B�@�&D�����s���`������^�)�oJ�L��c�|�k����}�������!d�:D�"��$S��9o6���=��`�k"����G������!�@i�)�F~	�n��Y�d����[���`�#"�]DfE�7r���&aR�;����������x^x'�xtAe���'+�5t"��3����s������6Mj�	��ID� �8����-o��9[���.S�J�Z�����3����L��
�u��H�����j�
�PL���� �2�wtp�[�����6����6�F~
�����$���
j�1��`F�����r�����;����/_�_���O}u�:;$�*���\7���GSw�}���*�*�&w���?��!CY^A[f��������Y�S�)�O������A�V�G3�/!�?��}�i�>��
I���t&���~��{�7�t��8�7�y�����a��g�e{������5��N�f����D���1��qCj_{��Z�_/z�PK!���+�!PK\"#Econtent.xml�]_s�����(3�d��H� A�t�����97����o74	Y�Q$��,�O����$]����M���l1���]`	���K��OW���U��h��8K�q����h���_l>����^�q����gS�7v\�
�k�v^��wG�Y��Q���y4�i����Z�������R7�Ygu%�k7����,e
���{�JX�N�h�UY�B�������U���Z}ZFM�f�U��_G�I�����|>?������a�����J��U��J�������'�p);M��>)���������4Q�z�e%j����[A����./:����-�O��s?S�fW�I��B]w5�-������s���_U��uIY���*-;?f+��E�2U*�dW����-���w�����&�)GY�j�b���@���-.e�_�H6D�E���+�:�Z�?�>|�'b����i^7Q~�2�|	[��
+QU�j�qw�o��l�4�l���w��U�ls�����L��;������Zu�TdK��d�#�JQ��I�Lv{ZC�A�(�SM������[q�C�x��5r�uM�Mm���P����tQ�6������r��.�a,��Q,�D�Y��U�W��K�G�w��J%#�Q-E�iv�����=�B���@�i��*7D�������+��wW��J�lC����+��i]o��2]��Tm���07���
V�9*���u�����]���>��������e���Ckn�,�������4�U9�^�������la�"!{�l����E�R��IEm����JD_�s��e����4�C9!��Se�f�v��m�U�|�0��[���'"��K����n�g����I�Qf��M5��M�nzB9G��>�����f�����Hv7�����8�y���2�v��^��Y��=��@mP��u�E�{��������AV��ywSc��M�l�������w,������J��r���}E����9q����"K�;G����[��t��`��=�m���[(����"��W�`8m��Y�oS&�7���������Ww���_�tl�����j���L?r�"�"�b��.�n����!����R����YY�����C}%�
��F����+y
0�:1\�8/�������/�'B4�_�C���^�g�F��S-�
wuF��(a1�� �&�����a�<CF������O���`YD����%�V��	k�����N9a]3�� ��TM�.���]3��u���-��(�	��.A�n*�W���rj���.#x-������~W���bV��9e)~�f�'����T��&J��P��c0�����u����?���t\��a�Y�����'Q��a,�I��(���*����Y#����*�=������~P�P�����UOD��,���R�R��P��#�v����&����(f���jEb+['������i��P�}��="��o��Aqd��|�Nzq���J��vR�E-������gZ�j'�����e��������6�/����*�x��L&��4�E�Nj�,�";�/5\��J��������E����8K�2����7������0�M�ic1��I���������"�r>�EU���o�>�a��(n�
�/��`�`c��������8�~[g��o���}������=��h�=���*�Z�������USrS5%�U��M�����D{��E��$��$�vQ[�q�wq_�C��������^�w$��[��Go�qi�W�@���'�����:�"����=Ibd>9��|��O�������j|��%�i��x�\����<88!
�y���&�0���b�Vp�o����x�k�������0������~�!����C�!���,�/�2��B��KvRJ��	x�����������'����#=	q_�Bx��E��i�B���������Lh1�	��aR�_�����������Q�s��(����1��f����-F'�GLR{{��Ccn�	���(�;�<p���02�(u
FI��(z��(��E����Pu��|���G��q��x��Y���}<���������������tpvkr�C�'$��qp]���{�K�����A������o>cz~��O���y��J�������;�c�Sbl�Q�5x�w������8
���|������!����� 6�<fL�$�e�����}��C=�D��Yi�s��~U�_��W%0F�>���ciH��l� ���=mRa\X�P=mRa\8�i��������/�Q�������s�]�q�Q�3�1�Q741��%���\�Ql/�~h�B��C�o�+��(�0�/)��([g����
<�``lF��g

����7�v9}��}��p��?)xr���}��� |��s�	
"o�f~DL���@�$��-F�
�}# n?X�R/�qm�bo�yz�����s}/������XfP�w��S�wN��	��w����y�
b�w�X*V�w�A}��y��1)��;���sO���'!����D�*�1�)IC�H^�{�4t����w�k,b(������0��
�+��%0
;�u�������k� H���`d6��Q�����������U�%0
;��F
E>�u�?9j16��1�Q�Q3��NA��L����9~����%��KA��h��}B���~���-W�P�{�yO&�t�8���0;��H�T�����q=���Kx\���G*bR
9.rB������r�:����:�o9(�jB8�g/
���^t�q�)@l�1nf*12��c|`�0���c���'1.�8�����d=��D8�L�~q���C�8�/N<�m��\p�I�����g�"�5!p8�MP�B���h1�	n`u-~	�Bv��g�/!2�j��(�<�>5C����_i1�	T�Q����K`r���L��J	���T��g�*�<@�����bdR�7�I��Q}*B��`F������'��~����2�7������r�0C���e
�l��7D1N�T�+�T�FV����NEp��� 2�����d���G��lU�y	t�e�g����q1=�1��`3���lHal��?��0��GM�I������e|&�}*�q�"��O����~q��n�`�"�HE�<���q�������s
#��x��eG�q=����K���'�D��A���|�;>�����A��m���MI�3�A�.��u>I�������/$DOC �$F�S`�b��O����Ib���P_xU�O��e����We%�B�Zj.���b��!����Er}��"�M!Z#��}�PKr���
E�PK\"#E�O,�g'g'Thumbnails/thumbnail.png�PNG


IHDR�d��?'.IDATx���w@g��(@$(CTPQ����jw��Q�����Pq��*��j}U�FmmQQ*hT@!(	$���.�Or��>�?��{�y~����������R���0��aa � ���`R�H1v � ���`R�H1v � ���`�����e����4�&��l3/��h��P��!&0��9���C�w�*I���IU�D�v��if�D��h�:��q�Z����:������H�nR"L"`��I����0eE��i���pW1�sc~&Os�w1RwF-�9��-�0��n�PRz�i��P��s�I��^��B��Uxn�<+B�����H�b����hw�dNZ������
��w�����|����
���u�1&�p�P
��C���{�}�E����hb��9��������@�C��y����yj9����y:
��k����2;���b&�o�\���F��_&U�����N�	�c3>�KY���B�<�h��=������&�d*�2:zl�:���Y�d����V��5�q��`���]RT+1�I�y���Tk=z��\��N���=w]J��sVLhk�Y�h���#KjlQb���MLgT�<m��?�6��[I/}��E^�_��Lo���8(yG7Y�Z�&?�u:p��m�&D�l�%O�����Q�t;���Y{wN���bS�F�M_}�hp�������&O+�?�h���Cg�-h�}��?n������;�c��yoP��N��X�R~?�������s/����0/��XS�g����T�*��0��g�q�<Z������g'E��q��_����2|�������I�9��������R�L���x;�	T�ii�5�9[U������x���������pgf�����
5�gR�g�����M���g���q�k�j���U���"wo��'Ox��=�����R`'Z+��oEb�J�������t����D������e�qK1���/�@q����^�=m��Q	x�xq��C�����jA�!������E"rB�z �����/�$N���N�2&iZ��wO
��u5k��c�V�lv���������[����/32��j��$8���Ie�������^��P� ���r45�D�H�ay�wc�*��u"����#��D.����.n9�c�2��������8�����v���W��+��<4i[��M�D�������<W
:�$e��H+B#��������=�n���� /a�����~c#���&����n�i���M����E�_}��Z�i���_�U3t�O�Y��Yp�E����R7}����m�V�������B�o�bT���W{m����l�N�9�K5!��_K���)����/>R(�w�z��3 ����������_};���D�>��{���I�TZ�S������%�����!#�v��7����w����O:|�q@7+����?����'���<�Q
��=xwd�����������{
A�����������?c]�	��:�]�a�������e,m?������`+�c�49zzuO���n'�8�������#�]���#�d�������������9����E��T��$���>�����4����
]���������wXl���2���s���E<-=�-P����$~QR� 9�����#������8<�^B�/��Y�p^zS���S�$S���r��� ��X����^Bt��J�������'sR1������;>������3�<t�I����77���J%h����9%-��q��W��i��G�N����vz:5y���lz7��y����@U����5�Z������wkvFw��wJ���%�!y�2�Pa����_=��v&G����V����J"i�e����@���^`�Q�����F-����*['�A��z��xz������W3�k���7�wH��zZ[C!6�����*r��{��}p+����'���=��K��llj^��9s��Z�U���5qQ������|{����;�V����W�m��}vg<=:g���M����`�R��VG��H�\��*E�\��M�,6Z��e���m�?s�g�.�U���q��{�Uq�4z�3RT����6��D��k�� �����n��`k[�?]���[j��e]�{�
^�LguQ�K��K:+���a9�[uv3���_?n�.��II������daE;���4����D�]pW�����A��^3'����8#[n�g����_���^-#�>��|����q�>��v�t�w�������I�1�Z���x�s�/����xsJ���(,��}7E�rQ)
����ar�fe��?N��V�6�?������Ox>�xt	IY�����1�>�P�y��S���_|��O���W��2�������w��-�8�^����$n����������&��KY! ����I�m�����\�5�^�
1���;��_�^N�Y�����>Y��D]K�d�=;+�^s�kE�OR�_TG����v����#9�����	Yt'Tb����u�F�G���v������6����r�^A�2��M���-��	���'c;Y�w?�{����y���A/���W��L���R�����������y���+�'��5"���X�n���@'l���5<����sX>����*��S�6u��w=Q*2�g�c�9�O
������X(c�jsM*��1�����.<}Y!s�S�pK1@������������N������h�1N�������M���~Q��-1����/��K,^b�����SPXi+_�0LB]���IJ��	���x�	��-���|^]��|����s�PR� 
�h�����������;b>�US�X@�;���)����u�:�S��.�������i��cp8�@���U��o0�(�`����}���`������
��pQ1��G �hR�@^~��GH4���b��Q����U�t4��|S9����
~�j�]*�?��]� �	R��S���*�[1's��{���b���L.�|P7����P?(��*�o�"�O[���A&=���i�L����bh�&c��/	f�H�$���=\T��1@h3���6%�WmL��^��?5&} rT�30�Ya�d�gi��`����)�q�b���bL�H�=A�@��V�{�*�s�!��2y-������&���+��E��r������d|+ �9���k0\����d���0��x\L�ZV3�-�0A{����Z1�����eZ��r��b�v���&�\��Lh[x��'p������?qX�R`�` &
I�V�	�y����60j���b)Y�VtpN1laU�&[�Y�L,^1�J)��*����y��uq�*�2�9�T��G�c5$Zo����<��������b*�������;����dR��mo��&*��+E��c�1�r��b*����3�i{:3m�h��W>T��G{�L��*&��)7�*�s�aK�:x�pN1���L�:��
��������5������Zc���;e��W���Y�g�������\:Fo]�^�������-��5��,������p4�������#T���2��q�����
M>����������q�L.�{Oq��J��s�6N�8�r9P�gd�4h�,{xz�&a`O��904p������xxD>Q�����O���#[�D�@7��\��0'��c=7�q�?>��Q��c|*���g:xk�j�����Bl�T��*Pdgx4���$M.���"���(���|+q]9~|�e�X�����|������\����0N���y"��ftgk1�_�4�HP[�SH�R�{]������jW?wG���
����Y��Uu���-�$��;������m�6?���z��\�i�����.6��I���)�������o?�����d����c�G���;:nQw��� /'������}Uz���8�������^�����NW�*��������nN\�z8�8�J��jGo���>W�7/^��e�v���E5C�F�m�,��Z���o��I��c����;��5����_|�h�������}�K[�9��,qA�s��u-��xV3�&1��;����.tN~��;v8�nLl������+.�<�����K���&���g��!����bmv�����v������`���S�<���������X�����wc���Zjx�=����_�o���cf�V�yJ��V_���s~���)n)�o[��A��D.� [�l���yhc6>&��I����������f���?:��8\��*_�7�G4K$y��9������
�'v�t���v�L����&�}�u�7�������_iU�cO�`�v~�wv�eb'�[y�)K6�l�_4�i����V�����F]�x��+?N��vn���c���_���5����^�aa�w^��1~�@�����]�	�����&�{'��]{8y�7����m��M�{+�p I�5��W.�,�������@��G����������wb:��m����$;���1������W)��������������7���uN���[�sK1���y�~t� l�JL����YI2����n�J��ddJC5Zt�-:�5g�f\�G#�N���������Q��R�
��j���]�bg�V`KG�JYlx��e&��|�*��a'���[�Vht�5cH����.���
�H�E�B3l�X7�%���?b����jZ6DM�P�T��V��u#I�������x|C���"�Z��i@��BR5ba��-�VY���x���t�~-������9y�)h��[��3|��wn)�E����:���`�pS�����i�kP(��$���}(�X�&GO8���p�Mk�i;{����4?��]Q�_������g�������]Q�\������U	��X�+���T�|���Kf���8�kW|,�@��0�������g�s��uozP*c	�3�)�qaZ���{8����Qo����j��{�Z�����:����
�IK.���B�������C�7��T������5��(h�_HI��e����xsER�.�a��X���#5�F}6���u<�����}AA����u pq
����}g���m���b�_��0�����)���@��}���p��j|���1�C��/w��������4�H,.!��C����:��������P���__���b���u��\4s�����5�3��V�Y31����/2���p�����k�k��o����lngl?������y���]��9.'#�*7XssJSY�hY�cg���`��l[��:�x����i�b��d���u4N�q��k���_l�%�L����H��i��]��_���mz����FS
��^{���lv�{H�p��}�xy���k���k��M��~�c7�$�1�KF_�s��
�������h�d���������E�G�G��@n�Z�o�#���Z�.��{�*��xgR��`J�"l"$#�)Y0~���d%q��aUK��; /���I~?hU+(��~��f&������~�@Pg���p��i��]��Kw8B]>��n���������tj]�ts�]�0>�����4�\��^=������[8�0B�Yb��4���bl�4N���#(�����e3�+�2����R;`��h��b
�S%���~^+���O��q�>����������d�%3�)u���0��G(��t�]6����Cr!ze�����������a��HHt����W����V}��G �����G,y�4�R�8��(Z�r��d��n�%��W���:��jYc����B���hF����m�.n�H\t�,;SY���(-�M�,;?M5��������n"�X�j�o3~�-����p���7�&%����-�z�f*�]_k�]����Fs�?�	����
�J��F^�������Od�i�����>���������#�'d���L�3Mw���5�X���]}�n����-�	���a�c����'��]���nM�;3iq���MSc�w�F���Z��6�HO�����yI���|t��g�B����'���\�S��+Cl�[8|��e1^��~����m"vVs�+,|�=`�/�fC��x����#���4m�T�h����i�����g!�	�������O���}������I+F��qN1<��rR��m�5�%�.0��da������T�{�����}Lm�����G����������d��`mx+���}��N���D��d�dh������L�3��a��l������|t���+��7z�_��cGQ��H'����H=��i��c�hIi�#�s0J�~���i��Q*��W+a��;��lk*��Z����\���n��VW���K�-��9h����j����m�w_�ug{��
�����*�l�6D�2#_E�_�s�6�� K>����z.�%��m4S����fK3�s�J �`�b�B���[�V|��x�"l��KN�f�4�b��7��X���Y?���z���Q#�{��3��[IHc�&�N�}�(�MC��oI���?%��
�Iz�+P���l���{��b�<��������o�^hU�]�V��y�&cYW�T��B�������K�����b��?��W�D���\����A�7\�MI��Lo������1�k�liG�.��W����N�Z|���0�O� �V������-����i>������<�
�8k�4�AK����|9n����K�����|7&�����)'������yu�����?�DH��v����j�i�j���|�5;S����F�.������I�+����u�������;�N���A�V����:����,N���/���8ON?�9x�Ek����N$�dN6
�������jH4��K����v.]�r:������c�LV��
�P�O�EH$�	�T9�-qO�k�L,us��w�(	���Y9K�"Qs��h���k����j=I�F$t#�D����������\�
v_�	������:����)"{<m�����!��s��������O�����{|{P������Z���>�8/q~9�5�������c�L�xb<S��#��F��i_��f4��"�l��M<b>B#7��@���3��Q�����~���; ZvnM�K6��AH�M�/��|7�Y*`�N[��c�b�@�A�)��;�b�@�A�)��;�b�@�A�)��;�b�@�A�)��;�b�@�A�)��;�b�@�A�)��;�b�@�A�)��;�b�@�A���|�N`<Wb%��J���:�,^.|2`�!TP�2z�P�<d�Z���R�S!���H�-0u�!�P�Er)wm�L053$hg�����v�ax��z_&N[MZ��R�u�i��;+���Y�9��@��)a�GLv��2S/�+$�
��21:�� ��R��%m2�3�2�,�eD9����@�A�)����b{@u�����z$�������;3�$3�4&O�Ur�bH��O"�O=L�&Lv�)1��� >���/S�R�$j��d�l�ih�J��t�1�L���F1p'P*0W%)�z�9k�����I���`������HkERx_��1j-T��������`��11�?���p��2S���y�]<Skj)��x��ClL�E{��i�c��$0�9xGFZf�V��S�'uC�����$s�9���1�K��L�CT����s��e����H�:�L��0G(���*?�*������TD�cxV�r8�<��ay�v�x�:�#j�0�1�*&{��������+A�]$��o	����w�d���53,4�/��4��E|��k;�b�(�;x��EL�r[������o���V�+Z����� o�\��/"���b���VX��e3��c�y��Ro}���BKC-a����%��P7Y��es@O�v � �Q&?���bY�P��fq�5�/1�HhF�|�D���0���EP�O|x�U��I+Y�T�G5��0���~��p� 0����a�1��b��q�PYQ�� �K�^	�T=]�pc����@}7��TI�z c��f���$C�kZ�Ty��FHez���%vL�T����^bq�p'��a}EK
���x����qi�r��P��W(L�|=���9�/��,�;���\��k$T��6M��01O�7*i�S
�r�|9K)N@%�����.K��I�i_l��|-b��+��U��Ti�Q�9"�uV!1eN��T{,<_���nbn�#f���?������<_�hoH�^��
:"���Y���J]!�e�F�H1v ��-����k��R��EY�wh��/�k0���7�OH`L���k�6��[��~��Y��,��b�sI��qk��3K
�<�����E�����m]*��c*�r�Xn�V����������l7��[�}Y{����/��Y+�mS���/�`�o�&a��r\%D���wz��9�1�A�H1v ��-�����|+��Z���ej`U)�Rz��i$�����5R�eR�L�h��A
R�9A~������&��r���AVt0��PpJ��o���l)���7Bz\�_C�3S�B-�i���4��Z��X��zM��UhF�%�����}��0�N��C���[�`���`R���eC����� 8���@�A�)��;�b�@�A�)��;�b�@�A�)��;�b�@�A�)��;�b�@�A������/q�����M����!Qwd�U�?��W���|w]���H�P���������C���#�sF��hs�|tM�i;)���L_5S����qh��&�!~�F� �+��
u/@97T�k���E�a0)��hj���H�_LRP�������./�m0�+e"�6�������T�fRH�.07c$JQ�\���k������L�N]O������J���
!�,R�RK�� �0������B�2���Y	"�$V
���;$��R��&j�J*���)[j��t��c� \�.{U���037R[���@�A�)��cmR���d��Z_qZ�&,<����HOL&��=�r�0~V�v-��K�t�,�}B�oLn%�r&�NMC�W��y������m������G�T���*I9P�39�p��"Y��!�N��PK�^	���\
�m��Z�:l����,����� �����C=�y%]3l���[�}�6����{��!�_��04	Lg����������I��35�&��eB�+��z����&�d��*�3|/sv�G
���e
���9B�<]W�AVyV��"*4[=(&���`����;�Fp
���b����HO,����Er��!+�"�]���������G����Ku3��g����;���}����� G��c�Kt:L�2�[��h�;hK��W��IDreH"�����%�cXa���!������KK�����-
��%B���0�+C�dAo�!�=]#���`G�����euCU~����h���#�����^�� ��A�>���W�S'�d�SI3D��F���c��S���V�@�����+_�#���R�^B�J������d���3�o-��A\��J �1������H2����H���j��Q��k��Yb��0�a����m����P0��o"�A�0�<_;J��Z�WK�_��I���b�wZ�1s
��;���#g����4�x�Lj�c�c��L*�VC���r����T&~���
5�h��/����]_%���hd��Z$���r�R����f���������2s�"��5<+3���������X��-���`�?i�~ui��dIEND�B`�PK\"#EConfigurations2/images/Bitmaps/PK\"#EConfigurations2/popupmenu/PK\"#EConfigurations2/toolpanel/PK\"#EConfigurations2/statusbar/PK\"#EConfigurations2/progressbar/PK\"#EConfigurations2/toolbar/PK\"#EConfigurations2/menubar/PK\"#E'Configurations2/accelerator/current.xmlPKPK\"#EConfigurations2/floater/PK\"#E
styles.xml�Ymo�6��_a(@��);�������KQd�`$J&J�I�q���H��Q��tK�4"��=|��;97�O���TL��`1��#�<�~A������$,��XDeFs��>r�f`��u5�	J��QL�s�Q���Z4wF�&zm]U#v�������IO56��-y�������$����6�1��Iq��� �uX<q��;��5���a~�����j��v��<�(%��8��S�L��|�6��L�g�MJy�=R9Y�I/���
 �]���j���k�N��}:"s�#rr�Yp;U.���r7m3�w#���w0i��~�+�M�e�-�"������M{!��j��n�.��
W�
��,� ������#�#����D���Iy��f��+���BjO$�~��:KTw:��G��:h*�x
t.1[84h����u����-���R�z���=6s�\�p!�7v�J,��+	��r�����F\mo����g��i��d&�p�$c���i��9���J��DLe��LGp8�D��w��d��v�����)5�wO��(:���C������,^�B�7]`5z��Qi�}
����'��;K�z�j'��&��u��V�)�F�<p��H�JR�P!!��f��TS��UD�b�4�M���:�N���Z���I���<-I
c4��(s-��o���!�������@���}���zU7�v`mS�8}�F����w�����&2�oo�^��V*M��A4��2�#�k��]�R�Hn
����J)��<�Z�W�C����P�M��eI�������M;b����^�=��"#;��e�������ep~�n6��,l[m�7$i�~�(�X������$	�����4�-����Z�J��>���UM�����|�"�o6>�HQ�ZQ�J�3��,�Q��:4��W�3���P�n��}��=U����m�_�P{g�R����&�En=.s����AF	�������%�#�:}�S�{�9(Kw��@n��+����ik��n�����WJ���_o4P����4G
z)S��+r�C�J��V�%����,� �i�M�!4h���g	�d����2���Y+��"I]8�H*E�.��4�Wa�$�t0��Inxocj�6����E�[{�+�E0��2�D�d��M�%�����`w���K���ufn�]�p��i��0�)�p���������G��y�
}w���T�y��	Pl���L���;3�(dl>�Y��V�������~Y@A��#�p.���>�
�����B����@@�Oi/Oz;���������wU=���P�V����5C�s0p�U��7����_�(��������G�����0�b�t��^����������v�n�_c�G�-��,���9i����(��)��~j
�!��b�^V8�4oB��W=!Z�ZC�u�CEG;*��<�.NV��B�
���6�����n��9��>����J�6�-�f�	�oV���	����������g�'PKb6,��2PK\"#EMETA-INF/manifest.xml�S�n� ��+"����	5�a����F�	��j�~I����I�����=c��������El��|���������E���M0�: ��������QGC�4������hKd�5_�J�2��Uu����z���%;@�L�C�F������ju�V��������t��$/�]��N��PB�e���jl�6[6r�}$)�8)z8����������A%�o�%|�q��C9e��r��5b����u��2�����l��}_1*8�������������7��7�~PK����PK\"#E�l9�..mimetypePK\"#E��DSSTmeta.xmlPK\"#E!���+�!�settings.xmlPK\"#Er���
E�2content.xmlPK\"#E�O,�g'g'yThumbnails/thumbnail.pngPK\"#E=Configurations2/images/Bitmaps/PK\"#ES=Configurations2/popupmenu/PK\"#E�=Configurations2/toolpanel/PK\"#E�=Configurations2/statusbar/PK\"#E�=Configurations2/progressbar/PK\"#E5>Configurations2/toolbar/PK\"#Ek>Configurations2/menubar/PK\"#E'�>Configurations2/accelerator/current.xmlPK\"#E�>Configurations2/floater/PK\"#Eb6,��2
.?styles.xmlPK\"#E����=EMETA-INF/manifest.xmlPK6�F

#28

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#25)

Re: Scaling shared buffer eviction

On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

I have updated the patch to address the feedback. Main changes are:

1. For populating freelist, have a separate process (bgreclaimer)
instead of doing it by bgwriter.
2. Autotune the low and high threshold values for buffers
in freelist. I have used the formula as suggested by you upthread.
3. Cleanup of locking regimen as discussed upthread (completely
eliminated BufFreelist Lock).
4. Improved comments and general code cleanup.

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section "Background Reclaim".

I don't mind changing it, but currently used title is based on similar
title "Background Writer's Processing". It is used in previous
paragraph. Is there a reason to title this differently?

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with

"bgreclaimer".

Again what I have used is matching with BgWriter's explanation. I thought
it would be better if wording used is similar.

+ while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop
start and then just decremented.

It is based on the idea what bgwriter does for num_to_scan and
calling it once has advantage that we need to take freelist_lck
just once.

Shouldn't we loop and do the whole
thing over once we reach the high watermark, only stopping when
StrategySyncStartAndEnd() says num_to_free is 0?

Do you mean to say that for high water mark check, we should
always refer StrategySyncStartAndEnd() rather than getting the
value in begining?

Are you thinking that somebody else would have already put some
buffers onto freelist which would change initially identified high water
mark? I think that can be done only during very few and less used
operations. Do you think for that we start referring
StrategySyncStartAndEnd() in loop?

In freelist.c, it seems like a poor idea to have two spinlocks as
consecutive structure members; they'll be in the same cache line,
leading to false sharing. If we merge them into a single spinlock,
does that hurt performance?

I have kept them separate so that backends searching for a buffer
in freelist doesn't contend with bgreclaimer (while doing clock sweep)
or clock sweep being done by other backends. I think it will be bit
tricky to devise a test where this can hurt, however it doesn't seem
too bad to have two separate locks in this case.

If we put them further apart, e.g. by
moving the freelist_lck to the start of the structure, followed by the
latches, and leaving victimbuf_lck where it is, does that help
performance?

I can investigate.

+            /*
+             * If the buffer is pinned or has a nonzero usage_count,
we cannot use
+             * it; discard it and retry.  (This can only happen if

VACUUM put a

+             * valid buffer in the freelist and then someone else
used it before
+             * we got to it.  It's probably impossible altogether as
of 8.3, but
+             * we'd better check anyway.)
+             */
+

This comment is clearly obsolete.

Okay, but this patch hasn't changed anything w.r.t above comment,
so I haven't changed it. Do you want me to remove second part of
comment starting with "(This can only happen"?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#29

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#28)

Re: Scaling shared buffer eviction

On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section "Background Reclaim".

I don't mind changing it, but currently used title is based on similar
title "Background Writer's Processing". It is used in previous
paragraph. Is there a reason to title this differently?

Oh, I didn't see that. Seems like weird phrasing to me, but I guess
it's probably better to keep it consistent.

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with
"bgreclaimer".

Again what I have used is matching with BgWriter's explanation. I thought
it would be better if wording used is similar.

OK.

+ while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop
start and then just decremented.

It is based on the idea what bgwriter does for num_to_scan and
calling it once has advantage that we need to take freelist_lck
just once.

Right, we shouldn't call it every loop iteration. However, consider
this scenario: there are no remaining buffers on the list and the high
watermark is 2000. We add 2000 buffers to the list. But by the time
we get done, other backends have already done 500 more allocations, so
now there are only 1500 buffers on the list. If this should occur, we
should add an additional 500 buffers to the list before we consider
sleeping. We want bgreclaimer to be able to run continuously if the
demand for buffers is high enough.

In freelist.c, it seems like a poor idea to have two spinlocks as
consecutive structure members; they'll be in the same cache line,
leading to false sharing. If we merge them into a single spinlock,
does that hurt performance?

I have kept them separate so that backends searching for a buffer
in freelist doesn't contend with bgreclaimer (while doing clock sweep)
or clock sweep being done by other backends. I think it will be bit
tricky to devise a test where this can hurt, however it doesn't seem
too bad to have two separate locks in this case.

It's not. But if they are in the same cache line, they will behave
almost like one lock, because the CPU will lock the entire cache line
for each atomic op. See Tom's comments upthread.

Okay, but this patch hasn't changed anything w.r.t above comment,
so I haven't changed it. Do you want me to remove second part of
comment starting with "(This can only happen"?

Right. Clearly it can happen again once we have this patch: that's
the entire point of the patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30

Mark Kirkwood

mark.kirkwood@catalyst.net.nz

over 11 years ago

In reply to: Amit Kapila (#27)

Re: Scaling shared buffer eviction

On 03/09/14 16:22, Amit Kapila wrote:

On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Aug 28, 2014 at 4:41 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

I have yet to collect data under varying loads, however I have
collected performance data for 8GB shared buffers which shows
reasonably good performance and scalability.

I think the main part left for this patch is more data for various loads
which I will share in next few days, however I think patch is ready for
next round of review, so I will mark it as Needs Review.

I have collected more data with the patch. I understand that you
have given more review comments due to which patch require
changes, however I think it will not effect the performance data
to a great extent and I have anyway taken the data, so sharing the
same.

Performance Data:
-------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Common configuration remains same as above.

Forgot to mention that data is a median of 3 runs and attached
sheet contains data for individual runs.

Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Cheers

Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Mark Kirkwood (#30)

Re: Scaling shared buffer eviction

On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
wrote:

Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Last time I ran the tpc-b test of pgbench (results of which are
posted earlier in this thread), there doesn't seem to be any major
gain for that, however for cases where read is predominant, you
might see better gains.

I am again planing to take that data in next few days.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#32

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#29)

Re: Scaling shared buffer eviction

On Wed, Sep 3, 2014 at 8:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

+ while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop
start and then just decremented.

It is based on the idea what bgwriter does for num_to_scan and
calling it once has advantage that we need to take freelist_lck
just once.

Right, we shouldn't call it every loop iteration. However, consider
this scenario: there are no remaining buffers on the list and the high
watermark is 2000. We add 2000 buffers to the list. But by the time
we get done, other backends have already done 500 more allocations, so
now there are only 1500 buffers on the list. If this should occur, we
should add an additional 500 buffers to the list before we consider
sleeping. We want bgreclaimer to be able to run continuously if the
demand for buffers is high enough.

Its not difficult to handle such cases, but it can have downside also
for the cases where demand from backends is not high.
Consider in above case if instead of 500 more allocations, it just
does 5 more allocations, then bgreclaimer will again have to go through
the list and move 5 buffers and same can happen again by the time
it moves 5 buffers. Another point to keep in mind here is that in this
loop we are reducing the usage_count of buffers as well incase we don't
find buffer with usage_count=0. OTOH if we let bgreclaimer to go for
sleep after it moves initially identified buffers, then the backend which
first finds that the buffers in freelist falls below low water mark can
wake bgreclaimer.

In freelist.c, it seems like a poor idea to have two spinlocks as
consecutive structure members; they'll be in the same cache line,
leading to false sharing. If we merge them into a single spinlock,
does that hurt performance?

I have kept them separate so that backends searching for a buffer
in freelist doesn't contend with bgreclaimer (while doing clock sweep)
or clock sweep being done by other backends. I think it will be bit
tricky to devise a test where this can hurt, however it doesn't seem
too bad to have two separate locks in this case.

It's not. But if they are in the same cache line, they will behave
almost like one lock, because the CPU will lock the entire cache line
for each atomic op. See Tom's comments upthread.

I think to avoid having them in same cache line, we might need to
add some padding (at least 72 bytes) as the structure size including both
the spin locks is 56 bytes on PPC64 m/c and cache line size is 128 bytes.
I have taken performance data as well by keeping them further apart
as suggested by you upthread and by introducing padding, but the
difference in performance is less than 1.5% (on 64 and 128 client count)
which also might be due to variation of data across runs. So now to
proceed we have below options:

a. use two spinlocks as in patch, but keep them as far apart as possible.
This might not have an advantage as compare to what is used currently
in patch, but in future we can adding padding to take the advantage if
possible (currently on PPC64, it doesn't show any noticeable advantage,
however on some other m/c, it might show the advantage).

b. use only one spinlock, this can have disadvantage in certain cases
as mentioned upthread, however those might not be usual cases, so for
now we can consider them as lower priority and can choose this option.

Another point in this regard is that I have to make use of volatile
pointer to prevent code rearrangement in this case.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#33

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#26)

Re: Scaling shared buffer eviction

On Wed, Sep 3, 2014 at 9:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Performance Data:
-------------------------------

Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Common configuration remains same as above.

Shared_Buffers = 500MB
Client Count/Patch_Ver 8 16 32 64 128 HEAD 56248 100112 121341 81128
56552 Patch 59389 112483 157034 185740 166725
..

Observations
---------------------
1. Performance improvement is upto 2~3 times for higher client
counts (64, 128).
2. For lower client count (8), we can see 2~5 % performance
improvement.
3. Overall, this improves the read scalability.
4. For lower number of shared buffers, we see that there is a minor
dip in tps even after patch (it might be that we can improve it by
tuning higher water mark for the number of buffers on freelist, I will
try this by varying high water mark).

I have taken performance data by varying high and low mater marks
for lower value of shared buffers which is as below:

Shared_buffers = 500MB
Scale_factor = 3000

HM - High water mark, 0.5 means 0.5% of total shared buffers
LM - Low water mark, 20 means 20% of HM.

Client Count/Patch_Ver (Data in tps) 128 HM=0.5;LM=20 166725 HM=1;LM=20
166556 HM=2;LM=30 166463 HM=5;LM=30 166107 HM=10;LM=30 167231

Observation
--------------------
a. There is hardly any difference by varying High and Low water marks
as compared to default values currently used in patch.
b. I think this minor dip as compare to 64 client count is because one
this m/c has 64 hardware threads due which scaling beyond 64 client
count is difficult and second at relatively lower buffer count (500MB),
there is still minor contention around Buf Mapping locks.

In general, I think with patch the scaling is much better (2 times) than
HEAD, even when shared buffers are less and client count is high,
so this is not an issue.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#34

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#32)

Re: Scaling shared buffer eviction

On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Its not difficult to handle such cases, but it can have downside also
for the cases where demand from backends is not high.
Consider in above case if instead of 500 more allocations, it just
does 5 more allocations, then bgreclaimer will again have to go through
the list and move 5 buffers and same can happen again by the time
it moves 5 buffers.

That's exactly the scenario in which we *want* the looping behavior.
If that's happening, then it means it's taking us exactly as long to
find 5 buffers as it takes the rest of the system to use 5 buffers.
We need to run continuously to keep up.

It's not. But if they are in the same cache line, they will behave
almost like one lock, because the CPU will lock the entire cache line
for each atomic op. See Tom's comments upthread.

I think to avoid having them in same cache line, we might need to
add some padding (at least 72 bytes) as the structure size including both
the spin locks is 56 bytes on PPC64 m/c and cache line size is 128 bytes.
I have taken performance data as well by keeping them further apart
as suggested by you upthread and by introducing padding, but the
difference in performance is less than 1.5% (on 64 and 128 client count)
which also might be due to variation of data across runs. So now to
proceed we have below options:

a. use two spinlocks as in patch, but keep them as far apart as possible.
This might not have an advantage as compare to what is used currently
in patch, but in future we can adding padding to take the advantage if
possible (currently on PPC64, it doesn't show any noticeable advantage,
however on some other m/c, it might show the advantage).

b. use only one spinlock, this can have disadvantage in certain cases
as mentioned upthread, however those might not be usual cases, so for
now we can consider them as lower priority and can choose this option.

I guess I don't care that much. I only mentioned it because Tom
brought it up; I don't really see a big problem with the way you're
doing it.

Another point in this regard is that I have to make use of volatile
pointer to prevent code rearrangement in this case.

Yep. Or we need to get off our duff and fix it so that's not necessary.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35

Kevin Grittner

kgrittn@ymail.com

over 11 years ago

In reply to: Robert Haas (#34)

Re: Scaling shared buffer eviction

Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 4, 2014 at 7:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Its not difficult to handle such cases, but it can have downside also
for the cases where demand from backends is not high.
Consider in above case if instead of 500 more allocations, it just
does 5 more allocations, then bgreclaimer will again have to go through
the list and move 5 buffers and same can happen again by the time
it moves 5 buffers.

That's exactly the scenario in which we *want* the looping behavior.
If that's happening, then it means it's taking us exactly as long to
find 5 buffers as it takes the rest of the system to use 5 buffers.
We need to run continuously to keep up.

That's what I was thinking, as long as there isn't a lot of
overhead to starting and finishing a cycle. If there is, my
inclination would be to try to fix that rather than to sleep and
hope things don't get out of hand before it wakes up again.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36

Alvaro Herrera

alvherre@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#29)

Re: Scaling shared buffer eviction

Robert Haas wrote:

On Wed, Sep 3, 2014 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section "Background Reclaim".

I don't mind changing it, but currently used title is based on similar
title "Background Writer's Processing". It is used in previous
paragraph. Is there a reason to title this differently?

Oh, I didn't see that. Seems like weird phrasing to me, but I guess
it's probably better to keep it consistent.

... or you can also change the other one.

--
ï¿½lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37

Mark Kirkwood

mark.kirkwood@catalyst.net.nz

over 11 years ago

In reply to: Amit Kapila (#31)

Re: Scaling shared buffer eviction

On 04/09/14 14:42, Amit Kapila wrote:

On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
wrote:

Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Last time I ran the tpc-b test of pgbench (results of which are
posted earlier in this thread), there doesn't seem to be any major
gain for that, however for cases where read is predominant, you
might see better gains.

I am again planing to take that data in next few days.

FWIW below are some test results on the 60 core beast with this patch
applied to 9.4. I'll need to do more runs to iron out the variation, but
it looks like the patch helps the standard (write heavy) pgbench
workload a little, and clearly helps the read only case.

4x E7-4890 15 cores each.
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD

Ubuntu 14.04 (Linux 3.13)
Postgres 9.4 beta2
+ buffer eviction patch v5

Pgbench

scale 2000

Non default params:

max_connections = 400;
shared_buffers = "10GB";
maintenance_work_mem = "1GB";
effective_io_concurrency = 10;
wal_buffers = "256MB";
checkpoint_segments = 1920;
checkpoint_completion_target = 0.8;
ssl = 'off';
wal_sync_method = 'open_datasync';

read write

elapsed 600s

Clients | tps | tps (unpatched)
---------+-------+----------------
6 | 8279 | 8328
12 | 16260 | 16381
24 | 23639 | 23451
48 | 31430 | 31004
96 | 38516 | 34777
192 | 33535 | 32443
384 | 27978 | 25068
384 | 30589 | 28798

read only

elapsed 300s

Clients | tps | tps (unpatched)
---------+--------+----------------
6 | 57654 | 57255
12 | 111361 | 112360
24 | 220304 | 187967
48 | 384567 | 230961
96 | 380309 | 241947
192 | 330865 | 214570
384 | 315516 | 207548

Regards

Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#25)

2 attachment(s)

Re: Scaling shared buffer eviction

On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

I have updated the patch to address the feedback. Main changes are:

1. For populating freelist, have a separate process (bgreclaimer)
instead of doing it by bgwriter.
2. Autotune the low and high threshold values for buffers
in freelist. I have used the formula as suggested by you upthread.
3. Cleanup of locking regimen as discussed upthread (completely
eliminated BufFreelist Lock).
4. Improved comments and general code cleanup.

Overall this looks quite promising to me.

I had thought to call the new process just "bgreclaim" rather than
"bgreclaimer", but perhaps your name is better after all. At least,
it matches what we do elsewhere. But I don't care for the use
"Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
else "bgreclaimer".

Changed it to bgreclaimer.

This is unclear:

+buffers for replacement. Earlier to protect freelist, we use LWLOCK as

that

+is needed to perform clock sweep which is a longer operation, however

now we

+are using two spinklocks freelist_lck and victimbuf_lck to perform

operations

+on freelist and run clock sweep respectively.

I would drop the discussion of what was done before and say something
like this: The data structures relating to buffer eviction are
protected by two spinlocks. freelist_lck protects the freelist and
related data structures, while victimbuf_lck protects information
related to the current clock sweep condition.

Changed, but I have not used exact wording mentioned above, let me know
if new wording used is okay?

+always in this list.  We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by background

process

+reclaimer. The list is singly-linked using fields in the

I suggest: Allocating pages from this list is much cheaper than
running the "clock sweep" algorithm, which may encounter many buffers
that are poor candidates for eviction before finding a good candidate.
Therefore, we have a background process called bgreclaimer which works
to keep this list populated.

Changed as per your suggestion.

+Background Reclaimer's Processing
+---------------------------------

I suggest titling this section "Background Reclaim".

+The background reclaimer is designed to move buffers to freelist that are

I suggest replacing the first three words of this sentence with

"bgreclaimer".

As per discussion in thread, I have kept it as it is.

+and move the the unpinned and zero usage count buffers to freelist.  It
+keep on doing this until the number of buffers in freelist become equal
+high threshold of freelist.
s/keep/keeps/
s/become equal/reaches the/
s/high threshold/high water mark/
s/of freelist//

Changed as per your suggestion.

Please change the other places that say threshold to use the "water
mark" terminology.

+ if (StrategyMoveBufferToFreeListEnd (bufHdr))

Extra space.

+ * buffers in consecutive cycles.

s/consecutive/later/

+ /* Execute the LRU scan */

s/LRU scan/clock sweep/ ?

Changed as per your suggestion.

+ while (tmp_num_to_free > 0)

I am not sure it's a good idea for this value to be fixed at loop
start and then just decremented. Shouldn't we loop and do the whole
thing over once we reach the high watermark, only stopping when
StrategySyncStartAndEnd() says num_to_free is 0?

Okay, changed the loop as per your suggestion.

+ /* choose next victim buffer to clean. */

This process doesn't clean buffers; it puts them on the freelist.

Right. Changed it to match what it does.

+ * high threshold of freelsit), we drasticaly reduce the odds for

Two typos.

Fixed.

+ * of buffers in freelist fall below low threshold of freelist.

s/fall/falls/

Changed as per your suggestion.

In freelist.c, it seems like a poor idea to have two spinlocks as
consecutive structure members; they'll be in the same cache line,
leading to false sharing. If we merge them into a single spinlock,
does that hurt performance? If we put them further apart, e.g. by
moving the freelist_lck to the start of the structure, followed by the
latches, and leaving victimbuf_lck where it is, does that help
performance?

As per discussion, I have kept them as it is and added a comment
indicating that we can consider having both locks in separate
cache lines.

+            /*
+             * If the buffer is pinned or has a nonzero usage_count,
we cannot use
+             * it; discard it and retry.  (This can only happen if

VACUUM put a

+             * valid buffer in the freelist and then someone else
used it before
+             * we got to it.  It's probably impossible altogether as
of 8.3, but
+             * we'd better check anyway.)
+             */
+

This comment is clearly obsolete.

Removed.

I have not yet added statistics (buffers_backend_clocksweep) as
for that we need to add one more variable in BufferStrategyControl
structure where I have already added few variables for this patch.
I think it is important to have such a stat available via
pg_stat_bgwriter, but not sure if it is worth to make the structure
bit more bulky.

I think it's worth it.

Okay added new statistic.

Another minor point is about changes in lwlock.h
lwlock.h
* if you remove a lock, consider leaving a gap in the numbering
* sequence for the benefit of DTrace and other external debugging
* scripts.

As I have removed BufFreelist lock, I have adjusted the numbering
as well in lwlock.h. There is a meesage on top of lock definitions
which suggest to leave gap if we remove any lock, however I was not
sure whether this case (removing the first element) can effect anything,
so for now, I have adjusted the numbering.

Let's leave slot 0 unused, instead.

Sure, that make sense.

Apart from above, I think for this patch, cat version bump is required
as I have modified system catalog. However I have not done the
same in patch as otherwise it will be bit difficult to take performance
data.

Performance Data with updated patch

Configuration and Db Details

IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client Count/Patch_Ver (tps) 8 16 32 64 128 HEAD 58614 107370 140717
104357 65010 Patch 60092 113564 165014 213848 216065

This data is median of 3 runs, detailed report is attached with mail.
I have not repeated the test for all configurations, as there is no
major change in design/algorithm which can effect performance.
Mark has already taken tpc-b data which ensures that there is
no problem with it, however I will also take it once with latest version.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v6.patchapplication/octet-stream; name=scalable_buffer_eviction_v6.patchDownload

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5e9e735..48d642a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -815,6 +815,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Number of buffers allocated</entry>
      </row>
      <row>
+      <entry><structfield>buffers_backend_clocksweep</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of buffer allocations that are not satisfied from
+      freelist</entry>
+     </row>
+     <row>
       <entry><structfield>stats_reset</></entry>
       <entry><type>timestamp with time zone</type></entry>
       <entry>Time at which these statistics were last reset</entry>
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..38698b0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "postmaster/bgreclaimer.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -179,7 +180,8 @@ static IndexList *ILHead = NULL;
  *	 AuxiliaryProcessMain
  *
  *	 The main entry point for auxiliary processes, such as the bgwriter,
- *	 walwriter, walreceiver, bootstrapper and the shared memory checker code.
+ *	 walwriter, walreceiver, bgreclaimer, bootstrapper and the shared
+ *	 memory checker code.
  *
  *	 This code is here just because of historical reasons.
  */
@@ -323,6 +325,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case WalReceiverProcess:
 				statmsg = "wal receiver process";
 				break;
+			case BgReclaimerProcess:
+				statmsg = "reclaimer process";
+				break;
 			default:
 				statmsg = "??? process";
 				break;
@@ -437,6 +442,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			WalReceiverMain();
 			proc_exit(1);		/* should never return */
 
+		case BgReclaimerProcess:
+			/* don't set signals, bgreclaimer has its own agenda */
+			BackgroundReclaimerMain();
+			proc_exit(1);		/* should never return */
+
 		default:
 			elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
 			proc_exit(1);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1bde175..f4717c6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -712,6 +712,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_written_backend() AS buffers_backend,
         pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
         pg_stat_get_buf_alloc() AS buffers_alloc,
+		pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..168d0d8 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o bgreclaimer.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..3df2337
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c
@@ -0,0 +1,302 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they would only do when they don't find a usable shared buffer from
+ * freelist to read in another page).  In the best scenario all requests
+ * for shared buffers will be fulfilled from freelist as the background
+ * reclaimer process always tries to maintain buffers on freelist.  However,
+ * regular backends are still empowered to run clock sweep to find a usable
+ * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
+ *
+ * The bgwriter is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGTERM, which instructs the bgreclaimer to exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the bgreclaimer will
+ * simply abort and exit on SIGQUIT.
+ *
+ * If the bgreclaimer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/bgreclaimer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgreclaimer.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+
+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for bgreclaim process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+BackgroundReclaimerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext bgreclaim_context;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.  (bgreclaim probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us.
+	 *
+	 * bgreclaim doesn't participate in ProcSignal signalling, but a SIGUSR1
+	 * handler is still needed for latch wakeups.
+	 */
+	pqsignal(SIGHUP, BgreclaimSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, ReqShutdownHandler);		/* shutdown */
+	pqsignal(SIGQUIT, bgreclaim_quickdie);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, bgreclaim_sigusr1_handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  As of now, the memory allocation can be done
+	 * only during processing of SIGHUP signal.
+	 */
+	bgreclaim_context = AllocSetContextCreate(TopMemoryContext,
+											 "Background Reclaim",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(bgreclaim_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().  We don't have very many resources to worry
+		 * about in bgreclaim, but we do have buffers and file descriptors.
+		 */
+		UnlockBuffers();
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(bgreclaim_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	StrategyInitBgReclaimerLatch(&MyProc->procLatch);
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		int			rc;
+
+		/* Clear any already-pending wakeups */
+		ResetLatch(&MyProc->procLatch);
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Normal exit from the bgwriter is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Backend will signal bgreclaimer when the number of buffers in
+		 * freelist falls below than low water mark of freelist.
+		 */
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+					   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgMoveBuffersToFreelist();
+
+		/*
+		 * Send off activity statistics to the stats collector
+		 */
+		pgstat_send_bgwriter();
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			exit(1);
+	}
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * bgreclaim_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+bgreclaim_quickdie(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+BgreclaimSigHupHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_SIGHUP = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGTERM: set flag to shutdown and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	shutdown_requested = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+bgreclaim_sigusr1_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	latch_sigusr1_handler();
+
+	errno = save_errno;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7f41a5..7475e5a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5021,6 +5021,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.buf_backend_clocksweep += msg->m_buf_backend_clocksweep;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 14535c8..565cf4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -143,13 +143,13 @@
  * authorization phase).  This is used mainly to keep track of how many
  * children we have and send them appropriate signals when necessary.
  *
- * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list.  Autovacuum worker and walsender are in it.
- * Also, "dead_end" children are in it: these are children launched just for
- * the purpose of sending a friendly rejection message to a would-be client.
- * We must track them because they are attached to shared memory, but we know
- * they will never become live backends.  dead_end children are not assigned a
- * PMChildSlot.
+ * "Special" children such as the startup, bgwriter, bgreclaimer and
+ * autovacuum launcher tasks are not in this list.  Autovacuum worker and
+ * walsender are in it.  Also, "dead_end" children are in it: these are
+ * children launched just for the purpose of sending a friendly rejection
+ * message to a would-be client.  We must track them because they are attached
+ * to shared memory, but we know they will never become live backends.
+ * dead_end children are not assigned a PMChildSlot.
  *
  * Background workers that request shared memory access during registration are
  * in this list, too.
@@ -243,7 +243,8 @@ static pid_t StartupPID = 0,
 			AutoVacPID = 0,
 			PgArchPID = 0,
 			PgStatPID = 0,
-			SysLoggerPID = 0;
+			SysLoggerPID = 0,
+			BgReclaimerPID = 0;
 
 /* Startup/shutdown state */
 #define			NoShutdown		0
@@ -269,13 +270,13 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * hot standby during archive recovery.
  *
  * When the startup process is ready to start archive recovery, it signals the
- * postmaster, and we switch to PM_RECOVERY state. The background writer and
- * checkpointer are launched, while the startup process continues applying WAL.
- * If Hot Standby is enabled, then, after reaching a consistent point in WAL
- * redo, startup process signals us again, and we switch to PM_HOT_STANDBY
- * state and begin accepting connections to perform read-only queries.  When
- * archive recovery is finished, the startup process exits with exit code 0
- * and we switch to PM_RUN state.
+ * postmaster, and we switch to PM_RECOVERY state. The background writer,
+ * background reclaimer and checkpointer are launched, while the startup
+ * process continues applying WAL.  If Hot Standby is enabled, then, after
+ * reaching a consistent point in WAL redo, startup process signals us again,
+ * and we switch to PM_HOT_STANDBY state and begin accepting connections to
+ * perform read-only queries.  When archive recovery is finished, the startup
+ * process exits with exit code 0 and we switch to PM_RUN state.
  *
  * Normal child backends can only be launched when we are in PM_RUN or
  * PM_HOT_STANDBY state.  (We also allow launch of normal
@@ -505,6 +506,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
+#define StartBackgroundReclaimer() StartChildProcess(BgReclaimerProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -568,8 +570,8 @@ PostmasterMain(int argc, char *argv[])
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-	 * postmaster/syslogger.c, postmaster/bgworker.c and
-	 * postmaster/checkpointer.c.
+	 * postmaster/syslogger.c, postmaster/bgworker.c, postmaster/bgreclaimer.c
+	 * and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1583,7 +1585,8 @@ ServerLoop(void)
 		/*
 		 * If no background writer process is running, and we are not in a
 		 * state that prevents it, start one.  It doesn't matter if this
-		 * fails, we'll just try again later.  Likewise for the checkpointer.
+		 * fails, we'll just try again later.  Likewise for the checkpointer
+		 * and bgreclaimer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 			pmState == PM_HOT_STANDBY)
@@ -1592,6 +1595,8 @@ ServerLoop(void)
 				CheckpointerPID = StartCheckpointer();
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 		}
 
 		/*
@@ -2330,6 +2335,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(SysLoggerPID, SIGHUP);
 		if (PgStatPID != 0)
 			signal_child(PgStatPID, SIGHUP);
+		if (BgReclaimerPID != 0)
+			signal_child(BgReclaimerPID, SIGHUP);
 
 		/* Reload authentication config files too */
 		if (!load_hba())
@@ -2398,6 +2405,9 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
+				/* and the bgreclaimer too */
+				if (BgReclaimerPID != 0)
+					signal_child(BgReclaimerPID, SIGTERM);
 
 				/*
 				 * If we're in recovery, we can't kill the startup process
@@ -2440,14 +2450,16 @@ pmdie(SIGNAL_ARGS)
 				signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
+			if (BgReclaimerPID != 0)
+				signal_child(BgReclaimerPID, SIGTERM);
 			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
 				/*
-				 * Only startup, bgwriter, walreceiver, unconnected bgworkers,
-				 * and/or checkpointer should be active in this state; we just
-				 * signaled the first four, and we don't want to kill
-				 * checkpointer yet.
+				 * Only startup, bgwriter, walreceiver, bgreclaimer,
+				 * unconnected bgworkers, and/or checkpointer should be
+				 * active in this state; we just signaled the first five,
+				 * and we don't want to kill checkpointer yet.
 				 */
 				pmState = PM_WAIT_BACKENDS;
 			}
@@ -2600,6 +2612,8 @@ reaper(SIGNAL_ARGS)
 				BgWriterPID = StartBackgroundWriter();
 			if (WalWriterPID == 0)
 				WalWriterPID = StartWalWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2625,7 +2639,8 @@ reaper(SIGNAL_ARGS)
 		/*
 		 * Was it the bgwriter?  Normal exit can be ignored; we'll start a new
 		 * one at the next iteration of the postmaster's main loop, if
-		 * necessary.  Any other exit condition is treated as a crash.
+		 * necessary.  Any other exit condition is treated as a crash.  Likewise
+		 * for bgreclaimer.
 		 */
 		if (pid == BgWriterPID)
 		{
@@ -2636,6 +2651,17 @@ reaper(SIGNAL_ARGS)
 			continue;
 		}
 
+		if (pid == BgReclaimerPID)
+		{
+			BgReclaimerPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background reclaimer process"));
+			continue;
+		}
+
+
+
 		/*
 		 * Was it the checkpointer?
 		 */
@@ -2997,7 +3023,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, bgreclaimer or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3201,6 +3227,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the bgreclaimer too */
+	if (pid == BgReclaimerPID)
+		BgReclaimerPID = 0;
+	else if (BgReclaimerPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) BgReclaimerPID)));
+		signal_child(BgReclaimerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/*
 	 * Force a power-cycle of the pgarch process too.  (This isn't absolutely
 	 * necessary, but it seems like a good idea for robustness, and it
@@ -3371,14 +3409,14 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
 		 * (including autovac workers), no bgworkers (including unconnected
-		 * ones), and no walwriter, autovac launcher or bgwriter.  If we are
-		 * doing crash recovery or an immediate shutdown then we expect the
-		 * checkpointer to exit as well, otherwise not. The archiver, stats,
-		 * and syslogger processes are disregarded since they are not
-		 * connected to shared memory; we also disregard dead_end children
-		 * here. Walsenders are also disregarded, they will be terminated
-		 * later after writing the checkpoint record, like the archiver
-		 * process.
+		 * ones), and no walwriter, autovac launcher, bgwriter or bgreclaimer.
+		 * If we are doing crash recovery or an immediate shutdown then we
+		 * expect the checkpointer to exit as well, otherwise not. The
+		 * archiver, stats, and syslogger processes are disregarded since they
+		 * are not connected to shared memory; we also disregard dead_end
+		 * children here. Walsenders are also disregarded, they will be
+		 * terminated later after writing the checkpoint record, like the
+		 * archiver process.
 		 */
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
 			CountUnconnectedWorkers() == 0 &&
@@ -3388,7 +3426,8 @@ PostmasterStateMachine(void)
 			(CheckpointerPID == 0 ||
 			 (!FatalError && Shutdown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
-			AutoVacPID == 0)
+			AutoVacPID == 0 &&
+			BgReclaimerPID == 0)
 		{
 			if (Shutdown >= ImmediateShutdown || FatalError)
 			{
@@ -3486,6 +3525,7 @@ PostmasterStateMachine(void)
 			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
+			Assert(BgReclaimerPID == 0);
 			/* syslogger is not considered here */
 			pmState = PM_NO_CHILDREN;
 		}
@@ -3698,6 +3738,8 @@ TerminateChildren(int signal)
 		signal_child(WalReceiverPID, signal);
 	if (AutoVacPID != 0)
 		signal_child(AutoVacPID, signal);
+	if (BgReclaimerPID != 0)
+		signal_child(BgReclaimerPID, signal);
 	if (PgArchPID != 0)
 		signal_child(PgArchPID, signal);
 	if (PgStatPID != 0)
@@ -4778,6 +4820,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		CheckpointerPID = StartCheckpointer();
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(BgReclaimerPID == 0);
+		BgReclaimerPID = StartBackgroundReclaimer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -5122,6 +5166,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 						(errmsg("could not fork WAL receiver process: %m")));
 				break;
+			case BgReclaimerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork background writer process: %m")));
+				break;
 			default:
 				ereport(LOG,
 						(errmsg("could not fork process: %m")));
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..dfb9cb5 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,10 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+* BufferStrategyControl contains a spinlock freelist_lck that provides mutual
+exclusion for operations that access the buffer freelist or select
+buffers for replacement.  It also contains victimbuf_lck that protects
+information related to the current clock sweep condition.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -160,16 +156,20 @@ Normal Buffer Replacement Strategy
 
 There is a "free list" of buffers that are prime candidates for replacement.
 In particular, buffers that are completely free (contain no valid page) are
-always in this list.  We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that.  The list is singly-linked using fields in the
+always in this list.  Allocating pages from this list is much cheaper than
+running the "clock sweep" algorithm, which may encounter many buffers
+that are poor candidates for eviction before finding a good candidate.
+Therefore, we have a background process called bgreclaimer which works
+to keep this list populated.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the freelist_lck, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations.  It works like
-this:
+need to take system-wide locks during common operations.  The background
+reclaimer attempts to keep regular backends from having to run clock sweep
+by maintaining buffers on freelist, however backends are also empowered
+to run clock sweep. Clock sweep works like this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned.  (This requires only the
@@ -178,25 +178,28 @@ buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+victimbuf_lck spinlock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain spinlock freelist_lck.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer and release
+the freelist_lck.  Now set the bgwriter or bgreclaimer latch if required.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
+3. If we get the buffer, check if it is neither pinned nor
+has nonzero usage count, pin the buffer, and return the buffer.
+Otherwise again try to get the buffer from freelist and return
+to the start of step 3.
+
+4. Otherwise, select the buffer pointed to by nextVictimBuffer, and
 circularly advance nextVictimBuffer for next time.
 
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
+5. If the selected buffer is pinned or has a nonzero usage count, it cannot
+be used.  Decrement its usage count (if nonzero) and return to step 4 to
 examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+6. Pin the selected buffer, and return the buffer.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +262,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take the spinlock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
@@ -281,3 +284,19 @@ As of 8.4, background writer starts during recovery mode when there is
 some form of potentially extended recovery to perform. It performs an
 identical service to normal processing, except that checkpoints it
 writes are technically restartpoints.
+
+
+Background Reclaimer's Processing
+---------------------------------
+
+The background reclaimer is designed to move buffers to freelist that are
+likely to be recycled soon, thereby offloading the need to perform
+clock sweep work from active backends.  To do this, it runs the clock sweep
+and move the the unpinned and zero usage count buffers to freelist.  It
+keeps on doing this until the number of buffers in freelist reaches the
+high water mark.
+
+Two water mark indicators are used to maintain sufficient number of buffers
+on freelist.  Low water mark indicator is used by backends to wake bgreclaimer
+when the number of buffers in freelist falls below it.  High water mark
+indicator is used by bgreclaimer to move buffers to freelist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3240432..7df657c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -889,15 +889,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		bool		lock_held;
-
 		/*
 		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!  Also (in most cases) the BufFreelistLock is
-		 * still held, since it would be bad to hold the spinlock while
-		 * possibly waking up other processes.
+		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &lock_held);
+		buf = StrategyGetBuffer(strategy);
 
 		Assert(buf->refcount == 0);
 
@@ -907,10 +903,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		/* Pin the buffer and then release the buffer spinlock */
 		PinBuffer_Locked(buf);
 
-		/* Now it's safe to release the freelist lock */
-		if (lock_held)
-			LWLockRelease(BufFreelistLock);
-
 		/*
 		 * If the buffer was dirty, try to write it out.  There is a race
 		 * condition here, in that someone might dirty it after we released it
@@ -1933,6 +1925,93 @@ BgBufferSync(void)
 }
 
 /*
+ * Move buffers with reference and usage_count as zero to freelist.
+ * By maintaining enough number of buffers on freelist (equal to
+ * high water mark of freelist), we drastically reduce the odds for
+ * backend's to perform clock sweep.
+ *
+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */
+void
+BgMoveBuffersToFreelist(void)
+{
+	volatile uint32	next_to_clean;
+	uint32	tmp_num_to_free;
+	uint32	num_to_free = 0;
+	uint32  tmp_next_to_clean;
+	volatile BufferDesc *bufHdr;
+	uint32	tmp_recent_alloc;
+	uint32	recent_alloc = 0;
+	uint32  tmp_recent_backend_clocksweep;
+	uint32  recent_backend_clocksweep = 0;
+	
+
+	/* Execute the clock sweep */
+	for (;;)
+	{
+		StrategySyncStartAndEnd(&tmp_next_to_clean,
+								&tmp_num_to_free,
+								&tmp_recent_alloc,
+								&tmp_recent_backend_clocksweep);
+
+		num_to_free += tmp_num_to_free;
+		recent_alloc += tmp_recent_alloc;
+		recent_backend_clocksweep += tmp_recent_backend_clocksweep;
+		next_to_clean = tmp_next_to_clean;
+
+		if (tmp_num_to_free == 0)
+			break;
+
+		while (tmp_num_to_free > 0)
+		{
+			bufHdr = &BufferDescriptors[next_to_clean];
+
+			LockBufHdr(bufHdr);
+
+			if (bufHdr->refcount == 0)
+			{
+				if (bufHdr->usage_count > 0)
+				{
+					/*
+					 * Reduce usage count so that we can find the reusable
+					 * buffers in later cycles.
+					 */
+					bufHdr->usage_count--;
+					UnlockBufHdr(bufHdr);
+				}
+				else
+				{
+					UnlockBufHdr(bufHdr);
+					if (StrategyMoveBufferToFreeListEnd(bufHdr))
+						tmp_num_to_free--;
+				}
+			}
+			else
+				UnlockBufHdr(bufHdr);
+
+			/*
+			 * Choose next victim buffer to look if that can be moved
+			 * to freelist.
+			 */
+			StrategySyncNextVictimBuffer(&next_to_clean);
+		}
+	}
+
+	/*
+	 * Report buffer alloc and buffer request not satisfied
+	 * from freelist counts to pgstat.
+	 */
+	BgWriterStats.m_buf_alloc += recent_alloc;
+	BgWriterStats.m_buf_backend_clocksweep += recent_backend_clocksweep;
+
+#ifdef BGW_DEBUG
+	elog(DEBUG1, "bgreclaimer: recent_alloc=%u recent_backend_clocksweep =%d next_to_clean=%d num_freed=%u",
+		 recent_alloc, recent_backend_clocksweep, next_to_clean, num_to_free);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..c4ee126 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -37,19 +38,48 @@ typedef struct
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
-	 * overflow during a single bgwriter cycle.
+	 * overflow during a single bgwriter cycle.  completePasses is only
+	 * recorded by bgwriter, numBufferBackendClocksweep is only recorded
+	 * by bgreclaimer, however numBufferAllocs is recorded by both bgwriter
+	 * and bgreclaimer.
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
+	/* Buffers not statistied from freelist since last reset */
+	uint32		numBufferBackendClocksweep;
+
+	/*
+	 * protects freelist and related variables (firstFreeBuffer,
+	 * lastFreeBuffer, numBufferAllocs, numBufferBackendClocksweep,
+	 * numFreeListBuffers, BufferDesc->freeNext).
+	 */
+	slock_t	     freelist_lck;
+
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * Protects nextVictimBuffer and completePasses. We need separate
+	 * lock to protect victim buffer and completePasses so that
+	 * clock sweep of one backend doesn't contend with another backend
+	 * which is evicting buffer from freelist.  We can consider having
+	 * victimbuf_lck and freelist_lck in separate cache lines by keeping
+	 * them apart in structure and by adding padding bytes, however at
+	 * the moment there is no proof that having them in same cache line
+	 * hits the performance in any scenario.
+	 */
+	slock_t	     victimbuf_lck;
+
+	/*
+	 * Latch to wake bgwriter.
 	 */
 	Latch	   *bgwriterLatch;
+	/*
+	 * Latch to wake bgreclaimer.
+	 */
+	Latch	   *bgreclaimerLatch;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -84,6 +114,19 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[1];		/* VARIABLE SIZE ARRAY */
 }	BufferAccessStrategyData;
 
+/*
+ * Water mark indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakesup and begin
+ * adding buffer's to freelist until it reaches high water mark and then
+ * again goes back to sleep.
+ */
+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/* Percentage indicators for maintaining buffers on freelist */
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.2
 
 /* Prototypes for internal functions */
 static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
@@ -101,67 +144,51 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
  *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
  *
  *	To ensure that no one else can pin the buffer before we do, we must
- *	return the buffer with the buffer header spinlock still held.  If
- *	*lock_held is set on exit, we have returned with the BufFreelistLock
- *	still held, as well; the caller must release that lock once the spinlock
- *	is dropped.  We do it that way because releasing the BufFreelistLock
- *	might awaken other processes, and it would be bad to do the associated
- *	kernel calls while holding the buffer header spinlock.
+ *	return the buffer with the buffer header spinlock still held.
  */
 volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
 {
-	volatile BufferDesc *buf;
+	volatile BufferDesc *buf = NULL;
 	Latch	   *bgwriterLatch;
+	Latch	   *bgreclaimerLatch;
+	int			numFreeListBuffers;
 	int			trycounter;
 
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need the BufFreelistLock.
+	 * assume strategy objects don't need the freelist_lck.
 	 */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy);
 		if (buf != NULL)
-		{
-			*lock_held = false;
 			return buf;
-		}
 	}
 
 	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
+	 * We count buffer allocation requests so that the bgwriter or bgreclaimer
+	 * can know the rate of buffer consumption and report it as stats.  Note
+	 * that buffers recycled by a strategy object are intentionally not counted
+	 * here.
 	 */
 	StrategyControl->numBufferAllocs++;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * Remember the values of bgwriter and bgreclaimer latch so that they can
+	 * be set outside spin lock and try to get a buffer from the freelist.
 	 */
+	bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
 	bgwriterLatch = StrategyControl->bgwriterLatch;
 	if (bgwriterLatch)
-	{
 		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
 
-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+	if (StrategyControl->firstFreeBuffer >= 0)
 	{
 		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
 		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
@@ -169,28 +196,86 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		/* Unconditionally remove buffer from freelist */
 		StrategyControl->firstFreeBuffer = buf->freeNext;
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		--StrategyControl->numFreeListBuffers;
+	}
+	else
+		StrategyControl->numBufferBackendClocksweep++;
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	/*
+	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+	 * not do so while holding freelist_lck; so set it after releasing the
+	 * freelist_lck.  This is annoyingly tedious, but it happens at most once
+	 * per bgwriter cycle, so the performance hit is minimal.
+	 */
+	if (bgwriterLatch)
+		SetLatch(bgwriterLatch);
+
+	/*
+	 * Ideally numFreeListBuffers should get called under freelist spinlock,
+	 * however here we need this number for estimating approximate number of
+	 * free buffers required on freelist, so it should not be a problem, even
+	 * if numFreeListBuffers is not exact.  bgreclaimerLatch is initialized in
+	 * early phase of BgReclaimer startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
+	 */
+	if (numFreeListBuffers < freelistLowWaterMark  && bgreclaimerLatch)
+		SetLatch(StrategyControl->bgreclaimerLatch);
 
+	if (buf != NULL)
+	{
 		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
+		 * Try to get a buffer from the freelist.  Note that the freeNext fields
+		 * are considered to be protected by the freelist_lck not the
+		 * individual buffer spinlocks, so it's OK to manipulate them without
+		 * holding the buffer spinlock.
 		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+		for(;;)
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+			SpinLockAcquire(&StrategyControl->freelist_lck);
+
+			if (StrategyControl->firstFreeBuffer >= 0)
+			{
+				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+				Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+				/* Unconditionally remove buffer from freelist */
+				StrategyControl->firstFreeBuffer = buf->freeNext;
+				buf->freeNext = FREENEXT_NOT_IN_LIST;
+				--StrategyControl->numFreeListBuffers;
+
+				SpinLockRelease(&StrategyControl->freelist_lck);
+			}
+			else
+			{
+				StrategyControl->numBufferBackendClocksweep++;
+				SpinLockRelease(&StrategyControl->freelist_lck);
+				break;
+			}
 		}
-		UnlockBufHdr(buf);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
 	for (;;)
 	{
+		SpinLockAcquire(&StrategyControl->victimbuf_lck);
+
 		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
 
 		if (++StrategyControl->nextVictimBuffer >= NBuffers)
@@ -199,6 +284,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 			StrategyControl->completePasses++;
 		}
 
+		SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; decrement the usage_count (unless pinned) and keep scanning.
@@ -241,7 +328,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,12 +340,51 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
 /*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
+}
+
+
+/*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
  * The result is the buffer index of the best buffer to sync first.
@@ -274,20 +400,79 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
 	result = StrategyControl->nextVictimBuffer;
+
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 	if (num_buf_alloc)
 	{
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
+		SpinLockRelease(&StrategyControl->freelist_lck);
 	}
-	LWLockRelease(BufFreelistLock);
 	return result;
 }
 
 /*
+ * StrategySyncStartAndEnd -- tell bgreclaimer where to start looking
+ * for unused buffers.
+ *
+ * The result is the buffer index of the best buffer to start looking for
+ * unused buffers, number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs and buffer allocs not
+ * satisfied from freelist.
+ */
+void
+StrategySyncStartAndEnd(uint32 *start, uint32 *end, uint32 *num_buf_alloc,
+						uint32 *num_buf_backend_clocksweep)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	*start = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighWaterMark)
+		*end = freelistHighWaterMark - curfreebuffers;
+	else
+		*end = 0;
+
+	if (num_buf_alloc)
+	{
+		*num_buf_alloc = StrategyControl->numBufferAllocs;
+		StrategyControl->numBufferAllocs = 0;
+	}
+	if (num_buf_backend_clocksweep)
+	{
+		*num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
+		StrategyControl->numBufferBackendClocksweep = 0;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell bgreclaimer where to start looking
+ * for next unused buffer.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	if (++StrategyControl->nextVictimBuffer >= NBuffers)
+		StrategyControl->nextVictimBuffer = 0;
+	*next_victim_buffer = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -299,15 +484,27 @@ void
 StrategyNotifyBgWriter(Latch *bgwriterLatch)
 {
 	/*
-	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * We acquire the freelist_lck just to ensure that the store appears
 	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
 	 * infrequently, so there's no performance penalty from being safe.
 	 */
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 	StrategyControl->bgwriterLatch = bgwriterLatch;
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
+/*
+ * StrategyInitBgReclaimerLatch -- Initialize bgreclaimer latch.
+ * This will be used by bgreclaimer to wake itself when backend
+ * sets this latch.
+ */
+void
+StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch)
+{
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	StrategyControl->bgreclaimerLatch = bgreclaimerLatch;
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
 
 /*
  * StrategyShmemSize
@@ -376,6 +573,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -383,12 +581,35 @@ StrategyInitialize(bool init)
 		/* Clear statistics */
 		StrategyControl->completePasses = 0;
 		StrategyControl->numBufferAllocs = 0;
+		StrategyControl->numBufferBackendClocksweep = 0;
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->bgreclaimerLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
+		SpinLockInit(&StrategyControl->victimbuf_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high water mark number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.  We need to maintain enough buffers so that
+	 * requests can be satisfied from freelist, if based on water mark
+	 * calculation count of buffers on freelist goes beyond 2000 or
+	 * lesser than 5, then we set it to hard coded values.  These numbers
+	 * are based on results of benchmarks at various workloads.
+	 */
+	freelistHighWaterMark = HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT * NBuffers;
+	if (freelistHighWaterMark < 5)
+		freelistHighWaterMark = 5;
+	else if (freelistHighWaterMark > 2000)
+		freelistHighWaterMark = 2000;
+
+	freelistLowWaterMark = LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT *
+						   freelistHighWaterMark;
 }
 
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 44ccd37..00d815f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -100,6 +100,7 @@ extern Datum pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS);
 
 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_xact_tuples_returned(PG_FUNCTION_ARGS);
@@ -1496,6 +1497,12 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 }
 
 Datum
+pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_backend_clocksweep);
+}
+
+Datum
 pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5176ed0..1265bb1 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3063 ( pg_stat_get_buf_fsync_backend PGNSP PGUID 12 1 0 0 0 f
 DESCR("statistics: number of backend buffer writes that did their own fsync");
 DATA(insert OID = 2859 ( pg_stat_get_buf_alloc			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_alloc _null_ _null_ _null_ ));
 DESCR("statistics: number of buffer allocations");
+DATA(insert OID = 3218 ( pg_stat_get_buf_clocksweep_backend			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_clocksweep_backend _null_ _null_ _null_ ));
+DESCR("statistics: number of buffer allocations not satisfied from freelsit");
 
 DATA(insert OID = 2978 (  pg_stat_get_function_calls		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_function_calls _null_ _null_ _null_ ));
 DESCR("statistics: number of function calls");
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 3807955..8e58fb4 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -366,6 +366,7 @@ typedef enum
 	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
+	BgReclaimerProcess,
 
 	NUM_AUXPROCTYPES			/* Must be last! */
 } AuxProcType;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0892533..51a2023 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -397,6 +397,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_backend;
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
+	PgStat_Counter m_buf_backend_clocksweep;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -545,7 +546,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9D
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -670,6 +671,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	PgStat_Counter buf_backend_clocksweep;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/postmaster/bgreclaimer.h b/src/include/postmaster/bgreclaimer.h
new file mode 100644
index 0000000..bbd6943
--- /dev/null
+++ b/src/include/postmaster/bgreclaimer.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.h
+ *	  POSTGRES buffer reclaimer definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgreclaimer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGRECLAIMER_H
+#define _BGRECLAIMER_H
+
+extern void BackgroundReclaimerMain(void) __attribute__((noreturn));
+
+
+#endif   /* _BGRECLAIMER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..5c30ec7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -115,9 +115,8 @@ typedef struct buftag
  * Note: buf_hdr_lock must be held to examine or change the tag, flags,
  * usage_count, refcount, or wait_backend_pid fields.  buf_id field never
  * changes after initialization, so does not need locking.  freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock.  The LWLocks can take
- * care of themselves.  The buf_hdr_lock is *not* used to control access to
- * the data in the buffer!
+ * protected by the freelist_lck not buf_hdr_lock.  The buf_hdr_lock is
+ * *not* used to control access to the data in the buffer!
  *
  * An exception is that if we have the buffer pinned, its tag can't change
  * underneath us, so we can examine the tag without locking the spinlock.
@@ -185,14 +184,19 @@ extern BufferDesc *LocalBufferDescriptors;
  */
 
 /* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-				  bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void  StrategySyncStartAndEnd(uint32 *start, uint32 *end,
+									uint32 *num_buf_alloc,
+									uint32 *num_buf_backend_clocksweep);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgReclaimerLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 42d9120..da4f837 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -200,6 +200,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..754a838 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -89,7 +89,6 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  * if you remove a lock, consider leaving a gap in the numbering sequence for
  * the benefit of DTrace and other external debugging scripts.
  */
-#define BufFreelistLock				(&MainLWLockArray[0].lock)
 #define ShmemIndexLock				(&MainLWLockArray[1].lock)
 #define OidGenLock					(&MainLWLockArray[2].lock)
 #define XidGenLock					(&MainLWLockArray[3].lock)
@@ -136,7 +135,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..b0688a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, Background reclaimer, checkpointer and WAL writer run
+ * during normal operation.  Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ca56b47..24ac85a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1671,6 +1671,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
     pg_stat_get_buf_written_backend() AS buffers_backend,
     pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
     pg_stat_get_buf_alloc() AS buffers_alloc,
+	pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 pg_stat_database| SELECT d.oid AS datid,
     d.datname,

perf_read_scalability_data_v6.odsapplication/vnd.oasis.opendocument.spreadsheet; name=perf_read_scalability_data_v6.odsDownload

#39

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Mark Kirkwood (#37)

Re: Scaling shared buffer eviction

On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
wrote:

On 04/09/14 14:42, Amit Kapila wrote:

On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood <

mark.kirkwood@catalyst.net.nz>

wrote:

Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Last time I ran the tpc-b test of pgbench (results of which are
posted earlier in this thread), there doesn't seem to be any major
gain for that, however for cases where read is predominant, you
might see better gains.

I am again planing to take that data in next few days.

FWIW below are some test results on the 60 core beast with this patch

applied to 9.4. I'll need to do more runs to iron out the variation,

but it looks like the patch helps the standard (write heavy) pgbench

workload a little, and clearly helps the read only case.

Thanks for doing the test. I think if possible you can take
the performance data with higher scale factor (4000) as it
seems your m/c has 1TB of RAM. You might want to use
latest patch I have posted today.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#40

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#38)

1 attachment(s)

Re: Scaling shared buffer eviction

On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Apart from above, I think for this patch, cat version bump is required
as I have modified system catalog. However I have not done the
same in patch as otherwise it will be bit difficult to take performance
data.

One regression failed on linux due to spacing issue which is
fixed in attached patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v7.patchapplication/octet-stream; name=scalable_buffer_eviction_v7.patchDownload

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5e9e735..48d642a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -815,6 +815,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Number of buffers allocated</entry>
      </row>
      <row>
+      <entry><structfield>buffers_backend_clocksweep</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of buffer allocations that are not satisfied from
+      freelist</entry>
+     </row>
+     <row>
       <entry><structfield>stats_reset</></entry>
       <entry><type>timestamp with time zone</type></entry>
       <entry>Time at which these statistics were last reset</entry>
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..38698b0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "postmaster/bgreclaimer.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -179,7 +180,8 @@ static IndexList *ILHead = NULL;
  *	 AuxiliaryProcessMain
  *
  *	 The main entry point for auxiliary processes, such as the bgwriter,
- *	 walwriter, walreceiver, bootstrapper and the shared memory checker code.
+ *	 walwriter, walreceiver, bgreclaimer, bootstrapper and the shared
+ *	 memory checker code.
  *
  *	 This code is here just because of historical reasons.
  */
@@ -323,6 +325,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case WalReceiverProcess:
 				statmsg = "wal receiver process";
 				break;
+			case BgReclaimerProcess:
+				statmsg = "reclaimer process";
+				break;
 			default:
 				statmsg = "??? process";
 				break;
@@ -437,6 +442,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			WalReceiverMain();
 			proc_exit(1);		/* should never return */
 
+		case BgReclaimerProcess:
+			/* don't set signals, bgreclaimer has its own agenda */
+			BackgroundReclaimerMain();
+			proc_exit(1);		/* should never return */
+
 		default:
 			elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
 			proc_exit(1);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1bde175..f4717c6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -712,6 +712,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_written_backend() AS buffers_backend,
         pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
         pg_stat_get_buf_alloc() AS buffers_alloc,
+		pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..168d0d8 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o bgreclaimer.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..3df2337
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c
@@ -0,0 +1,302 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they would only do when they don't find a usable shared buffer from
+ * freelist to read in another page).  In the best scenario all requests
+ * for shared buffers will be fulfilled from freelist as the background
+ * reclaimer process always tries to maintain buffers on freelist.  However,
+ * regular backends are still empowered to run clock sweep to find a usable
+ * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
+ *
+ * The bgwriter is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGTERM, which instructs the bgreclaimer to exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the bgreclaimer will
+ * simply abort and exit on SIGQUIT.
+ *
+ * If the bgreclaimer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/bgreclaimer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgreclaimer.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+
+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for bgreclaim process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+BackgroundReclaimerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext bgreclaim_context;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.  (bgreclaim probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us.
+	 *
+	 * bgreclaim doesn't participate in ProcSignal signalling, but a SIGUSR1
+	 * handler is still needed for latch wakeups.
+	 */
+	pqsignal(SIGHUP, BgreclaimSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, ReqShutdownHandler);		/* shutdown */
+	pqsignal(SIGQUIT, bgreclaim_quickdie);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, bgreclaim_sigusr1_handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  As of now, the memory allocation can be done
+	 * only during processing of SIGHUP signal.
+	 */
+	bgreclaim_context = AllocSetContextCreate(TopMemoryContext,
+											 "Background Reclaim",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(bgreclaim_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().  We don't have very many resources to worry
+		 * about in bgreclaim, but we do have buffers and file descriptors.
+		 */
+		UnlockBuffers();
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(bgreclaim_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	StrategyInitBgReclaimerLatch(&MyProc->procLatch);
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		int			rc;
+
+		/* Clear any already-pending wakeups */
+		ResetLatch(&MyProc->procLatch);
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Normal exit from the bgwriter is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Backend will signal bgreclaimer when the number of buffers in
+		 * freelist falls below than low water mark of freelist.
+		 */
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+					   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgMoveBuffersToFreelist();
+
+		/*
+		 * Send off activity statistics to the stats collector
+		 */
+		pgstat_send_bgwriter();
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			exit(1);
+	}
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * bgreclaim_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+bgreclaim_quickdie(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+BgreclaimSigHupHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_SIGHUP = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGTERM: set flag to shutdown and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	shutdown_requested = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+bgreclaim_sigusr1_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	latch_sigusr1_handler();
+
+	errno = save_errno;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7f41a5..7475e5a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5021,6 +5021,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.buf_backend_clocksweep += msg->m_buf_backend_clocksweep;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 14535c8..565cf4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -143,13 +143,13 @@
  * authorization phase).  This is used mainly to keep track of how many
  * children we have and send them appropriate signals when necessary.
  *
- * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list.  Autovacuum worker and walsender are in it.
- * Also, "dead_end" children are in it: these are children launched just for
- * the purpose of sending a friendly rejection message to a would-be client.
- * We must track them because they are attached to shared memory, but we know
- * they will never become live backends.  dead_end children are not assigned a
- * PMChildSlot.
+ * "Special" children such as the startup, bgwriter, bgreclaimer and
+ * autovacuum launcher tasks are not in this list.  Autovacuum worker and
+ * walsender are in it.  Also, "dead_end" children are in it: these are
+ * children launched just for the purpose of sending a friendly rejection
+ * message to a would-be client.  We must track them because they are attached
+ * to shared memory, but we know they will never become live backends.
+ * dead_end children are not assigned a PMChildSlot.
  *
  * Background workers that request shared memory access during registration are
  * in this list, too.
@@ -243,7 +243,8 @@ static pid_t StartupPID = 0,
 			AutoVacPID = 0,
 			PgArchPID = 0,
 			PgStatPID = 0,
-			SysLoggerPID = 0;
+			SysLoggerPID = 0,
+			BgReclaimerPID = 0;
 
 /* Startup/shutdown state */
 #define			NoShutdown		0
@@ -269,13 +270,13 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * hot standby during archive recovery.
  *
  * When the startup process is ready to start archive recovery, it signals the
- * postmaster, and we switch to PM_RECOVERY state. The background writer and
- * checkpointer are launched, while the startup process continues applying WAL.
- * If Hot Standby is enabled, then, after reaching a consistent point in WAL
- * redo, startup process signals us again, and we switch to PM_HOT_STANDBY
- * state and begin accepting connections to perform read-only queries.  When
- * archive recovery is finished, the startup process exits with exit code 0
- * and we switch to PM_RUN state.
+ * postmaster, and we switch to PM_RECOVERY state. The background writer,
+ * background reclaimer and checkpointer are launched, while the startup
+ * process continues applying WAL.  If Hot Standby is enabled, then, after
+ * reaching a consistent point in WAL redo, startup process signals us again,
+ * and we switch to PM_HOT_STANDBY state and begin accepting connections to
+ * perform read-only queries.  When archive recovery is finished, the startup
+ * process exits with exit code 0 and we switch to PM_RUN state.
  *
  * Normal child backends can only be launched when we are in PM_RUN or
  * PM_HOT_STANDBY state.  (We also allow launch of normal
@@ -505,6 +506,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
+#define StartBackgroundReclaimer() StartChildProcess(BgReclaimerProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -568,8 +570,8 @@ PostmasterMain(int argc, char *argv[])
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-	 * postmaster/syslogger.c, postmaster/bgworker.c and
-	 * postmaster/checkpointer.c.
+	 * postmaster/syslogger.c, postmaster/bgworker.c, postmaster/bgreclaimer.c
+	 * and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1583,7 +1585,8 @@ ServerLoop(void)
 		/*
 		 * If no background writer process is running, and we are not in a
 		 * state that prevents it, start one.  It doesn't matter if this
-		 * fails, we'll just try again later.  Likewise for the checkpointer.
+		 * fails, we'll just try again later.  Likewise for the checkpointer
+		 * and bgreclaimer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 			pmState == PM_HOT_STANDBY)
@@ -1592,6 +1595,8 @@ ServerLoop(void)
 				CheckpointerPID = StartCheckpointer();
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 		}
 
 		/*
@@ -2330,6 +2335,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(SysLoggerPID, SIGHUP);
 		if (PgStatPID != 0)
 			signal_child(PgStatPID, SIGHUP);
+		if (BgReclaimerPID != 0)
+			signal_child(BgReclaimerPID, SIGHUP);
 
 		/* Reload authentication config files too */
 		if (!load_hba())
@@ -2398,6 +2405,9 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
+				/* and the bgreclaimer too */
+				if (BgReclaimerPID != 0)
+					signal_child(BgReclaimerPID, SIGTERM);
 
 				/*
 				 * If we're in recovery, we can't kill the startup process
@@ -2440,14 +2450,16 @@ pmdie(SIGNAL_ARGS)
 				signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
+			if (BgReclaimerPID != 0)
+				signal_child(BgReclaimerPID, SIGTERM);
 			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
 				/*
-				 * Only startup, bgwriter, walreceiver, unconnected bgworkers,
-				 * and/or checkpointer should be active in this state; we just
-				 * signaled the first four, and we don't want to kill
-				 * checkpointer yet.
+				 * Only startup, bgwriter, walreceiver, bgreclaimer,
+				 * unconnected bgworkers, and/or checkpointer should be
+				 * active in this state; we just signaled the first five,
+				 * and we don't want to kill checkpointer yet.
 				 */
 				pmState = PM_WAIT_BACKENDS;
 			}
@@ -2600,6 +2612,8 @@ reaper(SIGNAL_ARGS)
 				BgWriterPID = StartBackgroundWriter();
 			if (WalWriterPID == 0)
 				WalWriterPID = StartWalWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2625,7 +2639,8 @@ reaper(SIGNAL_ARGS)
 		/*
 		 * Was it the bgwriter?  Normal exit can be ignored; we'll start a new
 		 * one at the next iteration of the postmaster's main loop, if
-		 * necessary.  Any other exit condition is treated as a crash.
+		 * necessary.  Any other exit condition is treated as a crash.  Likewise
+		 * for bgreclaimer.
 		 */
 		if (pid == BgWriterPID)
 		{
@@ -2636,6 +2651,17 @@ reaper(SIGNAL_ARGS)
 			continue;
 		}
 
+		if (pid == BgReclaimerPID)
+		{
+			BgReclaimerPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background reclaimer process"));
+			continue;
+		}
+
+
+
 		/*
 		 * Was it the checkpointer?
 		 */
@@ -2997,7 +3023,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, bgreclaimer or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3201,6 +3227,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the bgreclaimer too */
+	if (pid == BgReclaimerPID)
+		BgReclaimerPID = 0;
+	else if (BgReclaimerPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) BgReclaimerPID)));
+		signal_child(BgReclaimerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/*
 	 * Force a power-cycle of the pgarch process too.  (This isn't absolutely
 	 * necessary, but it seems like a good idea for robustness, and it
@@ -3371,14 +3409,14 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
 		 * (including autovac workers), no bgworkers (including unconnected
-		 * ones), and no walwriter, autovac launcher or bgwriter.  If we are
-		 * doing crash recovery or an immediate shutdown then we expect the
-		 * checkpointer to exit as well, otherwise not. The archiver, stats,
-		 * and syslogger processes are disregarded since they are not
-		 * connected to shared memory; we also disregard dead_end children
-		 * here. Walsenders are also disregarded, they will be terminated
-		 * later after writing the checkpoint record, like the archiver
-		 * process.
+		 * ones), and no walwriter, autovac launcher, bgwriter or bgreclaimer.
+		 * If we are doing crash recovery or an immediate shutdown then we
+		 * expect the checkpointer to exit as well, otherwise not. The
+		 * archiver, stats, and syslogger processes are disregarded since they
+		 * are not connected to shared memory; we also disregard dead_end
+		 * children here. Walsenders are also disregarded, they will be
+		 * terminated later after writing the checkpoint record, like the
+		 * archiver process.
 		 */
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
 			CountUnconnectedWorkers() == 0 &&
@@ -3388,7 +3426,8 @@ PostmasterStateMachine(void)
 			(CheckpointerPID == 0 ||
 			 (!FatalError && Shutdown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
-			AutoVacPID == 0)
+			AutoVacPID == 0 &&
+			BgReclaimerPID == 0)
 		{
 			if (Shutdown >= ImmediateShutdown || FatalError)
 			{
@@ -3486,6 +3525,7 @@ PostmasterStateMachine(void)
 			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
+			Assert(BgReclaimerPID == 0);
 			/* syslogger is not considered here */
 			pmState = PM_NO_CHILDREN;
 		}
@@ -3698,6 +3738,8 @@ TerminateChildren(int signal)
 		signal_child(WalReceiverPID, signal);
 	if (AutoVacPID != 0)
 		signal_child(AutoVacPID, signal);
+	if (BgReclaimerPID != 0)
+		signal_child(BgReclaimerPID, signal);
 	if (PgArchPID != 0)
 		signal_child(PgArchPID, signal);
 	if (PgStatPID != 0)
@@ -4778,6 +4820,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		CheckpointerPID = StartCheckpointer();
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(BgReclaimerPID == 0);
+		BgReclaimerPID = StartBackgroundReclaimer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -5122,6 +5166,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 						(errmsg("could not fork WAL receiver process: %m")));
 				break;
+			case BgReclaimerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork background writer process: %m")));
+				break;
 			default:
 				ereport(LOG,
 						(errmsg("could not fork process: %m")));
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..dfb9cb5 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,10 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+* BufferStrategyControl contains a spinlock freelist_lck that provides mutual
+exclusion for operations that access the buffer freelist or select
+buffers for replacement.  It also contains victimbuf_lck that protects
+information related to the current clock sweep condition.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -160,16 +156,20 @@ Normal Buffer Replacement Strategy
 
 There is a "free list" of buffers that are prime candidates for replacement.
 In particular, buffers that are completely free (contain no valid page) are
-always in this list.  We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that.  The list is singly-linked using fields in the
+always in this list.  Allocating pages from this list is much cheaper than
+running the "clock sweep" algorithm, which may encounter many buffers
+that are poor candidates for eviction before finding a good candidate.
+Therefore, we have a background process called bgreclaimer which works
+to keep this list populated.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the freelist_lck, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations.  It works like
-this:
+need to take system-wide locks during common operations.  The background
+reclaimer attempts to keep regular backends from having to run clock sweep
+by maintaining buffers on freelist, however backends are also empowered
+to run clock sweep. Clock sweep works like this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned.  (This requires only the
@@ -178,25 +178,28 @@ buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+victimbuf_lck spinlock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain spinlock freelist_lck.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer and release
+the freelist_lck.  Now set the bgwriter or bgreclaimer latch if required.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
+3. If we get the buffer, check if it is neither pinned nor
+has nonzero usage count, pin the buffer, and return the buffer.
+Otherwise again try to get the buffer from freelist and return
+to the start of step 3.
+
+4. Otherwise, select the buffer pointed to by nextVictimBuffer, and
 circularly advance nextVictimBuffer for next time.
 
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
+5. If the selected buffer is pinned or has a nonzero usage count, it cannot
+be used.  Decrement its usage count (if nonzero) and return to step 4 to
 examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+6. Pin the selected buffer, and return the buffer.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +262,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take the spinlock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
@@ -281,3 +284,19 @@ As of 8.4, background writer starts during recovery mode when there is
 some form of potentially extended recovery to perform. It performs an
 identical service to normal processing, except that checkpoints it
 writes are technically restartpoints.
+
+
+Background Reclaimer's Processing
+---------------------------------
+
+The background reclaimer is designed to move buffers to freelist that are
+likely to be recycled soon, thereby offloading the need to perform
+clock sweep work from active backends.  To do this, it runs the clock sweep
+and move the the unpinned and zero usage count buffers to freelist.  It
+keeps on doing this until the number of buffers in freelist reaches the
+high water mark.
+
+Two water mark indicators are used to maintain sufficient number of buffers
+on freelist.  Low water mark indicator is used by backends to wake bgreclaimer
+when the number of buffers in freelist falls below it.  High water mark
+indicator is used by bgreclaimer to move buffers to freelist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3240432..d2352ea 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -889,15 +889,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		bool		lock_held;
-
 		/*
 		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!  Also (in most cases) the BufFreelistLock is
-		 * still held, since it would be bad to hold the spinlock while
-		 * possibly waking up other processes.
+		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &lock_held);
+		buf = StrategyGetBuffer(strategy);
 
 		Assert(buf->refcount == 0);
 
@@ -907,10 +903,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		/* Pin the buffer and then release the buffer spinlock */
 		PinBuffer_Locked(buf);
 
-		/* Now it's safe to release the freelist lock */
-		if (lock_held)
-			LWLockRelease(BufFreelistLock);
-
 		/*
 		 * If the buffer was dirty, try to write it out.  There is a race
 		 * condition here, in that someone might dirty it after we released it
@@ -1933,6 +1925,92 @@ BgBufferSync(void)
 }
 
 /*
+ * Move buffers with reference and usage_count as zero to freelist.
+ * By maintaining enough number of buffers on freelist (equal to
+ * high water mark of freelist), we drastically reduce the odds for
+ * backend's to perform clock sweep.
+ *
+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */
+void
+BgMoveBuffersToFreelist(void)
+{
+	volatile uint32	next_to_clean;
+	uint32	tmp_num_to_free;
+	uint32	num_to_free = 0;
+	uint32  tmp_next_to_clean;
+	volatile BufferDesc *bufHdr;
+	uint32	tmp_recent_alloc;
+	uint32	recent_alloc = 0;
+	uint32  tmp_recent_backend_clocksweep;
+	uint32  recent_backend_clocksweep = 0;
+
+	/* Execute the clock sweep */
+	for (;;)
+	{
+		StrategySyncStartAndEnd(&tmp_next_to_clean,
+								&tmp_num_to_free,
+								&tmp_recent_alloc,
+								&tmp_recent_backend_clocksweep);
+
+		num_to_free += tmp_num_to_free;
+		recent_alloc += tmp_recent_alloc;
+		recent_backend_clocksweep += tmp_recent_backend_clocksweep;
+		next_to_clean = tmp_next_to_clean;
+
+		if (tmp_num_to_free == 0)
+			break;
+
+		while (tmp_num_to_free > 0)
+		{
+			bufHdr = &BufferDescriptors[next_to_clean];
+
+			LockBufHdr(bufHdr);
+
+			if (bufHdr->refcount == 0)
+			{
+				if (bufHdr->usage_count > 0)
+				{
+					/*
+					 * Reduce usage count so that we can find the reusable
+					 * buffers in later cycles.
+					 */
+					bufHdr->usage_count--;
+					UnlockBufHdr(bufHdr);
+				}
+				else
+				{
+					UnlockBufHdr(bufHdr);
+					if (StrategyMoveBufferToFreeListEnd(bufHdr))
+						tmp_num_to_free--;
+				}
+			}
+			else
+				UnlockBufHdr(bufHdr);
+
+			/*
+			 * Choose next victim buffer to look if that can be moved
+			 * to freelist.
+			 */
+			StrategySyncNextVictimBuffer(&next_to_clean);
+		}
+	}
+
+	/*
+	 * Report buffer alloc and buffer request not satisfied
+	 * from freelist counts to pgstat.
+	 */
+	BgWriterStats.m_buf_alloc += recent_alloc;
+	BgWriterStats.m_buf_backend_clocksweep += recent_backend_clocksweep;
+
+#ifdef BGW_DEBUG
+	elog(DEBUG1, "bgreclaimer: recent_alloc=%u recent_backend_clocksweep =%d next_to_clean=%d num_freed=%u",
+		 recent_alloc, recent_backend_clocksweep, next_to_clean, num_to_free);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..c4ee126 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -37,19 +38,48 @@ typedef struct
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
-	 * overflow during a single bgwriter cycle.
+	 * overflow during a single bgwriter cycle.  completePasses is only
+	 * recorded by bgwriter, numBufferBackendClocksweep is only recorded
+	 * by bgreclaimer, however numBufferAllocs is recorded by both bgwriter
+	 * and bgreclaimer.
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
+	/* Buffers not statistied from freelist since last reset */
+	uint32		numBufferBackendClocksweep;
+
+	/*
+	 * protects freelist and related variables (firstFreeBuffer,
+	 * lastFreeBuffer, numBufferAllocs, numBufferBackendClocksweep,
+	 * numFreeListBuffers, BufferDesc->freeNext).
+	 */
+	slock_t	     freelist_lck;
+
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * Protects nextVictimBuffer and completePasses. We need separate
+	 * lock to protect victim buffer and completePasses so that
+	 * clock sweep of one backend doesn't contend with another backend
+	 * which is evicting buffer from freelist.  We can consider having
+	 * victimbuf_lck and freelist_lck in separate cache lines by keeping
+	 * them apart in structure and by adding padding bytes, however at
+	 * the moment there is no proof that having them in same cache line
+	 * hits the performance in any scenario.
+	 */
+	slock_t	     victimbuf_lck;
+
+	/*
+	 * Latch to wake bgwriter.
 	 */
 	Latch	   *bgwriterLatch;
+	/*
+	 * Latch to wake bgreclaimer.
+	 */
+	Latch	   *bgreclaimerLatch;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -84,6 +114,19 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[1];		/* VARIABLE SIZE ARRAY */
 }	BufferAccessStrategyData;
 
+/*
+ * Water mark indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakesup and begin
+ * adding buffer's to freelist until it reaches high water mark and then
+ * again goes back to sleep.
+ */
+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/* Percentage indicators for maintaining buffers on freelist */
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.2
 
 /* Prototypes for internal functions */
 static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
@@ -101,67 +144,51 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
  *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
  *
  *	To ensure that no one else can pin the buffer before we do, we must
- *	return the buffer with the buffer header spinlock still held.  If
- *	*lock_held is set on exit, we have returned with the BufFreelistLock
- *	still held, as well; the caller must release that lock once the spinlock
- *	is dropped.  We do it that way because releasing the BufFreelistLock
- *	might awaken other processes, and it would be bad to do the associated
- *	kernel calls while holding the buffer header spinlock.
+ *	return the buffer with the buffer header spinlock still held.
  */
 volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
 {
-	volatile BufferDesc *buf;
+	volatile BufferDesc *buf = NULL;
 	Latch	   *bgwriterLatch;
+	Latch	   *bgreclaimerLatch;
+	int			numFreeListBuffers;
 	int			trycounter;
 
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need the BufFreelistLock.
+	 * assume strategy objects don't need the freelist_lck.
 	 */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy);
 		if (buf != NULL)
-		{
-			*lock_held = false;
 			return buf;
-		}
 	}
 
 	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
+	 * We count buffer allocation requests so that the bgwriter or bgreclaimer
+	 * can know the rate of buffer consumption and report it as stats.  Note
+	 * that buffers recycled by a strategy object are intentionally not counted
+	 * here.
 	 */
 	StrategyControl->numBufferAllocs++;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * Remember the values of bgwriter and bgreclaimer latch so that they can
+	 * be set outside spin lock and try to get a buffer from the freelist.
 	 */
+	bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
 	bgwriterLatch = StrategyControl->bgwriterLatch;
 	if (bgwriterLatch)
-	{
 		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
 
-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+	if (StrategyControl->firstFreeBuffer >= 0)
 	{
 		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
 		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
@@ -169,28 +196,86 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		/* Unconditionally remove buffer from freelist */
 		StrategyControl->firstFreeBuffer = buf->freeNext;
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		--StrategyControl->numFreeListBuffers;
+	}
+	else
+		StrategyControl->numBufferBackendClocksweep++;
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	/*
+	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+	 * not do so while holding freelist_lck; so set it after releasing the
+	 * freelist_lck.  This is annoyingly tedious, but it happens at most once
+	 * per bgwriter cycle, so the performance hit is minimal.
+	 */
+	if (bgwriterLatch)
+		SetLatch(bgwriterLatch);
+
+	/*
+	 * Ideally numFreeListBuffers should get called under freelist spinlock,
+	 * however here we need this number for estimating approximate number of
+	 * free buffers required on freelist, so it should not be a problem, even
+	 * if numFreeListBuffers is not exact.  bgreclaimerLatch is initialized in
+	 * early phase of BgReclaimer startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
+	 */
+	if (numFreeListBuffers < freelistLowWaterMark  && bgreclaimerLatch)
+		SetLatch(StrategyControl->bgreclaimerLatch);
 
+	if (buf != NULL)
+	{
 		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
+		 * Try to get a buffer from the freelist.  Note that the freeNext fields
+		 * are considered to be protected by the freelist_lck not the
+		 * individual buffer spinlocks, so it's OK to manipulate them without
+		 * holding the buffer spinlock.
 		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+		for(;;)
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+			SpinLockAcquire(&StrategyControl->freelist_lck);
+
+			if (StrategyControl->firstFreeBuffer >= 0)
+			{
+				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+				Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+				/* Unconditionally remove buffer from freelist */
+				StrategyControl->firstFreeBuffer = buf->freeNext;
+				buf->freeNext = FREENEXT_NOT_IN_LIST;
+				--StrategyControl->numFreeListBuffers;
+
+				SpinLockRelease(&StrategyControl->freelist_lck);
+			}
+			else
+			{
+				StrategyControl->numBufferBackendClocksweep++;
+				SpinLockRelease(&StrategyControl->freelist_lck);
+				break;
+			}
 		}
-		UnlockBufHdr(buf);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
 	for (;;)
 	{
+		SpinLockAcquire(&StrategyControl->victimbuf_lck);
+
 		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
 
 		if (++StrategyControl->nextVictimBuffer >= NBuffers)
@@ -199,6 +284,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 			StrategyControl->completePasses++;
 		}
 
+		SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; decrement the usage_count (unless pinned) and keep scanning.
@@ -241,7 +328,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,12 +340,51 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
 /*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
+}
+
+
+/*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
  * The result is the buffer index of the best buffer to sync first.
@@ -274,20 +400,79 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
 	result = StrategyControl->nextVictimBuffer;
+
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 	if (num_buf_alloc)
 	{
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
+		SpinLockRelease(&StrategyControl->freelist_lck);
 	}
-	LWLockRelease(BufFreelistLock);
 	return result;
 }
 
 /*
+ * StrategySyncStartAndEnd -- tell bgreclaimer where to start looking
+ * for unused buffers.
+ *
+ * The result is the buffer index of the best buffer to start looking for
+ * unused buffers, number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs and buffer allocs not
+ * satisfied from freelist.
+ */
+void
+StrategySyncStartAndEnd(uint32 *start, uint32 *end, uint32 *num_buf_alloc,
+						uint32 *num_buf_backend_clocksweep)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	*start = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighWaterMark)
+		*end = freelistHighWaterMark - curfreebuffers;
+	else
+		*end = 0;
+
+	if (num_buf_alloc)
+	{
+		*num_buf_alloc = StrategyControl->numBufferAllocs;
+		StrategyControl->numBufferAllocs = 0;
+	}
+	if (num_buf_backend_clocksweep)
+	{
+		*num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
+		StrategyControl->numBufferBackendClocksweep = 0;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell bgreclaimer where to start looking
+ * for next unused buffer.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	if (++StrategyControl->nextVictimBuffer >= NBuffers)
+		StrategyControl->nextVictimBuffer = 0;
+	*next_victim_buffer = StrategyControl->nextVictimBuffer;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -299,15 +484,27 @@ void
 StrategyNotifyBgWriter(Latch *bgwriterLatch)
 {
 	/*
-	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * We acquire the freelist_lck just to ensure that the store appears
 	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
 	 * infrequently, so there's no performance penalty from being safe.
 	 */
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 	StrategyControl->bgwriterLatch = bgwriterLatch;
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
+/*
+ * StrategyInitBgReclaimerLatch -- Initialize bgreclaimer latch.
+ * This will be used by bgreclaimer to wake itself when backend
+ * sets this latch.
+ */
+void
+StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch)
+{
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	StrategyControl->bgreclaimerLatch = bgreclaimerLatch;
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
 
 /*
  * StrategyShmemSize
@@ -376,6 +573,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -383,12 +581,35 @@ StrategyInitialize(bool init)
 		/* Clear statistics */
 		StrategyControl->completePasses = 0;
 		StrategyControl->numBufferAllocs = 0;
+		StrategyControl->numBufferBackendClocksweep = 0;
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->bgreclaimerLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
+		SpinLockInit(&StrategyControl->victimbuf_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high water mark number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.  We need to maintain enough buffers so that
+	 * requests can be satisfied from freelist, if based on water mark
+	 * calculation count of buffers on freelist goes beyond 2000 or
+	 * lesser than 5, then we set it to hard coded values.  These numbers
+	 * are based on results of benchmarks at various workloads.
+	 */
+	freelistHighWaterMark = HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT * NBuffers;
+	if (freelistHighWaterMark < 5)
+		freelistHighWaterMark = 5;
+	else if (freelistHighWaterMark > 2000)
+		freelistHighWaterMark = 2000;
+
+	freelistLowWaterMark = LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT *
+						   freelistHighWaterMark;
 }
 
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 44ccd37..00d815f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -100,6 +100,7 @@ extern Datum pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS);
 
 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_xact_tuples_returned(PG_FUNCTION_ARGS);
@@ -1496,6 +1497,12 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 }
 
 Datum
+pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_backend_clocksweep);
+}
+
+Datum
 pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5176ed0..1265bb1 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3063 ( pg_stat_get_buf_fsync_backend PGNSP PGUID 12 1 0 0 0 f
 DESCR("statistics: number of backend buffer writes that did their own fsync");
 DATA(insert OID = 2859 ( pg_stat_get_buf_alloc			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_alloc _null_ _null_ _null_ ));
 DESCR("statistics: number of buffer allocations");
+DATA(insert OID = 3218 ( pg_stat_get_buf_clocksweep_backend			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_clocksweep_backend _null_ _null_ _null_ ));
+DESCR("statistics: number of buffer allocations not satisfied from freelsit");
 
 DATA(insert OID = 2978 (  pg_stat_get_function_calls		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_function_calls _null_ _null_ _null_ ));
 DESCR("statistics: number of function calls");
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 3807955..8e58fb4 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -366,6 +366,7 @@ typedef enum
 	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
+	BgReclaimerProcess,
 
 	NUM_AUXPROCTYPES			/* Must be last! */
 } AuxProcType;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0892533..51a2023 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -397,6 +397,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_backend;
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
+	PgStat_Counter m_buf_backend_clocksweep;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -545,7 +546,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9D
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -670,6 +671,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	PgStat_Counter buf_backend_clocksweep;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/postmaster/bgreclaimer.h b/src/include/postmaster/bgreclaimer.h
new file mode 100644
index 0000000..bbd6943
--- /dev/null
+++ b/src/include/postmaster/bgreclaimer.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.h
+ *	  POSTGRES buffer reclaimer definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgreclaimer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGRECLAIMER_H
+#define _BGRECLAIMER_H
+
+extern void BackgroundReclaimerMain(void) __attribute__((noreturn));
+
+
+#endif   /* _BGRECLAIMER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..5c30ec7 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -115,9 +115,8 @@ typedef struct buftag
  * Note: buf_hdr_lock must be held to examine or change the tag, flags,
  * usage_count, refcount, or wait_backend_pid fields.  buf_id field never
  * changes after initialization, so does not need locking.  freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock.  The LWLocks can take
- * care of themselves.  The buf_hdr_lock is *not* used to control access to
- * the data in the buffer!
+ * protected by the freelist_lck not buf_hdr_lock.  The buf_hdr_lock is
+ * *not* used to control access to the data in the buffer!
  *
  * An exception is that if we have the buffer pinned, its tag can't change
  * underneath us, so we can examine the tag without locking the spinlock.
@@ -185,14 +184,19 @@ extern BufferDesc *LocalBufferDescriptors;
  */
 
 /* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-				  bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void  StrategySyncStartAndEnd(uint32 *start, uint32 *end,
+									uint32 *num_buf_alloc,
+									uint32 *num_buf_backend_clocksweep);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgReclaimerLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 42d9120..da4f837 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -200,6 +200,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..754a838 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -89,7 +89,6 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  * if you remove a lock, consider leaving a gap in the numbering sequence for
  * the benefit of DTrace and other external debugging scripts.
  */
-#define BufFreelistLock				(&MainLWLockArray[0].lock)
 #define ShmemIndexLock				(&MainLWLockArray[1].lock)
 #define OidGenLock					(&MainLWLockArray[2].lock)
 #define XidGenLock					(&MainLWLockArray[3].lock)
@@ -136,7 +135,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..b0688a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, Background reclaimer, checkpointer and WAL writer run
+ * during normal operation.  Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ca56b47..939075e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1671,6 +1671,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
     pg_stat_get_buf_written_backend() AS buffers_backend,
     pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
     pg_stat_get_buf_alloc() AS buffers_alloc,
+    pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 pg_stat_database| SELECT d.oid AS datid,
     d.datname,

#41

Merlin Moncure

mmoncure@gmail.com

over 11 years ago

In reply to: Amit Kapila (#38)

Re: Scaling shared buffer eviction

On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Client Count/Patch_Ver (tps) 8 16 32 64 128
HEAD 58614 107370 140717 104357 65010
Patch 60092 113564 165014 213848 216065

This data is median of 3 runs, detailed report is attached with mail.
I have not repeated the test for all configurations, as there is no
major change in design/algorithm which can effect performance.
Mark has already taken tpc-b data which ensures that there is
no problem with it, however I will also take it once with latest version.

Well, these numbers are pretty much amazing. Question: It seems
there's obviously quite a bit of contention left; do you think
there's still a significant amount of time in the clock sweep, or is
the primary bottleneck the buffer mapping locks?

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#40)

Re: Scaling shared buffer eviction

On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Apart from above, I think for this patch, cat version bump is required
as I have modified system catalog. However I have not done the
same in patch as otherwise it will be bit difficult to take performance
data.

One regression failed on linux due to spacing issue which is
fixed in attached patch.

I took another read through this patch. Here are some further review comments:

1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
both num_to_free and tmp_num_to_free. I'd get rid of tmp_num_to_free
and move the declaration of num_to_free inside the outer loop. I'd
also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
tmp_recent_backend_clocksweep into the innermost scope in which they
are used.

2. Also in that function, I think the innermost bit of logic could be
rewritten more compactly, and in such a way as to make it clearer for
what set of instructions the buffer header will be locked.
LockBufHdr(bufHdr); if (bufHdr->refcount == 0) { if
(bufHdr->usage_count > 0) bufHdr->usage_count--; else add_to_freelist
= true; } UnlockBufHdr(bufHdr); if (add_to_freelist &&
StrategyMoveBufferToFreeListEnd(bufHdr)) num_to_free--;

3. This comment is now obsolete:

+       /*
+        * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+        * not do so while holding freelist_lck; so set it after releasing the
+        * freelist_lck.  This is annoyingly tedious, but it happens
at most once
+        * per bgwriter cycle, so the performance hit is minimal.
+        */
+

We're not actually holding any lock in need of releasing at that point
in the code, so this can be shortened to "If bgwriterLatch is set, we
need to waken the bgwriter."

* Ideally numFreeListBuffers should get called under freelist spinlock,

That doesn't make any sense. numFreeListBuffers is a variable, so you
can't "call" it. The value should be *read* under the spinlock, but
it is. I think this whole comment can be deleted and replaced with
"If the number of free buffers has fallen below the low water mark,
awaken the bgreclaimer to repopulate it."

4. StrategySyncStartAndEnd() is kind of a mess. One, it can return
the same victim buffer that's being handed out - at almost the same
time - to a backend running the clock sweep; if it does, they'll fight
over the buffer. Two, the *end out parameter actually returns a
count, not an endpoint. I think we should have
BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
top of the inner loop rather than the bottom, and change
StrategySyncStartAndEnd() so that it knows nothing about
victimbuf_lck. Let's also change StrategyGetBuffer() to call
StrategySyncNextVictimBuffer() so that the logic is centralized in one
place, and rename StrategySyncStartAndEnd() to something that better
matches its revised purpose. Maybe StrategyGetReclaimInfo().

5. Have you tested that this new bgwriter statistic is actually
working? Because it looks to me like BgMoveBuffersToFreelist is
changing BgWriterStats but never calling pgstat_send_bgwriter(), which
I'm thinking will mean the counters accumulate forever inside the
reclaimer but never get forwarded to the stats collector.

6. StrategyInitialize() uses #defines for
HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
for clamping. Let's have constants for all of those (and omit mention
of the specific values in the comments).

7. The line you've added to the definition of view pg_stat_bgwriter
doesn't seem to be indented the same way as all the others. Tab vs.
space problem?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43

Thom Brown

thom@linux.com

over 11 years ago

In reply to: Amit Kapila (#40)

Re: Scaling shared buffer eviction

On 5 September 2014 14:19, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Apart from above, I think for this patch, cat version bump is required
as I have modified system catalog. However I have not done the
same in patch as otherwise it will be bit difficult to take performance
data.

One regression failed on linux due to spacing issue which is
fixed in attached patch.

Here's a set of test results against this patch:

8-core AMD FX-9590, 32GB RAM

Config:
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB

pgbench scale factor = 3000
run duration time = 5min

HEAD
Client Count/No. Of Runs (tps) 8 16 32 64 128
Run-1 31568 41890 48638 49266 41845
Run-2 31613 41879 48597 49332 41736
Run-3 31674 41987 48647 49320 41745

Patch=scalable_buffer_eviction_v7.patch
Client Count/No. Of Runs (tps) 8 16 32 64 128
Run-1 31880 42943 49359 49901 43779
Run-2 32150 43078 48934 49828 43769
Run-3 32323 42894 49481 49600 43529

--
Thom

#44

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Merlin Moncure (#41)

Re: Scaling shared buffer eviction

On Mon, Sep 8, 2014 at 10:12 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Fri, Sep 5, 2014 at 6:47 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Client Count/Patch_Ver (tps) 8 16 32 64 128
HEAD 58614 107370 140717 104357 65010
Patch 60092 113564 165014 213848 216065

This data is median of 3 runs, detailed report is attached with mail.
I have not repeated the test for all configurations, as there is no
major change in design/algorithm which can effect performance.
Mark has already taken tpc-b data which ensures that there is
no problem with it, however I will also take it once with latest

version.

Well, these numbers are pretty much amazing. Question: It seems
there's obviously quite a bit of contention left; do you think
there's still a significant amount of time in the clock sweep, or is
the primary bottleneck the buffer mapping locks?

I think contention around clock sweep is very minimal and for buffer
mapping locks it has reduced significantly (you can refer upthread
some of LWLOCK stat data, I have posted after this patch), but
might be we can get more out of it by improving hash table
concurrency. However apart from both of the above, the next thing
I have seen in profiles was palloc (at least that is what I remember
when I had done some profiling few months back during the
development of this patch). It seems to me at that time a totally
different optimisation, so I left it for another patch.
Another point is that the m/c on which I am doing performance
test has 16 cores (64 hardware threads), so not sure how much
scaling we can expect.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#45

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Thom Brown (#43)

Re: Scaling shared buffer eviction

On Tue, Sep 9, 2014 at 3:11 AM, Thom Brown <thom@linux.com> wrote:

On 5 September 2014 14:19, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Apart from above, I think for this patch, cat version bump is required
as I have modified system catalog. However I have not done the
same in patch as otherwise it will be bit difficult to take performance
data.

One regression failed on linux due to spacing issue which is
fixed in attached patch.

Here's a set of test results against this patch:

Many thanks for taking the performance data. This data clearly shows
that there is a performance improvement at even lower configuration,
however the real benefit of the patch can be seen with higher core
m/c and with larger RAM (can contain all the data).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#46

Mark Kirkwood

mark.kirkwood@catalyst.net.nz

over 11 years ago

In reply to: Amit Kapila (#39)

Re: Scaling shared buffer eviction

On 05/09/14 23:50, Amit Kapila wrote:

On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
wrote:

On 04/09/14 14:42, Amit Kapila wrote:

On Thu, Sep 4, 2014 at 8:00 AM, Mark Kirkwood

<mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>

wrote:

Hi Amit,

Results look pretty good. Does it help in the read-write case too?

Last time I ran the tpc-b test of pgbench (results of which are
posted earlier in this thread), there doesn't seem to be any major
gain for that, however for cases where read is predominant, you
might see better gains.

I am again planing to take that data in next few days.

FWIW below are some test results on the 60 core beast with this patch

applied to 9.4. I'll need to do more runs to iron out the variation,

but it looks like the patch helps the standard (write heavy) pgbench

workload a little, and clearly helps the read only case.

Thanks for doing the test. I think if possible you can take
the performance data with higher scale factor (4000) as it
seems your m/c has 1TB of RAM. You might want to use
latest patch I have posted today.

Here's some fairly typical data from read-write and read-only runs at
scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not
seeing much variation between repeated read-write runs with the same
config (which is nice - sleep 30 and explicit checkpoint call between
each one seem to help there).

Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be
better at higher client counts than beta1 was...

In terms of the effect of the patch - looks pretty similar to the scale
2000 results for read-write, but read-only is a different and more
interesting story - unpatched 9.4 is noticeably impacted in the higher
client counts, whereas the patched version scales as well (or even
better perhaps) than in the scale 2000 case.

read write (600s)

Clients | tps | tps (unpatched)
---------+--------+----------------
6 | 9395 | 9334
12 | 16605 | 16525
24 | 24634 | 24910
48 | 32170 | 31275
96 | 35675 | 36533
192 | 35579 | 31137
384 | 30528 | 28308

read only (300s)

Clients | tps | tps (unpatched)
---------+--------+----------------
6 | 35743 | 35362
12 | 111019 | 106579
24 | 199746 | 160305
48 | 327026 | 198407
96 | 379184 | 171863
192 | 356623 | 152224
384 | 340878 | 128308

regards

Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47

Michael Paquier

michael.paquier@gmail.com

over 11 years ago

In reply to: Robert Haas (#42)

Re: Scaling shared buffer eviction

On Tue, Sep 9, 2014 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

One regression failed on linux due to spacing issue which is
fixed in attached patch.

I just read the latest patch by curiosity, wouldn't it make more sense
to split the patch into two different patches for clarity: one for the
reclaimer worker centralized around BgMoveBuffersToFreelist and a
second for the pg_stat portion? Those seem two different features.
Regards,
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#48

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#42)

1 attachment(s)

Re: Scaling shared buffer eviction

On Tue, Sep 9, 2014 at 12:16 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 5, 2014 at 9:19 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Fri, Sep 5, 2014 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Apart from above, I think for this patch, cat version bump is required
as I have modified system catalog. However I have not done the
same in patch as otherwise it will be bit difficult to take performance
data.

One regression failed on linux due to spacing issue which is
fixed in attached patch.

I took another read through this patch. Here are some further review

comments:

1. In BgMoveBuffersToFreelist(), it seems quite superfluous to have
both num_to_free and tmp_num_to_free.

num_to_free is used to accumulate total number of buffers that are
freed in one cycle of BgMoveBuffersToFreelist() which is reported
for debug info (BGW_DEBUG) and tmp_num_to_free is used as a temporary
number which is a count of number of buffers to be freed in one sub-cycle
(inner while loop)

I'd get rid of tmp_num_to_free
and move the declaration of num_to_free inside the outer loop. I'd
also move the definitions of tmp_next_to_clean, tmp_recent_alloc,
tmp_recent_backend_clocksweep into the innermost scope in which they
are used.

okay, I have moved the tmp_* variables in innermost scope.

2. Also in that function, I think the innermost bit of logic could be
rewritten more compactly, and in such a way as to make it clearer for
what set of instructions the buffer header will be locked.
LockBufHdr(bufHdr); if (bufHdr->refcount == 0) { if
(bufHdr->usage_count > 0) bufHdr->usage_count--; else add_to_freelist
= true; } UnlockBufHdr(bufHdr); if (add_to_freelist &&
StrategyMoveBufferToFreeListEnd(bufHdr)) num_to_free--;

Changed as per suggestion.

3. This comment is now obsolete:

+       /*
+        * If bgwriterLatch is set, we need to waken the bgwriter, but we

should

+ * not do so while holding freelist_lck; so set it after

releasing the

+        * freelist_lck.  This is annoyingly tedious, but it happens
at most once
+        * per bgwriter cycle, so the performance hit is minimal.
+        */
+
We're not actually holding any lock in need of releasing at that point
in the code, so this can be shortened to "If bgwriterLatch is set, we
need to waken the bgwriter."

Changed as per suggestion.

* Ideally numFreeListBuffers should get called under freelist

spinlock,

That doesn't make any sense. numFreeListBuffers is a variable, so you
can't "call" it. The value should be *read* under the spinlock, but
it is. I think this whole comment can be deleted and replaced with
"If the number of free buffers has fallen below the low water mark,
awaken the bgreclaimer to repopulate it."

Changed as per suggestion.

4. StrategySyncStartAndEnd() is kind of a mess. One, it can return
the same victim buffer that's being handed out - at almost the same
time - to a backend running the clock sweep; if it does, they'll fight
over the buffer. Two, the *end out parameter actually returns a
count, not an endpoint. I think we should have
BgMoveBuffersToFreelist() call StrategySyncNextVictimBuffer() at the
top of the inner loop rather than the bottom, and change
StrategySyncStartAndEnd() so that it knows nothing about
victimbuf_lck. Let's also change StrategyGetBuffer() to call
StrategySyncNextVictimBuffer() so that the logic is centralized in one
place, and rename StrategySyncStartAndEnd() to something that better
matches its revised purpose.

Changed as per suggestion. I have also updated
StrategySyncNextVictimBuffer() such that it increments completePasses
on completion of cycle as I think it is appropriate to update it, even when
clock sweep is done by bgreclaimer.

Maybe StrategyGetReclaimInfo().

I have changed it to StrategyGetFreelistAccessInfo() as it seems most
other functions in freelist.c uses the names to sound something related
to buffers.

5. Have you tested that this new bgwriter statistic is actually
working? Because it looks to me like BgMoveBuffersToFreelist is
changing BgWriterStats but never calling pgstat_send_bgwriter(), which
I'm thinking will mean the counters accumulate forever inside the
reclaimer but never get forwarded to the stats collector.

pgstat_send_bgwriter() is called in bgreclaimer loop (caller of
BgMoveBuffersToFreelist, this is similar to how bgwriter does).
I have done few tests with it before sending the previous patch.

6. StrategyInitialize() uses #defines for
HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT and
LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT but inline constants (5, 2000)
for clamping. Let's have constants for all of those (and omit mention
of the specific values in the comments).

Changed as per suggestion.

7. The line you've added to the definition of view pg_stat_bgwriter
doesn't seem to be indented the same way as all the others. Tab vs.
space problem?

Fixed.

Performance Data:
-------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client Count/Patch_ver 8 16 32 64 128 HEAD 58614 107370 140717 104357
65010 Patch 61825 115152 170952 217389 220994

Observation
--------------------
1. The scalability/performance is similar to previous patch, slightly
better at higher client count.
2. I have taken the performance data just for one set of configuration,
as there doesn't seem to be any fundamental change which can impact
performance.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v8.patchapplication/octet-stream; name=scalable_buffer_eviction_v8.patchDownload

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5e9e735..48d642a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -815,6 +815,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Number of buffers allocated</entry>
      </row>
      <row>
+      <entry><structfield>buffers_backend_clocksweep</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of buffer allocations that are not satisfied from
+      freelist</entry>
+     </row>
+     <row>
       <entry><structfield>stats_reset</></entry>
       <entry><type>timestamp with time zone</type></entry>
       <entry>Time at which these statistics were last reset</entry>
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..38698b0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "postmaster/bgreclaimer.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -179,7 +180,8 @@ static IndexList *ILHead = NULL;
  *	 AuxiliaryProcessMain
  *
  *	 The main entry point for auxiliary processes, such as the bgwriter,
- *	 walwriter, walreceiver, bootstrapper and the shared memory checker code.
+ *	 walwriter, walreceiver, bgreclaimer, bootstrapper and the shared
+ *	 memory checker code.
  *
  *	 This code is here just because of historical reasons.
  */
@@ -323,6 +325,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case WalReceiverProcess:
 				statmsg = "wal receiver process";
 				break;
+			case BgReclaimerProcess:
+				statmsg = "reclaimer process";
+				break;
 			default:
 				statmsg = "??? process";
 				break;
@@ -437,6 +442,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			WalReceiverMain();
 			proc_exit(1);		/* should never return */
 
+		case BgReclaimerProcess:
+			/* don't set signals, bgreclaimer has its own agenda */
+			BackgroundReclaimerMain();
+			proc_exit(1);		/* should never return */
+
 		default:
 			elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
 			proc_exit(1);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1bde175..582d3c0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -712,6 +712,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_written_backend() AS buffers_backend,
         pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
         pg_stat_get_buf_alloc() AS buffers_alloc,
+        pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..168d0d8 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o bgreclaimer.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..3df2337
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c
@@ -0,0 +1,302 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they would only do when they don't find a usable shared buffer from
+ * freelist to read in another page).  In the best scenario all requests
+ * for shared buffers will be fulfilled from freelist as the background
+ * reclaimer process always tries to maintain buffers on freelist.  However,
+ * regular backends are still empowered to run clock sweep to find a usable
+ * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
+ *
+ * The bgwriter is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGTERM, which instructs the bgreclaimer to exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the bgreclaimer will
+ * simply abort and exit on SIGQUIT.
+ *
+ * If the bgreclaimer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/bgreclaimer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgreclaimer.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+
+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for bgreclaim process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+BackgroundReclaimerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext bgreclaim_context;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.  (bgreclaim probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us.
+	 *
+	 * bgreclaim doesn't participate in ProcSignal signalling, but a SIGUSR1
+	 * handler is still needed for latch wakeups.
+	 */
+	pqsignal(SIGHUP, BgreclaimSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, ReqShutdownHandler);		/* shutdown */
+	pqsignal(SIGQUIT, bgreclaim_quickdie);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, bgreclaim_sigusr1_handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  As of now, the memory allocation can be done
+	 * only during processing of SIGHUP signal.
+	 */
+	bgreclaim_context = AllocSetContextCreate(TopMemoryContext,
+											 "Background Reclaim",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(bgreclaim_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().  We don't have very many resources to worry
+		 * about in bgreclaim, but we do have buffers and file descriptors.
+		 */
+		UnlockBuffers();
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(bgreclaim_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	StrategyInitBgReclaimerLatch(&MyProc->procLatch);
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		int			rc;
+
+		/* Clear any already-pending wakeups */
+		ResetLatch(&MyProc->procLatch);
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Normal exit from the bgwriter is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Backend will signal bgreclaimer when the number of buffers in
+		 * freelist falls below than low water mark of freelist.
+		 */
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+					   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgMoveBuffersToFreelist();
+
+		/*
+		 * Send off activity statistics to the stats collector
+		 */
+		pgstat_send_bgwriter();
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			exit(1);
+	}
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * bgreclaim_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+bgreclaim_quickdie(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+BgreclaimSigHupHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_SIGHUP = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGTERM: set flag to shutdown and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	shutdown_requested = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+bgreclaim_sigusr1_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	latch_sigusr1_handler();
+
+	errno = save_errno;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7f41a5..7475e5a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5021,6 +5021,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.buf_backend_clocksweep += msg->m_buf_backend_clocksweep;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 14535c8..565cf4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -143,13 +143,13 @@
  * authorization phase).  This is used mainly to keep track of how many
  * children we have and send them appropriate signals when necessary.
  *
- * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list.  Autovacuum worker and walsender are in it.
- * Also, "dead_end" children are in it: these are children launched just for
- * the purpose of sending a friendly rejection message to a would-be client.
- * We must track them because they are attached to shared memory, but we know
- * they will never become live backends.  dead_end children are not assigned a
- * PMChildSlot.
+ * "Special" children such as the startup, bgwriter, bgreclaimer and
+ * autovacuum launcher tasks are not in this list.  Autovacuum worker and
+ * walsender are in it.  Also, "dead_end" children are in it: these are
+ * children launched just for the purpose of sending a friendly rejection
+ * message to a would-be client.  We must track them because they are attached
+ * to shared memory, but we know they will never become live backends.
+ * dead_end children are not assigned a PMChildSlot.
  *
  * Background workers that request shared memory access during registration are
  * in this list, too.
@@ -243,7 +243,8 @@ static pid_t StartupPID = 0,
 			AutoVacPID = 0,
 			PgArchPID = 0,
 			PgStatPID = 0,
-			SysLoggerPID = 0;
+			SysLoggerPID = 0,
+			BgReclaimerPID = 0;
 
 /* Startup/shutdown state */
 #define			NoShutdown		0
@@ -269,13 +270,13 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * hot standby during archive recovery.
  *
  * When the startup process is ready to start archive recovery, it signals the
- * postmaster, and we switch to PM_RECOVERY state. The background writer and
- * checkpointer are launched, while the startup process continues applying WAL.
- * If Hot Standby is enabled, then, after reaching a consistent point in WAL
- * redo, startup process signals us again, and we switch to PM_HOT_STANDBY
- * state and begin accepting connections to perform read-only queries.  When
- * archive recovery is finished, the startup process exits with exit code 0
- * and we switch to PM_RUN state.
+ * postmaster, and we switch to PM_RECOVERY state. The background writer,
+ * background reclaimer and checkpointer are launched, while the startup
+ * process continues applying WAL.  If Hot Standby is enabled, then, after
+ * reaching a consistent point in WAL redo, startup process signals us again,
+ * and we switch to PM_HOT_STANDBY state and begin accepting connections to
+ * perform read-only queries.  When archive recovery is finished, the startup
+ * process exits with exit code 0 and we switch to PM_RUN state.
  *
  * Normal child backends can only be launched when we are in PM_RUN or
  * PM_HOT_STANDBY state.  (We also allow launch of normal
@@ -505,6 +506,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
+#define StartBackgroundReclaimer() StartChildProcess(BgReclaimerProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -568,8 +570,8 @@ PostmasterMain(int argc, char *argv[])
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-	 * postmaster/syslogger.c, postmaster/bgworker.c and
-	 * postmaster/checkpointer.c.
+	 * postmaster/syslogger.c, postmaster/bgworker.c, postmaster/bgreclaimer.c
+	 * and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1583,7 +1585,8 @@ ServerLoop(void)
 		/*
 		 * If no background writer process is running, and we are not in a
 		 * state that prevents it, start one.  It doesn't matter if this
-		 * fails, we'll just try again later.  Likewise for the checkpointer.
+		 * fails, we'll just try again later.  Likewise for the checkpointer
+		 * and bgreclaimer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 			pmState == PM_HOT_STANDBY)
@@ -1592,6 +1595,8 @@ ServerLoop(void)
 				CheckpointerPID = StartCheckpointer();
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 		}
 
 		/*
@@ -2330,6 +2335,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(SysLoggerPID, SIGHUP);
 		if (PgStatPID != 0)
 			signal_child(PgStatPID, SIGHUP);
+		if (BgReclaimerPID != 0)
+			signal_child(BgReclaimerPID, SIGHUP);
 
 		/* Reload authentication config files too */
 		if (!load_hba())
@@ -2398,6 +2405,9 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
+				/* and the bgreclaimer too */
+				if (BgReclaimerPID != 0)
+					signal_child(BgReclaimerPID, SIGTERM);
 
 				/*
 				 * If we're in recovery, we can't kill the startup process
@@ -2440,14 +2450,16 @@ pmdie(SIGNAL_ARGS)
 				signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
+			if (BgReclaimerPID != 0)
+				signal_child(BgReclaimerPID, SIGTERM);
 			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
 				/*
-				 * Only startup, bgwriter, walreceiver, unconnected bgworkers,
-				 * and/or checkpointer should be active in this state; we just
-				 * signaled the first four, and we don't want to kill
-				 * checkpointer yet.
+				 * Only startup, bgwriter, walreceiver, bgreclaimer,
+				 * unconnected bgworkers, and/or checkpointer should be
+				 * active in this state; we just signaled the first five,
+				 * and we don't want to kill checkpointer yet.
 				 */
 				pmState = PM_WAIT_BACKENDS;
 			}
@@ -2600,6 +2612,8 @@ reaper(SIGNAL_ARGS)
 				BgWriterPID = StartBackgroundWriter();
 			if (WalWriterPID == 0)
 				WalWriterPID = StartWalWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2625,7 +2639,8 @@ reaper(SIGNAL_ARGS)
 		/*
 		 * Was it the bgwriter?  Normal exit can be ignored; we'll start a new
 		 * one at the next iteration of the postmaster's main loop, if
-		 * necessary.  Any other exit condition is treated as a crash.
+		 * necessary.  Any other exit condition is treated as a crash.  Likewise
+		 * for bgreclaimer.
 		 */
 		if (pid == BgWriterPID)
 		{
@@ -2636,6 +2651,17 @@ reaper(SIGNAL_ARGS)
 			continue;
 		}
 
+		if (pid == BgReclaimerPID)
+		{
+			BgReclaimerPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background reclaimer process"));
+			continue;
+		}
+
+
+
 		/*
 		 * Was it the checkpointer?
 		 */
@@ -2997,7 +3023,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, bgreclaimer or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3201,6 +3227,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the bgreclaimer too */
+	if (pid == BgReclaimerPID)
+		BgReclaimerPID = 0;
+	else if (BgReclaimerPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) BgReclaimerPID)));
+		signal_child(BgReclaimerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/*
 	 * Force a power-cycle of the pgarch process too.  (This isn't absolutely
 	 * necessary, but it seems like a good idea for robustness, and it
@@ -3371,14 +3409,14 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
 		 * (including autovac workers), no bgworkers (including unconnected
-		 * ones), and no walwriter, autovac launcher or bgwriter.  If we are
-		 * doing crash recovery or an immediate shutdown then we expect the
-		 * checkpointer to exit as well, otherwise not. The archiver, stats,
-		 * and syslogger processes are disregarded since they are not
-		 * connected to shared memory; we also disregard dead_end children
-		 * here. Walsenders are also disregarded, they will be terminated
-		 * later after writing the checkpoint record, like the archiver
-		 * process.
+		 * ones), and no walwriter, autovac launcher, bgwriter or bgreclaimer.
+		 * If we are doing crash recovery or an immediate shutdown then we
+		 * expect the checkpointer to exit as well, otherwise not. The
+		 * archiver, stats, and syslogger processes are disregarded since they
+		 * are not connected to shared memory; we also disregard dead_end
+		 * children here. Walsenders are also disregarded, they will be
+		 * terminated later after writing the checkpoint record, like the
+		 * archiver process.
 		 */
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
 			CountUnconnectedWorkers() == 0 &&
@@ -3388,7 +3426,8 @@ PostmasterStateMachine(void)
 			(CheckpointerPID == 0 ||
 			 (!FatalError && Shutdown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
-			AutoVacPID == 0)
+			AutoVacPID == 0 &&
+			BgReclaimerPID == 0)
 		{
 			if (Shutdown >= ImmediateShutdown || FatalError)
 			{
@@ -3486,6 +3525,7 @@ PostmasterStateMachine(void)
 			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
+			Assert(BgReclaimerPID == 0);
 			/* syslogger is not considered here */
 			pmState = PM_NO_CHILDREN;
 		}
@@ -3698,6 +3738,8 @@ TerminateChildren(int signal)
 		signal_child(WalReceiverPID, signal);
 	if (AutoVacPID != 0)
 		signal_child(AutoVacPID, signal);
+	if (BgReclaimerPID != 0)
+		signal_child(BgReclaimerPID, signal);
 	if (PgArchPID != 0)
 		signal_child(PgArchPID, signal);
 	if (PgStatPID != 0)
@@ -4778,6 +4820,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		CheckpointerPID = StartCheckpointer();
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(BgReclaimerPID == 0);
+		BgReclaimerPID = StartBackgroundReclaimer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -5122,6 +5166,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 						(errmsg("could not fork WAL receiver process: %m")));
 				break;
+			case BgReclaimerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork background writer process: %m")));
+				break;
 			default:
 				ereport(LOG,
 						(errmsg("could not fork process: %m")));
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..dfb9cb5 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,10 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+* BufferStrategyControl contains a spinlock freelist_lck that provides mutual
+exclusion for operations that access the buffer freelist or select
+buffers for replacement.  It also contains victimbuf_lck that protects
+information related to the current clock sweep condition.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -160,16 +156,20 @@ Normal Buffer Replacement Strategy
 
 There is a "free list" of buffers that are prime candidates for replacement.
 In particular, buffers that are completely free (contain no valid page) are
-always in this list.  We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that.  The list is singly-linked using fields in the
+always in this list.  Allocating pages from this list is much cheaper than
+running the "clock sweep" algorithm, which may encounter many buffers
+that are poor candidates for eviction before finding a good candidate.
+Therefore, we have a background process called bgreclaimer which works
+to keep this list populated.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the freelist_lck, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations.  It works like
-this:
+need to take system-wide locks during common operations.  The background
+reclaimer attempts to keep regular backends from having to run clock sweep
+by maintaining buffers on freelist, however backends are also empowered
+to run clock sweep. Clock sweep works like this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned.  (This requires only the
@@ -178,25 +178,28 @@ buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+victimbuf_lck spinlock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain spinlock freelist_lck.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer and release
+the freelist_lck.  Now set the bgwriter or bgreclaimer latch if required.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
+3. If we get the buffer, check if it is neither pinned nor
+has nonzero usage count, pin the buffer, and return the buffer.
+Otherwise again try to get the buffer from freelist and return
+to the start of step 3.
+
+4. Otherwise, select the buffer pointed to by nextVictimBuffer, and
 circularly advance nextVictimBuffer for next time.
 
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
+5. If the selected buffer is pinned or has a nonzero usage count, it cannot
+be used.  Decrement its usage count (if nonzero) and return to step 4 to
 examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+6. Pin the selected buffer, and return the buffer.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +262,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take the spinlock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
@@ -281,3 +284,19 @@ As of 8.4, background writer starts during recovery mode when there is
 some form of potentially extended recovery to perform. It performs an
 identical service to normal processing, except that checkpoints it
 writes are technically restartpoints.
+
+
+Background Reclaimer's Processing
+---------------------------------
+
+The background reclaimer is designed to move buffers to freelist that are
+likely to be recycled soon, thereby offloading the need to perform
+clock sweep work from active backends.  To do this, it runs the clock sweep
+and move the the unpinned and zero usage count buffers to freelist.  It
+keeps on doing this until the number of buffers in freelist reaches the
+high water mark.
+
+Two water mark indicators are used to maintain sufficient number of buffers
+on freelist.  Low water mark indicator is used by backends to wake bgreclaimer
+when the number of buffers in freelist falls below it.  High water mark
+indicator is used by bgreclaimer to move buffers to freelist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3240432..524fe20 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -889,15 +889,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		bool		lock_held;
-
 		/*
 		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!  Also (in most cases) the BufFreelistLock is
-		 * still held, since it would be bad to hold the spinlock while
-		 * possibly waking up other processes.
+		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &lock_held);
+		buf = StrategyGetBuffer(strategy);
 
 		Assert(buf->refcount == 0);
 
@@ -907,10 +903,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		/* Pin the buffer and then release the buffer spinlock */
 		PinBuffer_Locked(buf);
 
-		/* Now it's safe to release the freelist lock */
-		if (lock_held)
-			LWLockRelease(BufFreelistLock);
-
 		/*
 		 * If the buffer was dirty, try to write it out.  There is a race
 		 * condition here, in that someone might dirty it after we released it
@@ -1933,6 +1925,87 @@ BgBufferSync(void)
 }
 
 /*
+ * Move buffers with reference and usage_count as zero to freelist.
+ * By maintaining enough number of buffers on freelist (equal to
+ * high water mark of freelist), we drastically reduce the odds for
+ * backend's to perform clock sweep.
+ *
+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */
+void
+BgMoveBuffersToFreelist(void)
+{
+	uint32	num_to_free = 0;
+	uint32	recent_alloc = 0;
+	uint32  recent_backend_clocksweep = 0;
+	volatile uint32	next_victim = 0;
+
+	/* Execute the clock sweep */
+	for (;;)
+	{
+		uint32	tmp_num_to_free;
+		uint32	tmp_recent_alloc;
+		uint32  tmp_recent_backend_clocksweep;
+
+		StrategyGetFreelistAccessInfo(&tmp_num_to_free,
+									  &tmp_recent_alloc,
+									  &tmp_recent_backend_clocksweep);
+
+		num_to_free += tmp_num_to_free;
+		recent_alloc += tmp_recent_alloc;
+		recent_backend_clocksweep += tmp_recent_backend_clocksweep;
+
+		if (tmp_num_to_free == 0)
+			break;
+
+		while (tmp_num_to_free > 0)
+		{
+			volatile BufferDesc *bufHdr;
+			bool	add_to_freelist = false;
+
+			/*
+			 * Choose next victim buffer to look if that can be moved
+			 * to freelist.
+			 */
+			StrategySyncNextVictimBuffer(&next_victim);
+
+			bufHdr = &BufferDescriptors[next_victim];
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot
+			 * move it to freelist; decrement the usage_count (unless pinned)
+			 * and keep scanning.
+			 */
+			LockBufHdr(bufHdr);
+			if (bufHdr->refcount == 0)
+			{
+				if (bufHdr->usage_count > 0)
+					bufHdr->usage_count--;
+				else
+					add_to_freelist = true;
+			}
+			UnlockBufHdr(bufHdr);
+
+			if (add_to_freelist && StrategyMoveBufferToFreeListEnd(bufHdr))
+				tmp_num_to_free--;
+		}
+	}
+
+	/*
+	 * Report buffer alloc and buffer request not satisfied
+	 * from freelist counts to pgstat.
+	 */
+	BgWriterStats.m_buf_alloc += recent_alloc;
+	BgWriterStats.m_buf_backend_clocksweep += recent_backend_clocksweep;
+
+#ifdef BGW_DEBUG
+	elog(DEBUG1, "bgreclaimer: recent_alloc=%u recent_backend_clocksweep =%d next_victim=%d num_freed=%u",
+		 recent_alloc, recent_backend_clocksweep, next_victim, num_to_free);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..f6cd5f0 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -37,19 +38,48 @@ typedef struct
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
-	 * overflow during a single bgwriter cycle.
+	 * overflow during a single bgwriter cycle.  completePasses is only
+	 * recorded by bgwriter, numBufferBackendClocksweep is only recorded
+	 * by bgreclaimer, however numBufferAllocs is recorded by both bgwriter
+	 * and bgreclaimer.
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
+	/* Buffers not statistied from freelist since last reset */
+	uint32		numBufferBackendClocksweep;
+
+	/*
+	 * protects freelist and related variables (firstFreeBuffer,
+	 * lastFreeBuffer, numBufferAllocs, numBufferBackendClocksweep,
+	 * numFreeListBuffers, BufferDesc->freeNext).
+	 */
+	slock_t	     freelist_lck;
+
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * Protects nextVictimBuffer and completePasses. We need separate
+	 * lock to protect victim buffer and completePasses so that
+	 * clock sweep of one backend doesn't contend with another backend
+	 * which is evicting buffer from freelist.  We can consider having
+	 * victimbuf_lck and freelist_lck in separate cache lines by keeping
+	 * them apart in structure and by adding padding bytes, however at
+	 * the moment there is no proof that having them in same cache line
+	 * hits the performance in any scenario.
+	 */
+	slock_t	     victimbuf_lck;
+
+	/*
+	 * Latch to wake bgwriter.
 	 */
 	Latch	   *bgwriterLatch;
+	/*
+	 * Latch to wake bgreclaimer.
+	 */
+	Latch	   *bgreclaimerLatch;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -84,6 +114,21 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[1];		/* VARIABLE SIZE ARRAY */
 }	BufferAccessStrategyData;
 
+/*
+ * Water mark indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakesup and begin
+ * adding buffer's to freelist until it reaches high water mark and then
+ * again goes back to sleep.
+ */
+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/* Percentage indicators for maintaining buffers on freelist */
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.2
+#define MIN_HIGH_WATER_MARK	5
+#define MAX_HIGH_WATER_MARK	2000
 
 /* Prototypes for internal functions */
 static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
@@ -101,67 +146,51 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
  *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
  *
  *	To ensure that no one else can pin the buffer before we do, we must
- *	return the buffer with the buffer header spinlock still held.  If
- *	*lock_held is set on exit, we have returned with the BufFreelistLock
- *	still held, as well; the caller must release that lock once the spinlock
- *	is dropped.  We do it that way because releasing the BufFreelistLock
- *	might awaken other processes, and it would be bad to do the associated
- *	kernel calls while holding the buffer header spinlock.
+ *	return the buffer with the buffer header spinlock still held.
  */
 volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
 {
-	volatile BufferDesc *buf;
+	volatile BufferDesc *buf = NULL;
 	Latch	   *bgwriterLatch;
+	Latch	   *bgreclaimerLatch;
+	int			numFreeListBuffers;
 	int			trycounter;
 
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need the BufFreelistLock.
+	 * assume strategy objects don't need the freelist_lck.
 	 */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy);
 		if (buf != NULL)
-		{
-			*lock_held = false;
 			return buf;
-		}
 	}
 
 	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
+	 * We count buffer allocation requests so that the bgwriter or bgreclaimer
+	 * can know the rate of buffer consumption and report it as stats.  Note
+	 * that buffers recycled by a strategy object are intentionally not counted
+	 * here.
 	 */
 	StrategyControl->numBufferAllocs++;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * Remember the values of bgwriter and bgreclaimer latch so that they can
+	 * be set outside spin lock and try to get a buffer from the freelist.
 	 */
+	bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
 	bgwriterLatch = StrategyControl->bgwriterLatch;
 	if (bgwriterLatch)
-	{
 		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
 
-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+	if (StrategyControl->firstFreeBuffer >= 0)
 	{
 		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
 		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
@@ -169,35 +198,82 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		/* Unconditionally remove buffer from freelist */
 		StrategyControl->firstFreeBuffer = buf->freeNext;
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		--StrategyControl->numFreeListBuffers;
+	}
+	else
+		StrategyControl->numBufferBackendClocksweep++;
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
 
+	/* If bgwriterLatch is set, we need to waken the bgwriter */
+	if (bgwriterLatch)
+		SetLatch(bgwriterLatch);
+
+	/*
+	 * If the number of free buffers has fallen below the low water mark,
+	 * awaken the bgreclaimer to repopulate it.  bgreclaimerLatch is initialized in
+	 * early phase of BgReclaimer startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
+	 */
+	if (numFreeListBuffers < freelistLowWaterMark  && bgreclaimerLatch)
+		SetLatch(StrategyControl->bgreclaimerLatch);
+
+	if (buf != NULL)
+	{
 		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
+		 * Try to get a buffer from the freelist.  Note that the freeNext fields
+		 * are considered to be protected by the freelist_lck not the
+		 * individual buffer spinlocks, so it's OK to manipulate them without
+		 * holding the buffer spinlock.
 		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+		for(;;)
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+			SpinLockAcquire(&StrategyControl->freelist_lck);
+
+			if (StrategyControl->firstFreeBuffer >= 0)
+			{
+				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+				Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+				/* Unconditionally remove buffer from freelist */
+				StrategyControl->firstFreeBuffer = buf->freeNext;
+				buf->freeNext = FREENEXT_NOT_IN_LIST;
+				--StrategyControl->numFreeListBuffers;
+
+				SpinLockRelease(&StrategyControl->freelist_lck);
+			}
+			else
+			{
+				StrategyControl->numBufferBackendClocksweep++;
+				SpinLockRelease(&StrategyControl->freelist_lck);
+				break;
+			}
 		}
-		UnlockBufHdr(buf);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
 	for (;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
+		volatile uint32	next_victim;
 
-		if (++StrategyControl->nextVictimBuffer >= NBuffers)
-		{
-			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
-		}
+		StrategySyncNextVictimBuffer(&next_victim);
+
+		buf = &BufferDescriptors[next_victim];
 
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -241,7 +317,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,11 +329,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
+	}
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
 	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
 
-	LWLockRelease(BufFreelistLock);
+	return freed;
 }
 
+
 /*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
@@ -274,20 +389,77 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
 	result = StrategyControl->nextVictimBuffer;
+
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 	if (num_buf_alloc)
 	{
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
+		SpinLockRelease(&StrategyControl->freelist_lck);
 	}
-	LWLockRelease(BufFreelistLock);
 	return result;
 }
 
 /*
+ * StrategyGetFreelistAccessInfo -- get information required by bgreclaimer
+ * to move unused buffers to freelist.
+ *
+ * The result is the number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs and buffer allocs not
+ * satisfied from freelist.
+ */
+void
+StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
+							  uint32 *num_buf_backend_clocksweep)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighWaterMark)
+		*num_buf_to_free = freelistHighWaterMark - curfreebuffers;
+	else
+		*num_buf_to_free = 0;
+
+	if (num_buf_alloc)
+	{
+		*num_buf_alloc = StrategyControl->numBufferAllocs;
+		StrategyControl->numBufferAllocs = 0;
+	}
+	if (num_buf_backend_clocksweep)
+	{
+		*num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
+		StrategyControl->numBufferBackendClocksweep = 0;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell bgreclaimer where to start looking
+ * for next unused buffer.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	*next_victim_buffer = StrategyControl->nextVictimBuffer;
+	if (++StrategyControl->nextVictimBuffer >= NBuffers)
+	{
+		StrategyControl->nextVictimBuffer = 0;
+		StrategyControl->completePasses++;
+	}
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -299,15 +471,27 @@ void
 StrategyNotifyBgWriter(Latch *bgwriterLatch)
 {
 	/*
-	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * We acquire the freelist_lck just to ensure that the store appears
 	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
 	 * infrequently, so there's no performance penalty from being safe.
 	 */
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 	StrategyControl->bgwriterLatch = bgwriterLatch;
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
+/*
+ * StrategyInitBgReclaimerLatch -- Initialize bgreclaimer latch.
+ * This will be used by bgreclaimer to wake itself when backend
+ * sets this latch.
+ */
+void
+StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch)
+{
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	StrategyControl->bgreclaimerLatch = bgreclaimerLatch;
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
 
 /*
  * StrategyShmemSize
@@ -376,6 +560,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -383,12 +568,33 @@ StrategyInitialize(bool init)
 		/* Clear statistics */
 		StrategyControl->completePasses = 0;
 		StrategyControl->numBufferAllocs = 0;
+		StrategyControl->numBufferBackendClocksweep = 0;
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->bgreclaimerLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
+		SpinLockInit(&StrategyControl->victimbuf_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high water mark number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.  We need to maintain enough buffers so that
+	 * requests can be satisfied from freelist.  These numbers
+	 * are based on results of benchmarks at various workloads.
+	 */
+	freelistHighWaterMark = HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT * NBuffers;
+	if (freelistHighWaterMark < MIN_HIGH_WATER_MARK)
+		freelistHighWaterMark = MIN_HIGH_WATER_MARK;
+	else if (freelistHighWaterMark > MAX_HIGH_WATER_MARK)
+		freelistHighWaterMark = MAX_HIGH_WATER_MARK;
+
+	freelistLowWaterMark = LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT *
+						   freelistHighWaterMark;
 }
 
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 44ccd37..00d815f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -100,6 +100,7 @@ extern Datum pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS);
 
 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_xact_tuples_returned(PG_FUNCTION_ARGS);
@@ -1496,6 +1497,12 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 }
 
 Datum
+pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_backend_clocksweep);
+}
+
+Datum
 pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 5176ed0..1265bb1 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2779,6 +2779,8 @@ DATA(insert OID = 3063 ( pg_stat_get_buf_fsync_backend PGNSP PGUID 12 1 0 0 0 f
 DESCR("statistics: number of backend buffer writes that did their own fsync");
 DATA(insert OID = 2859 ( pg_stat_get_buf_alloc			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_alloc _null_ _null_ _null_ ));
 DESCR("statistics: number of buffer allocations");
+DATA(insert OID = 3218 ( pg_stat_get_buf_clocksweep_backend			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_clocksweep_backend _null_ _null_ _null_ ));
+DESCR("statistics: number of buffer allocations not satisfied from freelsit");
 
 DATA(insert OID = 2978 (  pg_stat_get_function_calls		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_function_calls _null_ _null_ _null_ ));
 DESCR("statistics: number of function calls");
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 3807955..8e58fb4 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -366,6 +366,7 @@ typedef enum
 	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
+	BgReclaimerProcess,
 
 	NUM_AUXPROCTYPES			/* Must be last! */
 } AuxProcType;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0892533..51a2023 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -397,6 +397,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_backend;
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
+	PgStat_Counter m_buf_backend_clocksweep;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -545,7 +546,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9D
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -670,6 +671,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	PgStat_Counter buf_backend_clocksweep;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/postmaster/bgreclaimer.h b/src/include/postmaster/bgreclaimer.h
new file mode 100644
index 0000000..bbd6943
--- /dev/null
+++ b/src/include/postmaster/bgreclaimer.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.h
+ *	  POSTGRES buffer reclaimer definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgreclaimer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGRECLAIMER_H
+#define _BGRECLAIMER_H
+
+extern void BackgroundReclaimerMain(void) __attribute__((noreturn));
+
+
+#endif   /* _BGRECLAIMER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..58eb51d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -115,9 +115,8 @@ typedef struct buftag
  * Note: buf_hdr_lock must be held to examine or change the tag, flags,
  * usage_count, refcount, or wait_backend_pid fields.  buf_id field never
  * changes after initialization, so does not need locking.  freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock.  The LWLocks can take
- * care of themselves.  The buf_hdr_lock is *not* used to control access to
- * the data in the buffer!
+ * protected by the freelist_lck not buf_hdr_lock.  The buf_hdr_lock is
+ * *not* used to control access to the data in the buffer!
  *
  * An exception is that if we have the buffer pinned, its tag can't change
  * underneath us, so we can examine the tag without locking the spinlock.
@@ -185,14 +184,19 @@ extern BufferDesc *LocalBufferDescriptors;
  */
 
 /* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-				  bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free,
+										  uint32 *num_buf_alloc,
+										  uint32 *num_buf_backend_clocksweep);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgReclaimerLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 42d9120..da4f837 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -200,6 +200,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..754a838 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -89,7 +89,6 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  * if you remove a lock, consider leaving a gap in the numbering sequence for
  * the benefit of DTrace and other external debugging scripts.
  */
-#define BufFreelistLock				(&MainLWLockArray[0].lock)
 #define ShmemIndexLock				(&MainLWLockArray[1].lock)
 #define OidGenLock					(&MainLWLockArray[2].lock)
 #define XidGenLock					(&MainLWLockArray[3].lock)
@@ -136,7 +135,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..b0688a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, Background reclaimer, checkpointer and WAL writer run
+ * during normal operation.  Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ca56b47..939075e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1671,6 +1671,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
     pg_stat_get_buf_written_backend() AS buffers_backend,
     pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
     pg_stat_get_buf_alloc() AS buffers_alloc,
+    pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 pg_stat_database| SELECT d.oid AS datid,
     d.datname,

#49

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Mark Kirkwood (#46)

Re: Scaling shared buffer eviction

On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood <
mark.kirkwood@catalyst.net.nz> wrote:

On 05/09/14 23:50, Amit Kapila wrote:

On Fri, Sep 5, 2014 at 8:42 AM, Mark Kirkwood

FWIW below are some test results on the 60 core beast with this patch

applied to 9.4. I'll need to do more runs to iron out the variation,

but it looks like the patch helps the standard (write heavy) pgbench

workload a little, and clearly helps the read only case.

Thanks for doing the test. I think if possible you can take
the performance data with higher scale factor (4000) as it
seems your m/c has 1TB of RAM. You might want to use
latest patch I have posted today.

Here's some fairly typical data from read-write and read-only runs at

scale 4000 for 9.4 beta2 with and without the v7 patch (below). I'm not
seeing much variation between repeated read-write runs with the same config
(which is nice - sleep 30 and explicit checkpoint call between each one
seem to help there).

Interestingly, I note anecdotally that (unpatched) 9.4 beta2 seems to be

better at higher client counts than beta1 was...

In terms of the effect of the patch - looks pretty similar to the scale

2000 results for read-write, but read-only is a different and more
interesting story - unpatched 9.4 is noticeably impacted in the higher
client counts, whereas the patched version scales as well (or even better
perhaps) than in the scale 2000 case.

Yeah, that's what I was expecting, the benefit of this patch
will be more at higher client count when there is large data
and all the data can fit in RAM .

Many thanks for doing the performance test for patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#50

Mark Kirkwood

mark.kirkwood@catalyst.net.nz

over 11 years ago

In reply to: Amit Kapila (#49)

Re: Scaling shared buffer eviction

On 10/09/14 18:54, Amit Kapila wrote:

On Wed, Sep 10, 2014 at 5:46 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>>
wrote:

In terms of the effect of the patch - looks pretty similar to the

scale 2000 results for read-write, but read-only is a different and more
interesting story - unpatched 9.4 is noticeably impacted in the higher
client counts, whereas the patched version scales as well (or even
better perhaps) than in the scale 2000 case.

Yeah, that's what I was expecting, the benefit of this patch
will be more at higher client count when there is large data
and all the data can fit in RAM .

Many thanks for doing the performance test for patch.

No worries, this is looking like a patch we're going to apply to 9.4 to
make the 60 core beast scale a bit better, so thanks very much for your
work in this area.

If you would like more tests run at higher scales let me know (we have
two of these machines at pre-production state currently so I can run
benchmarks as reqd)!

Regards

Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Amit Kapila (#48)

Re: Scaling shared buffer eviction

Hi,

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:

include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..3df2337
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c

A fair number of comments in that file refer to bgwriter...

@@ -0,0 +1,302 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they would only do when they don't find a usable shared buffer from
+ * freelist to read in another page).

That's not really accurate. Freelist pages are often also needed to
write new pages, without reading anything in. I'd phrase it as "which
they only need to do if they don't find a victim buffer from the
freelist"

In the best scenario all requests
+ * for shared buffers will be fulfilled from freelist as the background
+ * reclaimer process always tries to maintain buffers on freelist.  However,
+ * regular backends are still empowered to run clock sweep to find a usable
+ * buffer if the bgreclaimer fails to maintain enough buffers on freelist.

"empowered" sounds strange to me. 'still can run the clock sweep'?

+ * The bgwriter is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.

Why only archive recovery? I guess (only read this far...) it's not just
during InArchiveRecoveryb recovery but also StandbyMode? But I don't see
why we shouldn't use it during normal crash recovery. That's also often
painfully slow and the reclaimer could help? Less, but still.

+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

This looks inconsistent.

+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().  We don't have very many resources to worry
+		 * about in bgreclaim, but we do have buffers and file descriptors.
+		 */
+		UnlockBuffers();
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(bgreclaim_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();

Other processes sleep for a second here, I think that's a good
idea. E.g. that bit:
/*
* Sleep at least 1 second after any error. A write error is likely
* to be repeated, and we don't want to be filling the error logs as
* fast as we can.
*/
pg_usleep(1000000L);

+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		int			rc;
+
+
+		/*
+		 * Backend will signal bgreclaimer when the number of buffers in
+		 * freelist falls below than low water mark of freelist.
+		 */
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+					   -1);

That's probably not going to work well directly after a (re)start of
bgreclaim (depending on how you handle the water mark, I'll see in a
bit). Maybe it should rather be

ResetLatch();
BgMoveBuffersToFreelist();
pgstat_send_bgwriter();
rc = WaitLatch()
if (rc & WL_POSTMASTER_DEATH)
exit(1)

+Background Reclaimer's Processing
+---------------------------------
+
+The background reclaimer is designed to move buffers to freelist that are
+likely to be recycled soon, thereby offloading the need to perform
+clock sweep work from active backends.  To do this, it runs the clock sweep
+and move the the unpinned and zero usage count buffers to freelist.  It
+keeps on doing this until the number of buffers in freelist reaches the
+high water mark.
+
+Two water mark indicators are used to maintain sufficient number of buffers
+on freelist.  Low water mark indicator is used by backends to wake bgreclaimer
+when the number of buffers in freelist falls below it.  High water mark
+indicator is used by bgreclaimer to move buffers to freelist.

For me the description of the high water as stated here doesn't seem to
explain anything.

This section should have a description of how the reclaimer interacts
with the bgwriter logic. Do we put dirty buffers on the freelist that
are then cleaned by the bgwriter? Which buffers does the bgwriter write
out?

/*
+ * Move buffers with reference and usage_count as zero to freelist.
+ * By maintaining enough number of buffers on freelist (equal to
+ * high water mark of freelist), we drastically reduce the odds for
+ * backend's to perform clock sweep.

Move buffers with reference and a usage_count *of* zero to freelist. By
maintaining enough buffers in the freelist (up to the list's high water
mark), we drastically reduce the likelihood of individual backends
having to perform the clock sweep themselves.

+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */

The logic used here *definitely* needs to be documented in another form
somewhere in the source.

+void
+BgMoveBuffersToFreelist(void)
+{
+	uint32	num_to_free = 0;
+	uint32	recent_alloc = 0;
+	uint32  recent_backend_clocksweep = 0;
+	volatile uint32	next_victim = 0;
+
+	/* Execute the clock sweep */
+	for (;;)
+	{
+		uint32	tmp_num_to_free;
+		uint32	tmp_recent_alloc;
+		uint32  tmp_recent_backend_clocksweep;
+
+		StrategyGetFreelistAccessInfo(&tmp_num_to_free,
+									  &tmp_recent_alloc,
+									  &tmp_recent_backend_clocksweep);
+
+		num_to_free += tmp_num_to_free;
+		recent_alloc += tmp_recent_alloc;
+		recent_backend_clocksweep += tmp_recent_backend_clocksweep;
+
+		if (tmp_num_to_free == 0)
+			break;

num_to_free isn't a convincing name if I understand what this is doing
correctly. Maybe 'move_to_freelist', 'freelist_needed',
'needed_on_freelist' or something like that?

I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?

+		while (tmp_num_to_free > 0)
+		{
+			volatile BufferDesc *bufHdr;
+			bool	add_to_freelist = false;
+
+			/*
+			 * Choose next victim buffer to look if that can be moved
+			 * to freelist.
+			 */
+			StrategySyncNextVictimBuffer(&next_victim);
+
+			bufHdr = &BufferDescriptors[next_victim];
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot
+			 * move it to freelist; decrement the usage_count (unless pinned)
+			 * and keep scanning.
+			 */
+			LockBufHdr(bufHdr);

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

+/*
+ * Water mark indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakesup and begin
+ * adding buffer's to freelist until it reaches high water mark and then
+ * again goes back to sleep.
+ */

s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
/to freelist/to the freelist/; s/reaches high water/reaches the high water/

+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/* Percentage indicators for maintaining buffers on freelist */
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.2
+#define MIN_HIGH_WATER_MARK	5
+#define MAX_HIGH_WATER_MARK	2000

I'm confused. The high water mark percentage is smaller than the low
water mark?

What's the reasoning for these numbers? What's the justification for the
max of 2k buffers for the high watermark? That's not much on a busy
database with large s_b?

-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);

/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
+	 * We count buffer allocation requests so that the bgwriter or bgreclaimer
+	 * can know the rate of buffer consumption and report it as stats.  Note
+	 * that buffers recycled by a strategy object are intentionally not counted
+	 * here.
*/
StrategyControl->numBufferAllocs++;

/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * Remember the values of bgwriter and bgreclaimer latch so that they can
+	 * be set outside spin lock and try to get a buffer from the freelist.
*/
+	bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
bgwriterLatch = StrategyControl->bgwriterLatch;

I don't understand why these need to be grabbed under the spinlock?

if (bgwriterLatch)
- {
StrategyControl->bgwriterLatch = NULL;
- LWLockRelease(BufFreelistLock);
- SetLatch(bgwriterLatch);
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
- }

-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+	if (StrategyControl->firstFreeBuffer >= 0)
{
buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
@@ -169,35 +198,82 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
/* Unconditionally remove buffer from freelist */
StrategyControl->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
+		--StrategyControl->numFreeListBuffers;
+	}
+	else
+		StrategyControl->numBufferBackendClocksweep++;
+
+	SpinLockRelease(&StrategyControl->freelist_lck);

+	/* If bgwriterLatch is set, we need to waken the bgwriter */
+	if (bgwriterLatch)
+		SetLatch(bgwriterLatch);
+
+	/*
+	 * If the number of free buffers has fallen below the low water mark,
+	 * awaken the bgreclaimer to repopulate it.  bgreclaimerLatch is initialized in
+	 * early phase of BgReclaimer startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
+	 */
+	if (numFreeListBuffers < freelistLowWaterMark  && bgreclaimerLatch)
+		SetLatch(StrategyControl->bgreclaimerLatch);
+
+	if (buf != NULL)
+	{
/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
+		 * Try to get a buffer from the freelist.  Note that the freeNext fields
+		 * are considered to be protected by the freelist_lck not the
+		 * individual buffer spinlocks, so it's OK to manipulate them without
+		 * holding the buffer spinlock.
*/
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+		for(;;)
{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+			SpinLockAcquire(&StrategyControl->freelist_lck);
+
+			if (StrategyControl->firstFreeBuffer >= 0)
+			{
+				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+				Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+				/* Unconditionally remove buffer from freelist */
+				StrategyControl->firstFreeBuffer = buf->freeNext;
+				buf->freeNext = FREENEXT_NOT_IN_LIST;
+				--StrategyControl->numFreeListBuffers;
+
+				SpinLockRelease(&StrategyControl->freelist_lck);
+			}
+			else
+			{
+				StrategyControl->numBufferBackendClocksweep++;
+				SpinLockRelease(&StrategyControl->freelist_lck);
+				break;
+			}
}
-		UnlockBufHdr(buf);
}

I think it makes sense to break out this bit into its own
function. That'll make StrategyGetBuffer() a good bit easier to read.

/* Nothing on the freelist, so run the "clock sweep" algorithm */
trycounter = NBuffers;
+
for (;;)
{
-		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
+		volatile uint32	next_victim;

-		if (++StrategyControl->nextVictimBuffer >= NBuffers)
-		{
-			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
-		}
+		StrategySyncNextVictimBuffer(&next_victim);
+
+		buf = &BufferDescriptors[next_victim];

I'd also move this into its own function, but thats's more debatable.

/*
* If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -241,7 +317,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
void
StrategyFreeBuffer(volatile BufferDesc *buf)
{
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);

/*
* It is possible that we are told to put something in the freelist that
@@ -253,11 +329,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
if (buf->freeNext < 0)
StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
+	}
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{

Should maybe rather be named *Tail?

+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */

When/Why is that possible?

/*
+ * StrategyGetFreelistAccessInfo -- get information required by bgreclaimer
+ * to move unused buffers to freelist.
+ *
+ * The result is the number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs and buffer allocs not
+ * satisfied from freelist.
+ */
+void
+StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
+							  uint32 *num_buf_backend_clocksweep)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighWaterMark)
+		*num_buf_to_free = freelistHighWaterMark - curfreebuffers;
+	else
+		*num_buf_to_free = 0;
+
+	if (num_buf_alloc)
+	{
+		*num_buf_alloc = StrategyControl->numBufferAllocs;
+		StrategyControl->numBufferAllocs = 0;
+	}
+	if (num_buf_backend_clocksweep)
+	{
+		*num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
+		StrategyControl->numBufferBackendClocksweep = 0;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return;
+}

Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?

Some more general remarks:
* I think there's a fair amount of unexplained heuristics in here
* Are we sure that the freelist_lck spinlock won't cause pain? Right now
there will possibly be dozens of processes busily spinning on it... I
think it's a acceptable risk, but we should think about it.
* I'm not convinced that these changes can be made without also changing
the bgwriter logic. Have you measured whether there are differences in
how effective the bgwriter is? Not that it's very effective right now :)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#51)

Re: Scaling shared buffer eviction

Thanks for reviewing, Andres.

On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:

+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

This looks inconsistent.

It's exactly the same as what bgwriter.c does.

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

We really need a more centralized way to handle error cleanup in
auxiliary processes. The current state of affairs is really pretty
helter-skelter. But for this patch, I think we should aim to mimic
the existing style, as ugly as it is. I'm not sure whether Amit's got
the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
is probably a good idea.

+Two water mark indicators are used to maintain sufficient number of buffers
+on freelist.  Low water mark indicator is used by backends to wake bgreclaimer
+when the number of buffers in freelist falls below it.  High water mark
+indicator is used by bgreclaimer to move buffers to freelist.

For me the description of the high water as stated here doesn't seem to
explain anything.

Yeah, let me try to revise and expand on that a bit:

Background Reclaimer's Processing
---------------------------------

The background reclaimer runs the clock sweep to identify buffers that
are good candidates for eviction and puts them on the freelist. This
makes buffer allocation much faster, since removing a buffer from the
head of a linked list is much cheaper than linearly scanning the whole
buffer pool until a promising candidate is found. It's possible that
a buffer we add to the freelist may be accessed or even pinned before
it's evicted; if that happens, the backend that would have evicted it
will simply disregard it and take the next buffer instead (or run the
clock sweep itself, if necessary). However, to make sure that doesn't
happen too often, we need to keep the freelist as short as possible,
so that there won't be many other buffer accesses between when the
time a buffer is added to the freelist and the time when it's actually
evicted.

We use two water marks to control the activity of the bgreclaimer
process. Each time bgreclaimer is awoken, it will move buffers to the
freelist until the length of the free list reaches the high water
mark. It will then sleep. When the number of buffers on the freelist
reaches the low water mark, backends attempting to allocate new
buffers will set the bgreclaimer's latch, waking it up again. While
it's important for the high water mark to be small (for the reasons
described above), we also need to ensure adequate separation between
the low and high water marks, so that the bgreclaimer isn't constantly
being awoken to find just a handful of additional candidate buffers,
and we need to ensure that the low watermark is adequate to keep the
freelist from becoming completely empty before bgreclaimer has time to
wake up and beginning filling it again.

This section should have a description of how the reclaimer interacts
with the bgwriter logic. Do we put dirty buffers on the freelist that
are then cleaned by the bgwriter? Which buffers does the bgwriter write
out?

The bgwriter is cleaning ahead of the strategy point, and the
bgreclaimer is advancing the strategy point. I think we should
consider having the bgreclaimer wake the bgwriter if it comes across a
dirty buffer, because while the bgwriter only estimates the rate of
buffer allocation, bgreclaimer *knows* the rate of allocation, because
its own activity is tied to the allocation rate. I think there's the
potential for this kind of thing to make the background writer
significantly more effective than it is today, but I'm heavily in
favor of leaving it for a separate patch.

I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?

That doesn't really work unless there's some countervailing force to
eventually reduce it again; otherwise, it'd just converge to infinity.
And it doesn't really seem necessary at the moment.

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

Meh. I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

* Are we sure that the freelist_lck spinlock won't cause pain? Right now
there will possibly be dozens of processes busily spinning on it... I
think it's a acceptable risk, but we should think about it.

As you and I have talked about before, we could reduce contention here
by partitioning the freelist, or by using a CAS loop to pop items off
of it. But I am not convinced either is necessary; I think it's hard
for the system to accumulate enough people hitting the freelist
simultaneously to matter, because the stuff they've got to do between
one freelist access and the next is generally going to be something
much more expensive, like reading 8kB from the OS.

One question in my mind is whether we ought to separate this patch
into two - one for the changes to the locking regime, and another for
the addition of the bgreclaimer process. Those things are really two
different features, although they are tightly enough coupled that
maybe it's OK to keep them together. I also think it would be good to
get some statistics on how often regular backends are running the
clocksweep vs. how often bgreclaimer is satisfying their needs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#52)

Re: Scaling shared buffer eviction

On 2014-09-11 09:02:34 -0400, Robert Haas wrote:

Thanks for reviewing, Andres.

On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com> wrote:
+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
This looks inconsistent.
It's exactly the same as what bgwriter.c does.

So what? There's no code in common, so I see no reason to have one
signal handler using underscores and the next one camelcase names.

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

We really need a more centralized way to handle error cleanup in
auxiliary processes. The current state of affairs is really pretty
helter-skelter.

Agreed. There really should be three variants:
* full abort including support for transactions
* full abort without transactions being used (most background processes)
* abort without shared memory interactions

But for this patch, I think we should aim to mimic
the existing style, as ugly as it is.

Agreed.

Background Reclaimer's Processing
---------------------------------

The background reclaimer runs the clock sweep to identify buffers that
are good candidates for eviction and puts them on the freelist. This
makes buffer allocation much faster, since removing a buffer from the
head of a linked list is much cheaper than linearly scanning the whole
buffer pool until a promising candidate is found. It's possible that
a buffer we add to the freelist may be accessed or even pinned before
it's evicted; if that happens, the backend that would have evicted it
will simply disregard it and take the next buffer instead (or run the
clock sweep itself, if necessary). However, to make sure that doesn't
happen too often, we need to keep the freelist as short as possible,
so that there won't be many other buffer accesses between when the
time a buffer is added to the freelist and the time when it's actually
evicted.

We use two water marks to control the activity of the bgreclaimer
process. Each time bgreclaimer is awoken, it will move buffers to the
freelist until the length of the free list reaches the high water
mark. It will then sleep.

I wonder if we should recheck the number of freelist items before
sleeping. As the latch currently is reset before sleeping (IIRC) we
might miss being woken up soon. It very well might be that bgreclaim
needs to run for more than one cycle in a row to keep up...

This section should have a description of how the reclaimer interacts
with the bgwriter logic. Do we put dirty buffers on the freelist that
are then cleaned by the bgwriter? Which buffers does the bgwriter write
out?

The bgwriter is cleaning ahead of the strategy point, and the
bgreclaimer is advancing the strategy point.

That sentence, in some form, should be in the above paragraph.

I think we should
consider having the bgreclaimer wake the bgwriter if it comes across a
dirty buffer, because while the bgwriter only estimates the rate of
buffer allocation, bgreclaimer *knows* the rate of allocation, because
its own activity is tied to the allocation rate. I think there's the
potential for this kind of thing to make the background writer
significantly more effective than it is today, but I'm heavily in
favor of leaving it for a separate patch.

Yes, doing that sounds like a good plan. I'm happy with that being done
in a separate patch.

I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?

That doesn't really work unless there's some countervailing force to
eventually reduce it again; otherwise, it'd just converge to infinity.
And it doesn't really seem necessary at the moment.

Right, it obviously needs to go both ways. I'm a bit sceptic about
untunable, fixed, numbers proving to be accurate for widely varied
workloads.

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

Meh. I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

Well, this in theory increases the number of processes touching buffer
headers regularly. Currently, if you have one read IO intensive backend,
there's pretty much only process touching the cachelines. This will make
it two. I don't think it's unreasonable to try to reduce the cacheline
pingpong caused by that...

* Are we sure that the freelist_lck spinlock won't cause pain? Right now
there will possibly be dozens of processes busily spinning on it... I
think it's a acceptable risk, but we should think about it.

As you and I have talked about before, we could reduce contention here
by partitioning the freelist, or by using a CAS loop to pop items off
of it. But I am not convinced either is necessary; I think it's hard
for the system to accumulate enough people hitting the freelist
simultaneously to matter, because the stuff they've got to do between
one freelist access and the next is generally going to be something
much more expensive, like reading 8kB from the OS.

One question in my mind is whether we ought to separate this patch
into two - one for the changes to the locking regime, and another for
the addition of the bgreclaimer process. Those things are really two
different features, although they are tightly enough coupled that
maybe it's OK to keep them together.

I think it's ok to commit them together. Hm, although: It'd actually be
highly interesting to see the effect of replacing the freelist lock with
a spinlock without the rest of these changes. I think that's really a
number we want to see at least once.

I also think it would be good to
get some statistics on how often regular backends are running the
clocksweep vs. how often bgreclaimer is satisfying their needs.

I think that's necessary. The patch added buf_backend_clocksweep. Maybe
we just also need buf_backend_from_freelist?

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#54

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#52)

Re: Scaling shared buffer eviction

On Thu, Sep 11, 2014 at 6:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for reviewing, Andres.

On Thu, Sep 11, 2014 at 7:01 AM, Andres Freund <andres@2ndquadrant.com>

wrote:

+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
This looks inconsistent.
It's exactly the same as what bgwriter.c does.

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

We really need a more centralized way to handle error cleanup in
auxiliary processes. The current state of affairs is really pretty
helter-skelter. But for this patch, I think we should aim to mimic
the existing style, as ugly as it is. I'm not sure whether Amit's got
the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
is probably a good idea.

Code related to bgreclaimer logic itself doesn't take any LWLock, do
you suspect the same might be required due to some Signal/Interrupt
handling?

From myside, I have thought about what to keep for error cleanup based
on the working of bgreclaimer. However there is a chance that I have
missed something.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#55

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Amit Kapila (#54)

Re: Scaling shared buffer eviction

We really need a more centralized way to handle error cleanup in
auxiliary processes. The current state of affairs is really pretty
helter-skelter. But for this patch, I think we should aim to mimic
the existing style, as ugly as it is. I'm not sure whether Amit's got
the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
is probably a good idea.

Code related to bgreclaimer logic itself doesn't take any LWLock, do
you suspect the same might be required due to some Signal/Interrupt
handling?

I suspect it might creep in at some point at some unrelated place. Which
will only ever break in production scenarios. Say, a lwlock in in config
file processing. I seem to recall somebody seing a version of a patching
adding a lwlock there... :). Or a logging hook. Or ...

The savings from not doing LWLockReleaseAll() are nonexistant, so ...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Andres Freund (#55)

Re: Scaling shared buffer eviction

On Thu, Sep 11, 2014 at 6:59 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

We really need a more centralized way to handle error cleanup in
auxiliary processes. The current state of affairs is really pretty
helter-skelter. But for this patch, I think we should aim to mimic
the existing style, as ugly as it is. I'm not sure whether Amit's got
the logic correct, though: I'd agree LWLockReleaseAll(), at a minimum,
is probably a good idea.

Code related to bgreclaimer logic itself doesn't take any LWLock, do
you suspect the same might be required due to some Signal/Interrupt
handling?

I suspect it might creep in at some point at some unrelated place. Which
will only ever break in production scenarios. Say, a lwlock in in config
file processing.

Yeah, I suspected the same and checked that path, but couldn't find but
may be in some path it is there as the code has many flows.

I seem to recall somebody seing a version of a patching
adding a lwlock there... :). Or a logging hook. Or ...

The savings from not doing LWLockReleaseAll() are nonexistant, so ...

Okay, I shall add it in next version of patch and mention in comments
the reasons quoted by you.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#57

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#53)

Re: Scaling shared buffer eviction

On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:

It's exactly the same as what bgwriter.c does.

So what? There's no code in common, so I see no reason to have one
signal handler using underscores and the next one camelcase names.

/me shrugs.

It's not always possible to have things be consistent with each other
within a file and also with what gets done in other files. I'm not
sure we should fault patch authors for choosing a different one than
we would have chosen. FWIW, I probably would have done it the way
Amit did it. I don't actually care, though.

I wonder if we should recheck the number of freelist items before
sleeping. As the latch currently is reset before sleeping (IIRC) we
might miss being woken up soon. It very well might be that bgreclaim
needs to run for more than one cycle in a row to keep up...

The outer loop in BgMoveBuffersToFreelist() was added to address
precisely this point, which I raised in a previous review.

I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?

That doesn't really work unless there's some countervailing force to
eventually reduce it again; otherwise, it'd just converge to infinity.
And it doesn't really seem necessary at the moment.

Right, it obviously needs to go both ways. I'm a bit sceptic about
untunable, fixed, numbers proving to be accurate for widely varied
workloads.

Me, too, but I'm *even more* skeptical about making things complicated
on the pure theory that a simple solution can't be correct. I'm not
blind to the possibility that the current logic is inadequate, but
testing proves that it works well enough to produce a massive
performance boost over where we are now. When, and if, we develop a
theory about specifically how it falls short then, sure, let's adjust
it. But I think it would be a serious error to try to engineer a
perfect algorithm here based on the amount of testing that we can
reasonably do pre-commit. We have no chance of getting that right,
and I'd rather have an algorithm that is simple and imperfect than an
algorithm that is complex and still imperfect. No matter what, it's
better than what we have now.

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

Meh. I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

Well, this in theory increases the number of processes touching buffer
headers regularly. Currently, if you have one read IO intensive backend,
there's pretty much only process touching the cachelines. This will make
it two. I don't think it's unreasonable to try to reduce the cacheline
pingpong caused by that...

It's not unreasonable, but this is a good place to apply Knuth's first
law of optimization. There's no proof we need to do this, so let's
not until there is.

I also think it would be good to
get some statistics on how often regular backends are running the
clocksweep vs. how often bgreclaimer is satisfying their needs.

I think that's necessary. The patch added buf_backend_clocksweep. Maybe
we just also need buf_backend_from_freelist?

That's just (or should be just) buf_alloc - buf_backend_clocksweep.

I think buf_backend_clocksweep should really be called
buf_alloc_clocksweep, and should be added (in all relevant places)
right next to buf_alloc.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#58

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#57)

Re: Scaling shared buffer eviction

On 2014-09-11 09:48:10 -0400, Robert Haas wrote:

On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I wonder if we should recheck the number of freelist items before
sleeping. As the latch currently is reset before sleeping (IIRC) we
might miss being woken up soon. It very well might be that bgreclaim
needs to run for more than one cycle in a row to keep up...

The outer loop in BgMoveBuffersToFreelist() was added to address
precisely this point, which I raised in a previous review.

Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
up being continously busy without being visible.

I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?

That doesn't really work unless there's some countervailing force to
eventually reduce it again; otherwise, it'd just converge to infinity.
And it doesn't really seem necessary at the moment.

Right, it obviously needs to go both ways. I'm a bit sceptic about
untunable, fixed, numbers proving to be accurate for widely varied
workloads.

Me, too, but I'm *even more* skeptical about making things complicated
on the pure theory that a simple solution can't be correct.

Fair enough.

I'm not
blind to the possibility that the current logic is inadequate, but
testing proves that it works well enough to produce a massive
performance boost over where we are now.

But, to be honest, the testing so far was pretty "narrow" in the kind of
workloads that were run if I crossread things accurately. Don't get me
wrong, I'm *really* happy about having this patch, that just doesn't
mean every detail is right ;)

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

Meh. I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

Well, this in theory increases the number of processes touching buffer
headers regularly. Currently, if you have one read IO intensive backend,
there's pretty much only process touching the cachelines. This will make
it two. I don't think it's unreasonable to try to reduce the cacheline
pingpong caused by that...

It's not unreasonable, but this is a good place to apply Knuth's first
law of optimization. There's no proof we need to do this, so let's
not until there is.

That's true for new (pieces of) software; less so, when working with a
installed base that you might regress... But whatever.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#59

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#58)

Re: Scaling shared buffer eviction

On Thu, Sep 11, 2014 at 10:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-11 09:48:10 -0400, Robert Haas wrote:

On Thu, Sep 11, 2014 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I wonder if we should recheck the number of freelist items before
sleeping. As the latch currently is reset before sleeping (IIRC) we
might miss being woken up soon. It very well might be that bgreclaim
needs to run for more than one cycle in a row to keep up...

The outer loop in BgMoveBuffersToFreelist() was added to address
precisely this point, which I raised in a previous review.

Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
up being continously busy without being visible.

Good idea.

I'm not
blind to the possibility that the current logic is inadequate, but
testing proves that it works well enough to produce a massive
performance boost over where we are now.

But, to be honest, the testing so far was pretty "narrow" in the kind of
workloads that were run if I crossread things accurately. Don't get me
wrong, I'm *really* happy about having this patch, that just doesn't
mean every detail is right ;)

Oh, sure. Totally agreed. And, to the extent that we're improving
things based on actual testing, I'm A-OK with that. I just don't want
to start speculating, or we'll never get this thing off the ground.

Some possibly-interesting test cases would be:

(1) A read-only pgbench workload that is just a tiny bit larger than
shared_buffers, say size of shared_buffers plus 0.01%. Such workloads
tend to stress buffer eviction heavily.

(2) A workload that maximizes the rate of concurrent buffer eviction
relative to other tasks. Read-only pgbench is not bad for this, but
maybe somebody's got a better idea.

As I sort of mentioned in what I was writing for the bufmgr README,
there are, more or less, three ways this can fall down, at least that
I can see: (1) if the high water mark is too high, then we'll start
finding buffers in the freelist that have already been touched since
we added them: (2) if the low water mark is too low, the freelist will
run dry; and (3) if the low and high water marks are too close
together, the bgreclaimer will be constantly getting woken up and
going to sleep again. I can't personally think of a workload that
will enable us to get a better handle on those cases than
high-concurrency pgbench, but you're known to be ingenious at coming
up with destruction workloads, so if you have an idea, by all means
fire away.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#60

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Andres Freund (#51)

Re: Scaling shared buffer eviction

On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
+++ b/src/backend/postmaster/bgreclaimer.c
A fair number of comments in that file refer to bgwriter...

Will fix.

@@ -0,0 +1,302 @@

+/*-------------------------------------------------------------------------

+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.

+ * attempts to keep regular backends from having to run clock sweep

(which

+ * they would only do when they don't find a usable shared buffer from
+ * freelist to read in another page).
That's not really accurate. Freelist pages are often also needed to
write new pages, without reading anything in.

Agreed, but the same is used in bgwriter file as well; so if we
change here, we might want to change bgwriter file header as well.

I'd phrase it as "which
they only need to do if they don't find a victim buffer from the
freelist"

victim buffer sounds more like a buffer which it will get from
clock sweep, how about next candidate (same is used in function
header of StrategyGetBuffer()).

In the best scenario all requests
+ * for shared buffers will be fulfilled from freelist as the background
+ * reclaimer process always tries to maintain buffers on freelist.

However,

+ * regular backends are still empowered to run clock sweep to find a

usable

+ * buffer if the bgreclaimer fails to maintain enough buffers on

freelist.

"empowered" sounds strange to me. 'still can run the clock sweep'?

No harm in changing like what you are suggesting, however the same is
used in file header of bgwriter.c as well, so I think lets keep this usage
as
it is because there is no correctness issue here.

+ * The bgwriter is started by the postmaster as soon as the startup

subprocess

+ * finishes, or as soon as recovery begins if we are doing archive

recovery.

Why only archive recovery? I guess (only read this far...) it's not just
during InArchiveRecoveryb recovery but also StandbyMode?

It will be for both.

But I don't see
why we shouldn't use it during normal crash recovery. That's also often
painfully slow and the reclaimer could help? Less, but still.

Yes, it might improve a bit, however the main benefit with this patch is
under heavy load which means that it mainly addresses contention and
in case of crash recovery there will not be any contention because there
will be no backend processes.

Also I think to enable bgreclaimer during crash recovery, we need to have
one new signal which is not a problem, but it will make more sense to enable
it later based on benefit if it is there.

+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);

This looks inconsistent.

I have kept based on bgwriter, so not sure if it's good to change.
However I we want consistent in naming, I would like to keep
something like:

ReclaimShutdownHandler
ReclaimQuickDieHandler
..
..

+     /*
+      * If an exception is encountered, processing resumes here.
+      *
+      * See notes in postgres.c about the design of this coding.
+      */
+     if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+     {
+             /* Since not using PG_TRY, must reset error stack by hand

+ error_context_stack = NULL;

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

I will add LWLockReleaseAll() in exception handling as discussed
elsewhere in thread.

+
+             /* Now we can allow interrupts again */
+             RESUME_INTERRUPTS();
Other processes sleep for a second here, I think that's a good
idea. E.g. that bit:

Agreed, will make change as per suggestion.

+     /*
+      * Loop forever
+      */
+     for (;;)
+     {
+             int                     rc;
+
+
+             /*
+              * Backend will signal bgreclaimer when the number of

buffers in

+              * freelist falls below than low water mark of freelist.
+              */
+             rc = WaitLatch(&MyProc->procLatch,
+                                        WL_LATCH_SET |

WL_POSTMASTER_DEATH,

+ -1);

That's probably not going to work well directly after a (re)start of
bgreclaim (depending on how you handle the water mark, I'll see in a
bit).

Could you please be more specific here?

+Background Reclaimer's Processing
+---------------------------------

+Two water mark indicators are used to maintain sufficient number of

buffers

+on freelist. Low water mark indicator is used by backends to wake

bgreclaimer

+when the number of buffers in freelist falls below it.  High water mark
+indicator is used by bgreclaimer to move buffers to freelist.
For me the description of the high water as stated here doesn't seem to
explain anything.

This section should have a description of how the reclaimer interacts
with the bgwriter logic. Do we put dirty buffers on the freelist that
are then cleaned by the bgwriter? Which buffers does the bgwriter write
out?

As discussed in thread, I will change this accordingly.

/*
+ * Move buffers with reference and usage_count as zero to freelist.
+ * By maintaining enough number of buffers on freelist (equal to
+ * high water mark of freelist), we drastically reduce the odds for
+ * backend's to perform clock sweep.
Move buffers with reference and a usage_count *of* zero to freelist. By
maintaining enough buffers in the freelist (up to the list's high water
mark), we drastically reduce the likelihood of individual backends
having to perform the clock sweep themselves.

Okay will rephrase it as per your suggestion.

+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */
The logic used here *definitely* needs to be documented in another form
somewhere in the source.

I think the way Robert has suggested to modify Readme adresses
this to an extent, however if you think it is better to go in more
detail, then I can expand function header of function
BgMoveBuffersToFreelist() or on top of bgreclaimer.c, what do you prefer?

+
+             if (tmp_num_to_free == 0)
+                     break;
num_to_free isn't a convincing name if I understand what this is doing
correctly. Maybe 'move_to_freelist', 'freelist_needed',
'needed_on_freelist' or something like that?

I think keeping num or count in name could be more helpful as it is used
to loop for finding usable buffers. How about 'num_needed_on_freelist',
without any count or num it sounds more like a boolean variable.

I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?
+             while (tmp_num_to_free > 0)
+             {

+                     /*
+                      * If the buffer is pinned or has a nonzero

usage_count, we cannot

+ * move it to freelist; decrement the usage_count

(unless pinned)

+                      * and keep scanning.
+                      */
+                     LockBufHdr(bufHdr);
Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

As per discussion, the conclusion seems to be that we can do some
more tests to see if we need such a change, I will do more tests on
the lines suggested by Robert in below mails and then we can decide
if any thing is required.

+/*
+ * Water mark indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakesup and begin
+ * adding buffer's to freelist until it reaches high water mark and

then

+ * again goes back to sleep.
+ */

s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
/to freelist/to the freelist/; s/reaches high water/reaches the high

water/

Will change as per suggestions.

+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/* Percentage indicators for maintaining buffers on freelist */
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT     0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT      0.2
+#define MIN_HIGH_WATER_MARK  5
+#define MAX_HIGH_WATER_MARK  2000

I'm confused. The high water mark percentage is smaller than the low
water mark?

High water mark is percentage of NBuffers (shared_buffers).
Low water mark is percentage of High water mark.

I will add a comment.

What's the reasoning for these numbers?

Based on experiments with different amount of shared buffers

What's the justification for the
max of 2k buffers for the high watermark? That's not much on a busy
database with large s_b?

I have done tests at various loads upto 15GB of shared_buffers
and having data size as 44GB, 2000 turns out to be good enough
number of buffers on freelist(based on performance data).

Another thing I have mentioned in begining was to have guc for
these thresholds, but I think it will be much more difficult for
user to decide these values.

I think for now it is okay to fix these numbers as in patch
and we can anyway go back to add guc's or change them based
on more tests if anything turns out that needs further tuning.

-     *lock_held = true;
-     LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+     SpinLockAcquire(&StrategyControl->freelist_lck);

+ bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
bgwriterLatch = StrategyControl->bgwriterLatch;

I don't understand why these need to be grabbed under the spinlock?

In earlier version of patch, it was done without spinklock,
however Robert has given comment that as it was previously done
with BufFreelist lock, these should be under spinlock (atleast
that is what I understood) and I tested the patch again by having
them under spinlock and didn't find any difference, so I have moved
them under spinlock.

+ if (numFreeListBuffers < freelistLowWaterMark &&

bgreclaimerLatch)

+             SetLatch(StrategyControl->bgreclaimerLatch);
+

}
- UnlockBufHdr(buf);
}

I think it makes sense to break out this bit into its own
function. That'll make StrategyGetBuffer() a good bit easier to read.

I will try to move it into a new function GetBufferFromFreelist().

/* Nothing on the freelist, so run the "clock sweep" algorithm */
trycounter = NBuffers;
+

...

+ buf = &BufferDescriptors[next_victim];

I'd also move this into its own function, but thats's more debatable.

As we have not touched much of this part of code, so lets refactor
this code separately if required.

+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{

Should maybe rather be named *Tail?

Tail is better suited than End, I will change it.

+     bool            freed = false;
+     SpinLockAcquire(&StrategyControl->freelist_lck);
+
+     /*
+      * It is possible that we are told to put something in the

freelist that

+      * is already in it; don't screw up the list if so.
+      */

When/Why is that possible?

As we are doing clocksweep which can come across the same buffer
again incase it didn't find sufficient buffers to put on freelist in one
cycle, so this function needs to ensure the same.

+void
+StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32

*num_buf_alloc,

+ uint32

*num_buf_backend_clocksweep)

+
+     if (num_buf_alloc)
+     {
+             *num_buf_alloc = StrategyControl->numBufferAllocs;
+             StrategyControl->numBufferAllocs = 0;
+     }
+     if (num_buf_backend_clocksweep)
+     {
+             *num_buf_backend_clocksweep =

StrategyControl->numBufferBackendClocksweep;

+             StrategyControl->numBufferBackendClocksweep = 0;
+     }
+     SpinLockRelease(&StrategyControl->freelist_lck);
+
+     return;
+}
Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?

This API is more or less designed on lines of StrategySyncStart().
Currently there is no case where we can't do it unconditional (same is
true for num_buf_backend_clocksweep), however there is some merit
to keep it in sync with existing API. Having said that if you feel we
should
go about doing it unconditionally, I will change it in next patch.

Some more general remarks:
* I think there's a fair amount of unexplained heuristics in here

I think after addressing the comments given by you above,
the situation will be better.

* Are we sure that the freelist_lck spinlock won't cause pain? Right now
there will possibly be dozens of processes busily spinning on it... I
think it's a acceptable risk, but we should think about it.

The situation seems to be better by having spinlock, rather than
by using LWLock, however if there is anything that causes pain,
I think we might want to consider partitioning the free list, but lets
keep that for other day.

* I'm not convinced that these changes can be made without also changing
the bgwriter logic. Have you measured whether there are differences in
how effective the bgwriter is? Not that it's very effective right now :)

Current changes doesn't do anything to make bgwriter more or
less effective, however as discussed elsewhere in thread that it
makes sense for bgreclaimer to notify bgwriter in some situtions,
that can make bgwriter more effective than now, however as concluded
there lets do it in separate patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#61

Ants Aasma

ants@cybertec.at

over 11 years ago

In reply to: Andres Freund (#53)

Re: Scaling shared buffer eviction

On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

Meh. I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

Well, this in theory increases the number of processes touching buffer
headers regularly. Currently, if you have one read IO intensive backend,
there's pretty much only process touching the cachelines. This will make
it two. I don't think it's unreasonable to try to reduce the cacheline
pingpong caused by that...

I don't think it will help much. A pinned buffer is pretty likely to
be in modified state in the cache of the cpu of the pinning backend.
Even taking a look at the refcount will trigger a writeback and
demotion to shared state. When time comes to unpin the buffer the
cacheline must again be promoted to exclusive state introducing
coherency traffic. Not locking the buffer only saves transfering the
cacheline back to the pinning backend, not a huge amount of savings.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#62

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Ants Aasma (#61)

Re: Scaling shared buffer eviction

On 2014-09-12 12:38:48 +0300, Ants Aasma wrote:

On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.

Meh. I'm not in favor of adding more funny games with locking unless
we can prove they're necessary for performance.

Well, this in theory increases the number of processes touching buffer
headers regularly. Currently, if you have one read IO intensive backend,
there's pretty much only process touching the cachelines. This will make
it two. I don't think it's unreasonable to try to reduce the cacheline
pingpong caused by that...

I don't think it will help much. A pinned buffer is pretty likely to
be in modified state in the cache of the cpu of the pinning backend.

Right. Unless you're on a MOESI platforms. I'd really like to know why
that's not more widely used.

Even taking a look at the refcount will trigger a writeback and
demotion to shared state. When time comes to unpin the buffer the
cacheline must again be promoted to exclusive state introducing
coherency traffic. Not locking the buffer only saves transfering the
cacheline back to the pinning backend, not a huge amount of savings.

Yes. But: In many, if not most, cases the cacheline will be read a
couple times before modifying it via the spinlock.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63

Gregory Smith

gregsmithpgsql@gmail.com

over 11 years ago

In reply to: Andres Freund (#51)

Re: Scaling shared buffer eviction

On 9/11/14, 7:01 AM, Andres Freund wrote:

I'm not convinced that these changes can be made without also changing
the bgwriter logic. Have you measured whether there are differences in
how effective the bgwriter is? Not that it's very effective right now :)

The current background writer tuning went out of its way to do nothing
when it wasn't clear there was something that always worked. What
happened with all of the really clever schemes was that they worked on
some workloads, and just trashed others. Most of the gain from the 8.3
rewrite came from looking at well theorized ideas for how to handle
things like pre-emptive LRU scanning for writes, and just throwing them
out altogether in favor of ignoring the problem. The magic numbers left
in or added to the code were tuned to do very little work,
intentionally. If anything, since then the pressure to do nothing has
gone up in the default install, because now people are very concerned
about extra wakeups using power.

To find bad cases before, I was running about 30 different test
combinations by the end, Heikki was running another set in the EDB lab,
I believe there was a lab at NTT running their own set too. What went in
was the combination that didn't break any of them badly--not the one
that performed best on the good workloads.

This looks like it's squashed one of the very fundamental buffer scaling
issues though; well done Amit. What I'd like to see is preserving the
heart of that while touching as little as possible. When in doubt, do
nothing; let the backends suck it up and do the work themselves.

I had to take a health break from community development for a while, and
I'm hoping to jump back into review again for the rest of the current
development cycle. I'll go back to my notes and try to recreate the
pathological cases that plagued both the 8.3 BGW rewrite and the aborted
9.2 fsync spreading effort I did; get those running again and see how
they do on this new approach. I have a decent sized 24 core server that
should be good enough for this job. I'll see what I can do.

--
Greg Smith greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#64

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#60)

2 attachment(s)

Re: Scaling shared buffer eviction

On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres@2ndquadrant.com>

wrote:

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
+++ b/src/backend/postmaster/bgreclaimer.c
A fair number of comments in that file refer to bgwriter...
Will fix.

Fixed.

@@ -0,0 +1,302 @@

+/*-------------------------------------------------------------------------

+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.

+ * attempts to keep regular backends from having to run clock sweep

(which

+ * they would only do when they don't find a usable shared buffer

from

+ * freelist to read in another page).

That's not really accurate. Freelist pages are often also needed to
write new pages, without reading anything in.

Agreed, but the same is used in bgwriter file as well; so if we
change here, we might want to change bgwriter file header as well.

I have just changed bgreclaimer for this comment, if the same
is required for bgwriter, I can create a separate patch as that
change is not related to this patch, so I thought it is better
to keep it separate.

I'd phrase it as "which
they only need to do if they don't find a victim buffer from the
freelist"

victim buffer sounds more like a buffer which it will get from
clock sweep, how about next candidate (same is used in function
header of StrategyGetBuffer()).

Fixed.

In the best scenario all requests
+ * for shared buffers will be fulfilled from freelist as the

background

+ * reclaimer process always tries to maintain buffers on freelist.

However,

+ * regular backends are still empowered to run clock sweep to find a

usable

+ * buffer if the bgreclaimer fails to maintain enough buffers on

freelist.

"empowered" sounds strange to me. 'still can run the clock sweep'?

No harm in changing like what you are suggesting, however the same is
used in file header of bgwriter.c as well, so I think lets keep this

usage as

it is because there is no correctness issue here.

Not changed anything for this comment.

+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
This looks inconsistent.
I have kept based on bgwriter, so not sure if it's good to change.
However I we want consistent in naming, I would like to keep
something like:

ReclaimShutdownHandler
ReclaimQuickDieHandler
..
..

Changed function names to make them consistent.

+     /*
+      * If an exception is encountered, processing resumes here.
+      *
+      * See notes in postgres.c about the design of this coding.
+      */
+     if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+     {
+             /* Since not using PG_TRY, must reset error stack by

hand */

+ error_context_stack = NULL;

..

No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
good idea, regardless of it possibly being true today (which I'm not
sure about yet).

I will add LWLockReleaseAll() in exception handling as discussed
elsewhere in thread.

Done.

+
+             /* Now we can allow interrupts again */
+             RESUME_INTERRUPTS();
Other processes sleep for a second here, I think that's a good
idea. E.g. that bit:
Agreed, will make change as per suggestion.

Done.

+     /*
+      * Loop forever
+      */
+     for (;;)
+     {
+             int                     rc;
+
+
+             /*
+              * Backend will signal bgreclaimer when the number of

buffers in

+              * freelist falls below than low water mark of freelist.
+              */
+             rc = WaitLatch(&MyProc->procLatch,
+                                        WL_LATCH_SET |

WL_POSTMASTER_DEATH,

+ -1);

That's probably not going to work well directly after a (re)start of
bgreclaim (depending on how you handle the water mark, I'll see in a
bit).

Could you please be more specific here?

I wasn't sure if any change is required here, so kept the code
as it is.

+Background Reclaimer's Processing
+---------------------------------

..

+Two water mark indicators are used to maintain sufficient number of

buffers

+on freelist. Low water mark indicator is used by backends to wake

bgreclaimer

+when the number of buffers in freelist falls below it. High water

mark

+indicator is used by bgreclaimer to move buffers to freelist.

For me the description of the high water as stated here doesn't seem to
explain anything.

This section should have a description of how the reclaimer interacts
with the bgwriter logic. Do we put dirty buffers on the freelist that
are then cleaned by the bgwriter? Which buffers does the bgwriter write
out?

As discussed in thread, I will change this accordingly.

Done

Move buffers with reference and a usage_count *of* zero to freelist. By
maintaining enough buffers in the freelist (up to the list's high water
mark), we drastically reduce the likelihood of individual backends
having to perform the clock sweep themselves.

Okay will rephrase it as per your suggestion.

Done

+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */
The logic used here *definitely* needs to be documented in another form
somewhere in the source.
I think the way Robert has suggested to modify Readme adresses
this to an extent, however if you think it is better to go in more
detail, then I can expand function header of function
BgMoveBuffersToFreelist() or on top of bgreclaimer.c, what do you prefer?
+
+             if (tmp_num_to_free == 0)
+                     break;
num_to_free isn't a convincing name if I understand what this is doing
correctly. Maybe 'move_to_freelist', 'freelist_needed',
'needed_on_freelist' or something like that?
I think keeping num or count in name could be more helpful as it is used
to loop for finding usable buffers. How about 'num_needed_on_freelist',
without any count or num it sounds more like a boolean variable.

Changed.

I wonder if we don't want to increase the high watermark when
tmp_recent_backend_clocksweep > 0?
+             while (tmp_num_to_free > 0)
+             {
..
+                     /*
+                      * If the buffer is pinned or has a nonzero

usage_count, we cannot

+ * move it to freelist; decrement the

usage_count (unless pinned)

+                      * and keep scanning.
+                      */
+                     LockBufHdr(bufHdr);
Hm. Perhaps we should do a bufHdr->refcount != zero check without
locking here? The atomic op will transfer the cacheline exclusively to
the reclaimer's CPU. Even though it very shortly afterwards will be
touched afterwards by the pinning backend.
As per discussion, the conclusion seems to be that we can do some
more tests to see if we need such a change, I will do more tests on
the lines suggested by Robert in below mails and then we can decide
if any thing is required.

Still performance data related to this needs to be collected.

+/*
+ * Water mark indicators for maintaining buffers on freelist.  When

the

+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakesup and

begin

+ * adding buffer's to freelist until it reaches high water mark and

then

+ * again goes back to sleep.
+ */

s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
/to freelist/to the freelist/; s/reaches high water/reaches the high

water/

Will change as per suggestions.

Done

+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/* Percentage indicators for maintaining buffers on freelist */
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT     0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT      0.2
+#define MIN_HIGH_WATER_MARK  5
+#define MAX_HIGH_WATER_MARK  2000
I'm confused. The high water mark percentage is smaller than the low
water mark?
High water mark is percentage of NBuffers (shared_buffers).
Low water mark is percentage of High water mark.

I will add a comment.

Added a comment.

-     *lock_held = true;
-     LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+     SpinLockAcquire(&StrategyControl->freelist_lck);
..

+ bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
bgwriterLatch = StrategyControl->bgwriterLatch;

I don't understand why these need to be grabbed under the spinlock?

In earlier version of patch, it was done without spinklock,
however Robert has given comment that as it was previously done
with BufFreelist lock, these should be under spinlock (atleast
that is what I understood) and I tested the patch again by having
them under spinlock and didn't find any difference, so I have moved
them under spinlock.

Not changed anything related to this.

..

+ if (numFreeListBuffers < freelistLowWaterMark &&

bgreclaimerLatch)

+             SetLatch(StrategyControl->bgreclaimerLatch);
+
..

}
- UnlockBufHdr(buf);
}

I think it makes sense to break out this bit into its own
function. That'll make StrategyGetBuffer() a good bit easier to read.

I will try to move it into a new function GetBufferFromFreelist().

Moved this part of code into new function.

+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of

freelist

+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
Should maybe rather be named *Tail?
Tail is better suited than End, I will change it.

Changed.

+void
+StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32

*num_buf_alloc,

+ uint32

*num_buf_backend_clocksweep)

+
+     if (num_buf_alloc)
+     {
+             *num_buf_alloc = StrategyControl->numBufferAllocs;
+             StrategyControl->numBufferAllocs = 0;
+     }
+     if (num_buf_backend_clocksweep)
+     {
+             *num_buf_backend_clocksweep =

StrategyControl->numBufferBackendClocksweep;

+             StrategyControl->numBufferBackendClocksweep = 0;
+     }
+     SpinLockRelease(&StrategyControl->freelist_lck);
+
+     return;
+}
Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?
This API is more or less designed on lines of StrategySyncStart().
Currently there is no case where we can't do it unconditional (same is
true for num_buf_backend_clocksweep), however there is some merit
to keep it in sync with existing API. Having said that if you feel we

should

go about doing it unconditionally, I will change it in next patch.

I have made it unconditional.

Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
up being continously busy without being visible.

Done and I have removed pgstat_send_bgwriter() call from
bgreclaimer loop, as after this change calling it there becomes
redundant.

Apart from this, I have changed kid of newly added function as
due to recent commit, the oid I was using is no longer available.

I have taken tpc-b data as well, it is with previous version of patch,
however the recent version hasn't changed much to impact performance
data. It takes long time to get tpc-b data, thats why I could not report
it previously.

I will post the data with the latest patch separately (where I will focus
on new cases discussed between Robert and Andres).

TPC-B Performance Data
----------------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
wal_buffers = 256MB
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 30mins

Client_count/Patch_ver 32 64 128 HEAD 420 535 556 Patch (sbe_v8) 431 537
568

About performance data
-------------------------------------
1. This data is a median of 3 runs, individual run data can
be found in attached document (perf_read_scalability_data_v8.ods)
2. There is not much difference in performance between Head and patch
which shows that this patch hasn't regressed tpc-b case.

Steps to take tpc-b data, for each run:
1. start server
2. drop db
3. create db
4. initialize db
5. run tpc-b pgbench (used prepared statement)
6. checkpoint
7. stop server

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

perf_read_scalability_data_v8.odsapplication/vnd.oasis.opendocument.spreadsheet; name=perf_read_scalability_data_v8.odsDownload

scalable_buffer_eviction_v9.patchapplication/octet-stream; name=scalable_buffer_eviction_v9.patchDownload

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 354165b..fb67cc9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -815,6 +815,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Number of buffers allocated</entry>
      </row>
      <row>
+      <entry><structfield>buffers_backend_clocksweep</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of buffer allocations that are not satisfied from
+      freelist</entry>
+     </row>
+     <row>
       <entry><structfield>stats_reset</></entry>
       <entry><type>timestamp with time zone</type></entry>
       <entry>Time at which these statistics were last reset</entry>
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..38698b0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "postmaster/bgreclaimer.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -179,7 +180,8 @@ static IndexList *ILHead = NULL;
  *	 AuxiliaryProcessMain
  *
  *	 The main entry point for auxiliary processes, such as the bgwriter,
- *	 walwriter, walreceiver, bootstrapper and the shared memory checker code.
+ *	 walwriter, walreceiver, bgreclaimer, bootstrapper and the shared
+ *	 memory checker code.
  *
  *	 This code is here just because of historical reasons.
  */
@@ -323,6 +325,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case WalReceiverProcess:
 				statmsg = "wal receiver process";
 				break;
+			case BgReclaimerProcess:
+				statmsg = "reclaimer process";
+				break;
 			default:
 				statmsg = "??? process";
 				break;
@@ -437,6 +442,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			WalReceiverMain();
 			proc_exit(1);		/* should never return */
 
+		case BgReclaimerProcess:
+			/* don't set signals, bgreclaimer has its own agenda */
+			BackgroundReclaimerMain();
+			proc_exit(1);		/* should never return */
+
 		default:
 			elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
 			proc_exit(1);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 22663c3..97c23c9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -712,6 +712,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_written_backend() AS buffers_backend,
         pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
         pg_stat_get_buf_alloc() AS buffers_alloc,
+        pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..168d0d8 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o bgreclaimer.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..1c64900
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c
@@ -0,0 +1,306 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they only need to do if they don't find the next candidate buffer from
+ * the freelist).  In the best scenario all requests for shared buffers will
+ * be fulfilled from freelist as the background reclaimer process always tries
+ * to maintain buffers on freelist.  However, regular backends are still
+ * empowered to run clock sweep to find a usable buffer if the bgreclaimer
+ * fails to maintain enough buffers on freelist.
+ *
+ * The bgreclaimer is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGTERM, which instructs the bgreclaimer to exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the bgreclaimer will
+ * simply abort and exit on SIGQUIT.
+ *
+ * If the bgreclaimer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/bgreclaimer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgreclaimer.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+
+static void ReclaimQuickDieHandler(SIGNAL_ARGS);
+static void ReclaimSigHupHandler(SIGNAL_ARGS);
+static void ReclaimShutdownHandler(SIGNAL_ARGS);
+static void ReclaimSigUsr1Handler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for bgreclaim process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+BackgroundReclaimerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext bgreclaim_context;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.  (bgreclaim probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us.
+	 *
+	 * bgreclaim doesn't participate in ProcSignal signalling, but a SIGUSR1
+	 * handler is still needed for latch wakeups.
+	 */
+	pqsignal(SIGHUP, ReclaimSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, ReclaimShutdownHandler);		/* shutdown */
+	pqsignal(SIGQUIT, ReclaimQuickDieHandler);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, ReclaimSigUsr1Handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  As of now, the memory allocation can be done
+	 * only during processing of SIGHUP signal.
+	 */
+	bgreclaim_context = AllocSetContextCreate(TopMemoryContext,
+											 "Background Reclaim",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(bgreclaim_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().  We don't have very many resources to worry
+		 * about in bgreclaim, but we do have buffers and file descriptors.
+		 * Currently we don't use LWLocks in bgreclaimer, however it can be
+		 * added in future in bgreclaimer or in config processing path and there
+		 * is no saving from not doing so.
+		 */
+		LWLockReleaseAll();
+		UnlockBuffers();
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(bgreclaim_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+
+		/*
+		 * Sleep at least 1 second after any error.  We don't want to be
+		 * filling the error logs as fast as we can.
+		 */
+		pg_usleep(1000000L);
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	StrategyInitBgReclaimerLatch(&MyProc->procLatch);
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		int			rc;
+
+		/* Clear any already-pending wakeups */
+		ResetLatch(&MyProc->procLatch);
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Normal exit from the bgreclaimer is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Backend will signal bgreclaimer when the number of buffers in
+		 * freelist falls below than low water mark of freelist.
+		 */
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+					   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgMoveBuffersToFreelist();
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			exit(1);
+	}
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * ReclaimQuickDieHandler() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+ReclaimQuickDieHandler(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+ReclaimSigHupHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_SIGHUP = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGTERM: set flag to shutdown and exit */
+static void
+ReclaimShutdownHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	shutdown_requested = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+ReclaimSigUsr1Handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	latch_sigusr1_handler();
+
+	errno = save_errno;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7f41a5..7475e5a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5021,6 +5021,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.buf_backend_clocksweep += msg->m_buf_backend_clocksweep;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 14535c8..565cf4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -143,13 +143,13 @@
  * authorization phase).  This is used mainly to keep track of how many
  * children we have and send them appropriate signals when necessary.
  *
- * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list.  Autovacuum worker and walsender are in it.
- * Also, "dead_end" children are in it: these are children launched just for
- * the purpose of sending a friendly rejection message to a would-be client.
- * We must track them because they are attached to shared memory, but we know
- * they will never become live backends.  dead_end children are not assigned a
- * PMChildSlot.
+ * "Special" children such as the startup, bgwriter, bgreclaimer and
+ * autovacuum launcher tasks are not in this list.  Autovacuum worker and
+ * walsender are in it.  Also, "dead_end" children are in it: these are
+ * children launched just for the purpose of sending a friendly rejection
+ * message to a would-be client.  We must track them because they are attached
+ * to shared memory, but we know they will never become live backends.
+ * dead_end children are not assigned a PMChildSlot.
  *
  * Background workers that request shared memory access during registration are
  * in this list, too.
@@ -243,7 +243,8 @@ static pid_t StartupPID = 0,
 			AutoVacPID = 0,
 			PgArchPID = 0,
 			PgStatPID = 0,
-			SysLoggerPID = 0;
+			SysLoggerPID = 0,
+			BgReclaimerPID = 0;
 
 /* Startup/shutdown state */
 #define			NoShutdown		0
@@ -269,13 +270,13 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * hot standby during archive recovery.
  *
  * When the startup process is ready to start archive recovery, it signals the
- * postmaster, and we switch to PM_RECOVERY state. The background writer and
- * checkpointer are launched, while the startup process continues applying WAL.
- * If Hot Standby is enabled, then, after reaching a consistent point in WAL
- * redo, startup process signals us again, and we switch to PM_HOT_STANDBY
- * state and begin accepting connections to perform read-only queries.  When
- * archive recovery is finished, the startup process exits with exit code 0
- * and we switch to PM_RUN state.
+ * postmaster, and we switch to PM_RECOVERY state. The background writer,
+ * background reclaimer and checkpointer are launched, while the startup
+ * process continues applying WAL.  If Hot Standby is enabled, then, after
+ * reaching a consistent point in WAL redo, startup process signals us again,
+ * and we switch to PM_HOT_STANDBY state and begin accepting connections to
+ * perform read-only queries.  When archive recovery is finished, the startup
+ * process exits with exit code 0 and we switch to PM_RUN state.
  *
  * Normal child backends can only be launched when we are in PM_RUN or
  * PM_HOT_STANDBY state.  (We also allow launch of normal
@@ -505,6 +506,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
+#define StartBackgroundReclaimer() StartChildProcess(BgReclaimerProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -568,8 +570,8 @@ PostmasterMain(int argc, char *argv[])
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-	 * postmaster/syslogger.c, postmaster/bgworker.c and
-	 * postmaster/checkpointer.c.
+	 * postmaster/syslogger.c, postmaster/bgworker.c, postmaster/bgreclaimer.c
+	 * and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1583,7 +1585,8 @@ ServerLoop(void)
 		/*
 		 * If no background writer process is running, and we are not in a
 		 * state that prevents it, start one.  It doesn't matter if this
-		 * fails, we'll just try again later.  Likewise for the checkpointer.
+		 * fails, we'll just try again later.  Likewise for the checkpointer
+		 * and bgreclaimer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 			pmState == PM_HOT_STANDBY)
@@ -1592,6 +1595,8 @@ ServerLoop(void)
 				CheckpointerPID = StartCheckpointer();
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 		}
 
 		/*
@@ -2330,6 +2335,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(SysLoggerPID, SIGHUP);
 		if (PgStatPID != 0)
 			signal_child(PgStatPID, SIGHUP);
+		if (BgReclaimerPID != 0)
+			signal_child(BgReclaimerPID, SIGHUP);
 
 		/* Reload authentication config files too */
 		if (!load_hba())
@@ -2398,6 +2405,9 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
+				/* and the bgreclaimer too */
+				if (BgReclaimerPID != 0)
+					signal_child(BgReclaimerPID, SIGTERM);
 
 				/*
 				 * If we're in recovery, we can't kill the startup process
@@ -2440,14 +2450,16 @@ pmdie(SIGNAL_ARGS)
 				signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
+			if (BgReclaimerPID != 0)
+				signal_child(BgReclaimerPID, SIGTERM);
 			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
 				/*
-				 * Only startup, bgwriter, walreceiver, unconnected bgworkers,
-				 * and/or checkpointer should be active in this state; we just
-				 * signaled the first four, and we don't want to kill
-				 * checkpointer yet.
+				 * Only startup, bgwriter, walreceiver, bgreclaimer,
+				 * unconnected bgworkers, and/or checkpointer should be
+				 * active in this state; we just signaled the first five,
+				 * and we don't want to kill checkpointer yet.
 				 */
 				pmState = PM_WAIT_BACKENDS;
 			}
@@ -2600,6 +2612,8 @@ reaper(SIGNAL_ARGS)
 				BgWriterPID = StartBackgroundWriter();
 			if (WalWriterPID == 0)
 				WalWriterPID = StartWalWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2625,7 +2639,8 @@ reaper(SIGNAL_ARGS)
 		/*
 		 * Was it the bgwriter?  Normal exit can be ignored; we'll start a new
 		 * one at the next iteration of the postmaster's main loop, if
-		 * necessary.  Any other exit condition is treated as a crash.
+		 * necessary.  Any other exit condition is treated as a crash.  Likewise
+		 * for bgreclaimer.
 		 */
 		if (pid == BgWriterPID)
 		{
@@ -2636,6 +2651,17 @@ reaper(SIGNAL_ARGS)
 			continue;
 		}
 
+		if (pid == BgReclaimerPID)
+		{
+			BgReclaimerPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background reclaimer process"));
+			continue;
+		}
+
+
+
 		/*
 		 * Was it the checkpointer?
 		 */
@@ -2997,7 +3023,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, bgreclaimer or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3201,6 +3227,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the bgreclaimer too */
+	if (pid == BgReclaimerPID)
+		BgReclaimerPID = 0;
+	else if (BgReclaimerPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) BgReclaimerPID)));
+		signal_child(BgReclaimerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/*
 	 * Force a power-cycle of the pgarch process too.  (This isn't absolutely
 	 * necessary, but it seems like a good idea for robustness, and it
@@ -3371,14 +3409,14 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
 		 * (including autovac workers), no bgworkers (including unconnected
-		 * ones), and no walwriter, autovac launcher or bgwriter.  If we are
-		 * doing crash recovery or an immediate shutdown then we expect the
-		 * checkpointer to exit as well, otherwise not. The archiver, stats,
-		 * and syslogger processes are disregarded since they are not
-		 * connected to shared memory; we also disregard dead_end children
-		 * here. Walsenders are also disregarded, they will be terminated
-		 * later after writing the checkpoint record, like the archiver
-		 * process.
+		 * ones), and no walwriter, autovac launcher, bgwriter or bgreclaimer.
+		 * If we are doing crash recovery or an immediate shutdown then we
+		 * expect the checkpointer to exit as well, otherwise not. The
+		 * archiver, stats, and syslogger processes are disregarded since they
+		 * are not connected to shared memory; we also disregard dead_end
+		 * children here. Walsenders are also disregarded, they will be
+		 * terminated later after writing the checkpoint record, like the
+		 * archiver process.
 		 */
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
 			CountUnconnectedWorkers() == 0 &&
@@ -3388,7 +3426,8 @@ PostmasterStateMachine(void)
 			(CheckpointerPID == 0 ||
 			 (!FatalError && Shutdown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
-			AutoVacPID == 0)
+			AutoVacPID == 0 &&
+			BgReclaimerPID == 0)
 		{
 			if (Shutdown >= ImmediateShutdown || FatalError)
 			{
@@ -3486,6 +3525,7 @@ PostmasterStateMachine(void)
 			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
+			Assert(BgReclaimerPID == 0);
 			/* syslogger is not considered here */
 			pmState = PM_NO_CHILDREN;
 		}
@@ -3698,6 +3738,8 @@ TerminateChildren(int signal)
 		signal_child(WalReceiverPID, signal);
 	if (AutoVacPID != 0)
 		signal_child(AutoVacPID, signal);
+	if (BgReclaimerPID != 0)
+		signal_child(BgReclaimerPID, signal);
 	if (PgArchPID != 0)
 		signal_child(PgArchPID, signal);
 	if (PgStatPID != 0)
@@ -4778,6 +4820,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		CheckpointerPID = StartCheckpointer();
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(BgReclaimerPID == 0);
+		BgReclaimerPID = StartBackgroundReclaimer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -5122,6 +5166,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 						(errmsg("could not fork WAL receiver process: %m")));
 				break;
+			case BgReclaimerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork background writer process: %m")));
+				break;
 			default:
 				ereport(LOG,
 						(errmsg("could not fork process: %m")));
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..e671d75 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,10 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+* BufferStrategyControl contains a spinlock freelist_lck that provides mutual
+exclusion for operations that access the buffer freelist or select
+buffers for replacement.  It also contains victimbuf_lck that protects
+information related to the current clock sweep condition.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -160,16 +156,20 @@ Normal Buffer Replacement Strategy
 
 There is a "free list" of buffers that are prime candidates for replacement.
 In particular, buffers that are completely free (contain no valid page) are
-always in this list.  We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that.  The list is singly-linked using fields in the
+always in this list.  Allocating pages from this list is much cheaper than
+running the "clock sweep" algorithm, which may encounter many buffers
+that are poor candidates for eviction before finding a good candidate.
+Therefore, we have a background process called bgreclaimer which works
+to keep this list populated.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the freelist_lck, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations.  It works like
-this:
+need to take system-wide locks during common operations.  The background
+reclaimer attempts to keep regular backends from having to run clock sweep
+by maintaining buffers on freelist, however backends are also empowered
+to run clock sweep. Clock sweep works like this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned.  (This requires only the
@@ -178,25 +178,28 @@ buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+victimbuf_lck spinlock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain spinlock freelist_lck.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer and release
+the freelist_lck.  Now set the bgwriter or bgreclaimer latch if required.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
+3. If we get the buffer, check if it is neither pinned nor
+has nonzero usage count, pin the buffer, and return the buffer.
+Otherwise again try to get the buffer from freelist and return
+to the start of step 3.
+
+4. Otherwise, select the buffer pointed to by nextVictimBuffer, and
 circularly advance nextVictimBuffer for next time.
 
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
+5. If the selected buffer is pinned or has a nonzero usage count, it cannot
+be used.  Decrement its usage count (if nonzero) and return to step 4 to
 examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+6. Pin the selected buffer, and return the buffer.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +262,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take the spinlock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
@@ -281,3 +284,38 @@ As of 8.4, background writer starts during recovery mode when there is
 some form of potentially extended recovery to perform. It performs an
 identical service to normal processing, except that checkpoints it
 writes are technically restartpoints.
+
+
+Background Reclaimer's Processing
+---------------------------------
+
+The background reclaimer runs the clock sweep to identify buffers that
+are good candidates for eviction and puts them on the freelist.  This
+makes buffer allocation much faster, since removing a buffer from the
+head of a linked list is much cheaper than linearly scanning the whole
+buffer pool until a promising candidate is found.  It's possible that
+a buffer we add to the freelist may be accessed or even pinned before
+it's evicted; if that happens, the backend that would have evicted it
+will simply disregard it and take the next buffer instead (or run the
+clock sweep itself, if necessary).  However, to make sure that doesn't
+happen too often, we need to keep the freelist as short as possible,
+so that there won't be many other buffer accesses between when the
+time a buffer is added to the freelist and the time when it's actually
+evicted.
+
+We use two water marks to control the activity of the bgreclaimer
+process.  Each time bgreclaimer is awoken, it will move buffers to the
+freelist until the length of the free list reaches the high water
+mark.  It will then sleep.  When the number of buffers on the freelist
+reaches the low water mark, backends attempting to allocate new
+buffers will set the bgreclaimer's latch, waking it up again.  While
+it's important for the high water mark to be small (for the reasons
+described above), we also need to ensure adequate separation between
+the low and high water marks, so that the bgreclaimer isn't constantly
+being awoken to find just a handful of additional candidate buffers,
+and we need to ensure that the low watermark is adequate to keep the
+freelist from becoming completely empty before bgreclaimer has time to
+wake up and beginning filling it again.
+
+To execute clock sweep, bgreclaimer advances the strategy point
+(victim buffer) whereas bgwriter always scan ahead of strategy point.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3240432..559d393 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -889,15 +889,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		bool		lock_held;
-
 		/*
 		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!  Also (in most cases) the BufFreelistLock is
-		 * still held, since it would be bad to hold the spinlock while
-		 * possibly waking up other processes.
+		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &lock_held);
+		buf = StrategyGetBuffer(strategy);
 
 		Assert(buf->refcount == 0);
 
@@ -907,10 +903,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		/* Pin the buffer and then release the buffer spinlock */
 		PinBuffer_Locked(buf);
 
-		/* Now it's safe to release the freelist lock */
-		if (lock_held)
-			LWLockRelease(BufFreelistLock);
-
 		/*
 		 * If the buffer was dirty, try to write it out.  There is a race
 		 * condition here, in that someone might dirty it after we released it
@@ -1933,6 +1925,92 @@ BgBufferSync(void)
 }
 
 /*
+ * Move buffers with reference and a usage_count of zero to freelist. By
+ * maintaining enough buffers in the freelist (up to the list's high water
+ * mark), we drastically reduce the likelihood of individual backends
+ * having to perform the clock sweep themselves.
+ *
+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */
+void
+BgMoveBuffersToFreelist(void)
+{
+	uint32	num_needed_on_freelist = 0;
+	uint32	recent_alloc = 0;
+	uint32  recent_backend_clocksweep = 0;
+	volatile uint32	next_victim = 0;
+
+	/* Execute the clock sweep */
+	for (;;)
+	{
+		uint32	tmp_num_needed_on_freelist;
+		uint32	tmp_recent_alloc;
+		uint32  tmp_recent_backend_clocksweep;
+
+		StrategyGetFreelistAccessInfo(&tmp_num_needed_on_freelist,
+									  &tmp_recent_alloc,
+									  &tmp_recent_backend_clocksweep);
+
+		num_needed_on_freelist += tmp_num_needed_on_freelist;
+		recent_alloc += tmp_recent_alloc;
+		recent_backend_clocksweep += tmp_recent_backend_clocksweep;
+
+		if (tmp_num_needed_on_freelist == 0)
+			break;
+
+		while (tmp_num_needed_on_freelist > 0)
+		{
+			volatile BufferDesc *bufHdr;
+			bool	add_to_freelist = false;
+
+			/*
+			 * Choose next victim buffer to look if that can be moved
+			 * to freelist.
+			 */
+			StrategySyncNextVictimBuffer(&next_victim);
+
+			bufHdr = &BufferDescriptors[next_victim];
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot
+			 * move it to freelist; decrement the usage_count (unless pinned)
+			 * and keep scanning.
+			 */
+			LockBufHdr(bufHdr);
+			if (bufHdr->refcount == 0)
+			{
+				if (bufHdr->usage_count > 0)
+					bufHdr->usage_count--;
+				else
+					add_to_freelist = true;
+			}
+			UnlockBufHdr(bufHdr);
+
+			if (add_to_freelist && StrategyMoveBufferToFreeListTail(bufHdr))
+				tmp_num_needed_on_freelist--;
+		}
+
+		/*
+		 * Report buffer alloc and buffer request not satisfied
+		 * from freelist counts to pgstat.
+		 */
+		BgWriterStats.m_buf_alloc += recent_alloc;
+		BgWriterStats.m_buf_backend_clocksweep += recent_backend_clocksweep;
+
+		/*
+		 * Send off activity statistics to the stats collector
+		 */
+		pgstat_send_bgwriter();
+	}
+
+#ifdef BGW_DEBUG
+	elog(DEBUG1, "bgreclaimer: recent_alloc=%u recent_backend_clocksweep =%d next_victim=%d num_freed=%u",
+		 recent_alloc, recent_backend_clocksweep, next_victim, num_needed_on_freelist);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..e1d8445 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -37,19 +38,48 @@ typedef struct
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
-	 * overflow during a single bgwriter cycle.
+	 * overflow during a single bgwriter cycle.  completePasses is only
+	 * recorded by bgwriter, numBufferBackendClocksweep is only recorded
+	 * by bgreclaimer, however numBufferAllocs is recorded by both bgwriter
+	 * and bgreclaimer.
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
+	/* Buffers not statistied from freelist since last reset */
+	uint32		numBufferBackendClocksweep;
+
+	/*
+	 * protects freelist and related variables (firstFreeBuffer,
+	 * lastFreeBuffer, numBufferAllocs, numBufferBackendClocksweep,
+	 * numFreeListBuffers, BufferDesc->freeNext).
+	 */
+	slock_t	     freelist_lck;
+
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * Protects nextVictimBuffer and completePasses. We need separate
+	 * lock to protect victim buffer and completePasses so that
+	 * clock sweep of one backend doesn't contend with another backend
+	 * which is evicting buffer from freelist.  We can consider having
+	 * victimbuf_lck and freelist_lck in separate cache lines by keeping
+	 * them apart in structure and by adding padding bytes, however at
+	 * the moment there is no proof that having them in same cache line
+	 * hits the performance in any scenario.
+	 */
+	slock_t	     victimbuf_lck;
+
+	/*
+	 * Latch to wake bgwriter.
 	 */
 	Latch	   *bgwriterLatch;
+	/*
+	 * Latch to wake bgreclaimer.
+	 */
+	Latch	   *bgreclaimerLatch;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -84,8 +114,28 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[1];		/* VARIABLE SIZE ARRAY */
 }	BufferAccessStrategyData;
 
+/*
+ * Water mark indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakes up and begins
+ * adding buffers to the freelist until it reaches the high water mark and
+ * then again goes back to sleep.
+ */
+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/*
+ * Percentage indicators for maintaining buffers on freelist.
+ * High water mark is percentage of total number of buffers (NBuffers).
+ * and Low water mark is percentage of the high water mark.
+*/
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.2
+#define MIN_HIGH_WATER_MARK	5
+#define MAX_HIGH_WATER_MARK	2000
 
 /* Prototypes for internal functions */
+static volatile BufferDesc *GetBufferFromFreelist(BufferAccessStrategy strategy);
 static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
 static void AddBufferToRing(BufferAccessStrategy strategy,
 				volatile BufferDesc *buf);
@@ -101,103 +151,40 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
  *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
  *
  *	To ensure that no one else can pin the buffer before we do, we must
- *	return the buffer with the buffer header spinlock still held.  If
- *	*lock_held is set on exit, we have returned with the BufFreelistLock
- *	still held, as well; the caller must release that lock once the spinlock
- *	is dropped.  We do it that way because releasing the BufFreelistLock
- *	might awaken other processes, and it would be bad to do the associated
- *	kernel calls while holding the buffer header spinlock.
+ *	return the buffer with the buffer header spinlock still held.
  */
 volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
 {
-	volatile BufferDesc *buf;
-	Latch	   *bgwriterLatch;
+	volatile BufferDesc *buf = NULL;
 	int			trycounter;
 
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need the BufFreelistLock.
+	 * assume strategy objects don't need the freelist_lck.
 	 */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy);
 		if (buf != NULL)
-		{
-			*lock_held = false;
 			return buf;
-		}
 	}
 
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
-	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
-	 */
-	StrategyControl->numBufferAllocs++;
-
-	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
-	 */
-	bgwriterLatch = StrategyControl->bgwriterLatch;
-	if (bgwriterLatch)
-	{
-		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
-
-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
-	{
-		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
-		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
-		/* Unconditionally remove buffer from freelist */
-		StrategyControl->firstFreeBuffer = buf->freeNext;
-		buf->freeNext = FREENEXT_NOT_IN_LIST;
-
-		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
-		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
-		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
-		}
-		UnlockBufHdr(buf);
-	}
+	/* Nope, so get the buffer from freelist */
+	buf = GetBufferFromFreelist(strategy);
+	if (buf != NULL)
+		return buf;
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
 	for (;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
+		volatile uint32	next_victim;
 
-		if (++StrategyControl->nextVictimBuffer >= NBuffers)
-		{
-			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
-		}
+		StrategySyncNextVictimBuffer(&next_victim);
+
+		buf = &BufferDescriptors[next_victim];
 
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -241,7 +228,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,12 +240,51 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
 /*
+ * StrategyMoveBufferToFreeListTail: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListTail(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
+}
+
+
+/*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
  * The result is the buffer index of the best buffer to sync first.
@@ -274,20 +300,73 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
 	result = StrategyControl->nextVictimBuffer;
+
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 	if (num_buf_alloc)
 	{
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
+		SpinLockRelease(&StrategyControl->freelist_lck);
 	}
-	LWLockRelease(BufFreelistLock);
 	return result;
 }
 
 /*
+ * StrategyGetFreelistAccessInfo -- get information required by bgreclaimer
+ * to move unused buffers to freelist.
+ *
+ * The result is the number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs and buffer allocs not
+ * satisfied from freelist.
+ */
+void
+StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
+							  uint32 *num_buf_backend_clocksweep)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighWaterMark)
+		*num_buf_to_free = freelistHighWaterMark - curfreebuffers;
+	else
+		*num_buf_to_free = 0;
+
+	*num_buf_alloc = StrategyControl->numBufferAllocs;
+	StrategyControl->numBufferAllocs = 0;
+
+	*num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
+	StrategyControl->numBufferBackendClocksweep = 0;
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell bgreclaimer where to start looking
+ * for next unused buffer.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	*next_victim_buffer = StrategyControl->nextVictimBuffer;
+	if (++StrategyControl->nextVictimBuffer >= NBuffers)
+	{
+		StrategyControl->nextVictimBuffer = 0;
+		StrategyControl->completePasses++;
+	}
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -299,15 +378,27 @@ void
 StrategyNotifyBgWriter(Latch *bgwriterLatch)
 {
 	/*
-	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * We acquire the freelist_lck just to ensure that the store appears
 	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
 	 * infrequently, so there's no performance penalty from being safe.
 	 */
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 	StrategyControl->bgwriterLatch = bgwriterLatch;
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
+/*
+ * StrategyInitBgReclaimerLatch -- Initialize bgreclaimer latch.
+ * This will be used by bgreclaimer to wake itself when backend
+ * sets this latch.
+ */
+void
+StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch)
+{
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	StrategyControl->bgreclaimerLatch = bgreclaimerLatch;
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
 
 /*
  * StrategyShmemSize
@@ -376,6 +467,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -383,12 +475,33 @@ StrategyInitialize(bool init)
 		/* Clear statistics */
 		StrategyControl->completePasses = 0;
 		StrategyControl->numBufferAllocs = 0;
+		StrategyControl->numBufferBackendClocksweep = 0;
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->bgreclaimerLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
+		SpinLockInit(&StrategyControl->victimbuf_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high water mark number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.  We need to maintain enough buffers so that
+	 * requests can be satisfied from freelist.  These numbers
+	 * are based on results of benchmarks at various workloads.
+	 */
+	freelistHighWaterMark = HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT * NBuffers;
+	if (freelistHighWaterMark < MIN_HIGH_WATER_MARK)
+		freelistHighWaterMark = MIN_HIGH_WATER_MARK;
+	else if (freelistHighWaterMark > MAX_HIGH_WATER_MARK)
+		freelistHighWaterMark = MAX_HIGH_WATER_MARK;
+
+	freelistLowWaterMark = LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT *
+						   freelistHighWaterMark;
 }
 
 
@@ -467,6 +580,118 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
 }
 
 /*
+ * GetBufferFromFreelist -- returns a buffer from the freelist, or NULL if the
+ *		freelist is empty.
+ *
+ * The bufhdr spin lock is held on the returned buffer.
+ */
+static volatile BufferDesc *
+GetBufferFromFreelist(BufferAccessStrategy strategy)
+{
+	volatile BufferDesc *buf = NULL;
+	Latch	   *bgwriterLatch;
+	Latch	   *bgreclaimerLatch;
+	int			numFreeListBuffers;
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * We count buffer allocation requests so that the bgwriter or bgreclaimer
+	 * can know the rate of buffer consumption and report it as stats.  Note
+	 * that buffers recycled by a strategy object are intentionally not counted
+	 * here.
+	 */
+	StrategyControl->numBufferAllocs++;
+
+	/*
+	 * Remember the values of bgwriter and bgreclaimer latch so that they can
+	 * be set outside spin lock and try to get a buffer from the freelist.
+	 */
+	bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
+	bgwriterLatch = StrategyControl->bgwriterLatch;
+	if (bgwriterLatch)
+		StrategyControl->bgwriterLatch = NULL;
+
+	numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+	if (StrategyControl->firstFreeBuffer >= 0)
+	{
+		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+		/* Unconditionally remove buffer from freelist */
+		StrategyControl->firstFreeBuffer = buf->freeNext;
+		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		--StrategyControl->numFreeListBuffers;
+	}
+	else
+		StrategyControl->numBufferBackendClocksweep++;
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	/* If bgwriterLatch is set, we need to waken the bgwriter */
+	if (bgwriterLatch)
+		SetLatch(bgwriterLatch);
+
+	/*
+	 * If the number of free buffers has fallen below the low water mark,
+	 * awaken the bgreclaimer to repopulate it.  bgreclaimerLatch is initialized in
+	 * early phase of BgReclaimer startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
+	 */
+	if (numFreeListBuffers < freelistLowWaterMark  && bgreclaimerLatch)
+		SetLatch(StrategyControl->bgreclaimerLatch);
+
+	if (buf != NULL)
+	{
+		/*
+		 * Try to get a buffer from the freelist.  Note that the freeNext fields
+		 * are considered to be protected by the freelist_lck not the
+		 * individual buffer spinlocks, so it's OK to manipulate them without
+		 * holding the buffer spinlock.
+		 */
+		for(;;)
+		{
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+			SpinLockAcquire(&StrategyControl->freelist_lck);
+
+			if (StrategyControl->firstFreeBuffer >= 0)
+			{
+				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+				Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+				/* Unconditionally remove buffer from freelist */
+				StrategyControl->firstFreeBuffer = buf->freeNext;
+				buf->freeNext = FREENEXT_NOT_IN_LIST;
+				--StrategyControl->numFreeListBuffers;
+
+				SpinLockRelease(&StrategyControl->freelist_lck);
+			}
+			else
+			{
+				StrategyControl->numBufferBackendClocksweep++;
+				SpinLockRelease(&StrategyControl->freelist_lck);
+				break;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+/*
  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
  *		ring is empty.
  *
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 44ccd37..00d815f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -100,6 +100,7 @@ extern Datum pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS);
 
 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_xact_tuples_returned(PG_FUNCTION_ARGS);
@@ -1496,6 +1497,12 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 }
 
 Datum
+pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_backend_clocksweep);
+}
+
+Datum
 pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index e66430d..b7efb3d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -885,7 +885,7 @@ DATA(insert OID = 2334 (  array_agg_finalfn   PGNSP PGUID 12 1 0 0 0 f f f f f f
 DESCR("aggregate final function");
 DATA(insert OID = 2335 (  array_agg		   PGNSP PGUID 12 1 0 0 0 t f f f f f i 1 0 2277 "2283" _null_ _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
 DESCR("concatenate aggregate input into an array");
-DATA(insert OID = 3218 ( width_bucket	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "2283 2277" _null_ _null_ _null_ _null_ width_bucket_array _null_ _null_ _null_ ));
+DATA(insert OID = 3154 ( width_bucket	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "2283 2277" _null_ _null_ _null_ _null_ width_bucket_array _null_ _null_ _null_ ));
 DESCR("bucket number of operand given a sorted array of bucket lower bounds");
 DATA(insert OID = 3816 (  array_typanalyze PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 16 "2281" _null_ _null_ _null_ _null_ array_typanalyze _null_ _null_ _null_ ));
 DESCR("array typanalyze");
@@ -2783,6 +2783,8 @@ DATA(insert OID = 3063 ( pg_stat_get_buf_fsync_backend PGNSP PGUID 12 1 0 0 0 f
 DESCR("statistics: number of backend buffer writes that did their own fsync");
 DATA(insert OID = 2859 ( pg_stat_get_buf_alloc			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_alloc _null_ _null_ _null_ ));
 DESCR("statistics: number of buffer allocations");
+DATA(insert OID = 3218 ( pg_stat_get_buf_clocksweep_backend			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_clocksweep_backend _null_ _null_ _null_ ));
+DESCR("statistics: number of buffer allocations not satisfied from freelsit");
 
 DATA(insert OID = 2978 (  pg_stat_get_function_calls		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_function_calls _null_ _null_ _null_ ));
 DESCR("statistics: number of function calls");
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 3807955..8e58fb4 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -366,6 +366,7 @@ typedef enum
 	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
+	BgReclaimerProcess,
 
 	NUM_AUXPROCTYPES			/* Must be last! */
 } AuxProcType;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0892533..51a2023 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -397,6 +397,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_backend;
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
+	PgStat_Counter m_buf_backend_clocksweep;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -545,7 +546,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9D
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -670,6 +671,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	PgStat_Counter buf_backend_clocksweep;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/postmaster/bgreclaimer.h b/src/include/postmaster/bgreclaimer.h
new file mode 100644
index 0000000..bbd6943
--- /dev/null
+++ b/src/include/postmaster/bgreclaimer.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.h
+ *	  POSTGRES buffer reclaimer definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgreclaimer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGRECLAIMER_H
+#define _BGRECLAIMER_H
+
+extern void BackgroundReclaimerMain(void) __attribute__((noreturn));
+
+
+#endif   /* _BGRECLAIMER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..b57d95a 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -115,9 +115,8 @@ typedef struct buftag
  * Note: buf_hdr_lock must be held to examine or change the tag, flags,
  * usage_count, refcount, or wait_backend_pid fields.  buf_id field never
  * changes after initialization, so does not need locking.  freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock.  The LWLocks can take
- * care of themselves.  The buf_hdr_lock is *not* used to control access to
- * the data in the buffer!
+ * protected by the freelist_lck not buf_hdr_lock.  The buf_hdr_lock is
+ * *not* used to control access to the data in the buffer!
  *
  * An exception is that if we have the buffer pinned, its tag can't change
  * underneath us, so we can examine the tag without locking the spinlock.
@@ -185,14 +184,19 @@ extern BufferDesc *LocalBufferDescriptors;
  */
 
 /* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-				  bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListTail(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free,
+										  uint32 *num_buf_alloc,
+										  uint32 *num_buf_backend_clocksweep);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgReclaimerLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 42d9120..da4f837 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -200,6 +200,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..754a838 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -89,7 +89,6 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  * if you remove a lock, consider leaving a gap in the numbering sequence for
  * the benefit of DTrace and other external debugging scripts.
  */
-#define BufFreelistLock				(&MainLWLockArray[0].lock)
 #define ShmemIndexLock				(&MainLWLockArray[1].lock)
 #define OidGenLock					(&MainLWLockArray[2].lock)
 #define XidGenLock					(&MainLWLockArray[3].lock)
@@ -136,7 +135,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..b0688a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, Background reclaimer, checkpointer and WAL writer run
+ * during normal operation.  Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ca56b47..939075e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1671,6 +1671,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
     pg_stat_get_buf_written_backend() AS buffers_backend,
     pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
     pg_stat_get_buf_alloc() AS buffers_alloc,
+    pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 pg_stat_database| SELECT d.oid AS datid,
     d.datname,

#65

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Gregory Smith (#63)

Re: Scaling shared buffer eviction

On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith <gregsmithpgsql@gmail.com>
wrote:

This looks like it's squashed one of the very fundamental buffer
scaling issues though; well done Amit.

Thanks.

I'll go back to my notes and try to recreate the pathological cases
that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync
spreading effort I did; get those running again and see how they
do on this new approach. I have a decent sized 24 core server
that should be good enough for this job. I'll see what I can do.

It will be really helpful if you can try out those cases.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#66

Mark Kirkwood

mark.kirkwood@catalyst.net.nz

over 11 years ago

In reply to: Amit Kapila (#65)

Re: Scaling shared buffer eviction

On 14/09/14 19:00, Amit Kapila wrote:

On Fri, Sep 12, 2014 at 11:09 PM, Gregory Smith
<gregsmithpgsql@gmail.com <mailto:gregsmithpgsql@gmail.com>> wrote:

This looks like it's squashed one of the very fundamental buffer
scaling issues though; well done Amit.

Thanks.

I'll go back to my notes and try to recreate the pathological cases
that plagued both the 8.3 BGW rewrite and the aborted 9.2 fsync
spreading effort I did; get those running again and see how they
do on this new approach. I have a decent sized 24 core server
that should be good enough for this job. I'll see what I can do.

It will be really helpful if you can try out those cases.

And if you want 'em run on the 60 core beast, just let me know the
details and I'll do some runs for you.

Cheers

Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#67

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#64)

1 attachment(s)

Re: Scaling shared buffer eviction

On Sun, Sep 14, 2014 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Sep 12, 2014 at 11:55 AM, Amit Chapel <amit.kapila16@gmail.com>
wrote:

On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <andres@2ndquadrant.com>

wrote:

On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:

I will post the data with the latest patch separately (where I will focus
on new cases discussed between Robert and Andres).

Performance Data with latest version of patch.
All the data shown below is a median of 3 runs, for each
individual run data, refer attached document
(perf_read_scalability_data_v9.ods)

Performance Data for Read-only test
-----------------------------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

All the data is in tps and taken using pgbench read-only load

Client_Count/Patch_Ver 8 16 32 64 128 HEAD 58614 107370 140717 104357
65010 sbe_v9 62943 119064 172246 220174 220904

Observations
---------------------
1. It scales well as with previous versions of patch, but
it seems the performance is slightly better in few cases,
may be because I have removed a statement (if check)
or 2 in bgreclaimer (those were done under spinlock) or it
could be just run-to-run difference.

(1) A read-only pgbench workload that is just a tiny bit larger than
shared_buffers, say size of shared_buffers plus 0.01%. Such workloads
tend to stress buffer eviction heavily.

When the data is just tiny bit larger than shared buffers, actually
there is no problem in scalability even in HEAD, because I think
most of the requests will be satisfied from existing buffer pool.
I have taken data for some of the loads where database size is
bit larger than shared buffers and it is as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)

Client_Count/Patch_Ver 1 8 16 32 64 128 HEAD 8406 68712 132222 198481
290340 289828 sbe_v9 8504 68546 131926 195789 289959 289021

Scale Factor - 800
Shared_Buffers - 12166MB (Total db size is 12288MB)

Client_Count/Patch_Ver 1 8 16 32 64 128 HEAD 8428 68609 128092 196596
292066 293812 sbe_v9 8386 68546 126926 197126 289959 287621

Observations
---------------------
In most cases performance with patch is slightly less as compare
to HEAD and the difference is generally less than 1% and in a case
or 2 close to 2%. I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required and it also leads to condition you mentioned
down (freelist will contain buffers that have already been touched since
we added them). Now for case 2 (12166MB), we need buffers more
than 2000 additional buffers, but not too many, so it can also have
similar effect.

I think we have below options related to this observation
a. Some further tuning in bgreclaimer, so that instead of putting
the buffers up to high water mark in freelist, it puts just 1/4th or
1/2 of high water mark and then check if the free list still contains
lesser than equal to low water mark, if yes it continues and if not
then it can wait (or may be some other way).
b. Instead of waking bgreclaimer when the number of buffers fall
below low water mark, wake when the number of times backends
does clock sweep crosses certain threshold
c. Give low and high water mark as config knobs, so that in some
rare cases users can use them to do tuning.
d. Lets not do anything as if user does such a configuration, he should
be educated to configure shared buffers in a better way and or the
performance hit doesn't seem to be justified to do any further
work.

Now if we do either of 'a' or 'b', then I think there is a chance
that the gain might not be same for cases where users can
easily get benefit from this patch and there is a chance that
it degrades the performance in some other case.

(2) A workload that maximizes the rate of concurrent buffer eviction
relative to other tasks. Read-only pgbench is not bad for this, but
maybe somebody's got a better idea.

I think the first test of pgbench (scale_factor-3000;shared_buffers-8GB)
addresses this case.

As I sort of mentioned in what I was writing for the bufmgr README,
there are, more or less, three ways this can fall down, at least that
I can see: (1) if the high water mark is too high, then we'll start
finding buffers in the freelist that have already been touched since
we added them:

I think I am able to see this effect (though mild) in one of above tests.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#68

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#67)

Re: Scaling shared buffer eviction

On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

In most cases performance with patch is slightly less as compare
to HEAD and the difference is generally less than 1% and in a case
or 2 close to 2%. I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required and it also leads to condition you mentioned
down (freelist will contain buffers that have already been touched since
we added them). Now for case 2 (12166MB), we need buffers more
than 2000 additional buffers, but not too many, so it can also have
similar effect.

So there are two suboptimal things that can happen and they pull in
opposite directions. I think you should instrument the server how often
each is happening. #1 is that we can pop a buffer from the freelist and
find that it's been touched. That means we wasted the effort of putting it
on the freelist in the first place. #2 is that we can want to pop a buffer
from the freelist and find it empty and thus be forced to run the clock
sweep ourselves. If we're having problem #1, we could improve things by
reducing the water marks. If we're having problem #2, we could improve
things by increasing the water marks. If we're having both problems, then
I dunno. But let's get some numbers on the frequency of these specific
things, rather than just overall tps numbers.

I think we have below options related to this observation
a. Some further tuning in bgreclaimer, so that instead of putting
the buffers up to high water mark in freelist, it puts just 1/4th or
1/2 of high water mark and then check if the free list still contains
lesser than equal to low water mark, if yes it continues and if not
then it can wait (or may be some other way).

That sounds suspiciously like just reducing the high water mark.

b. Instead of waking bgreclaimer when the number of buffers fall
below low water mark, wake when the number of times backends
does clock sweep crosses certain threshold

That doesn't sound helpful.

c. Give low and high water mark as config knobs, so that in some
rare cases users can use them to do tuning.

Yuck.

d. Lets not do anything as if user does such a configuration, he should
be educated to configure shared buffers in a better way and or the
performance hit doesn't seem to be justified to do any further
work.

At least worth entertaining.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#69

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#68)

2 attachment(s)

Re: Scaling shared buffer eviction

On Tue, Sep 16, 2014 at 10:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 16, 2014 at 8:18 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

In most cases performance with patch is slightly less as compare
to HEAD and the difference is generally less than 1% and in a case
or 2 close to 2%. I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required and it also leads to condition you mentioned
down (freelist will contain buffers that have already been touched since
we added them). Now for case 2 (12166MB), we need buffers more
than 2000 additional buffers, but not too many, so it can also have
similar effect.

So there are two suboptimal things that can happen and they pull in
opposite directions. I think you should instrument the server how often
each is happening. #1 is that we can pop a buffer from the freelist and
find that it's been touched. That means we wasted the effort of putting it
on the freelist in the first place. #2 is that we can want to pop a buffer
from the freelist and find it empty and thus be forced to run the clock
sweep ourselves. If we're having problem #1, we could improve things by
reducing the water marks. If we're having problem #2, we could improve
things by increasing the water marks. If we're having both problems, then
I dunno. But let's get some numbers on the frequency of these specific
things, rather than just overall tps numbers.

Specific numbers of both the configurations for which I have
posted data in previous mail are as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)
Client and Thread Count = 64
buffers_touched_freelist - count of buffers that backends found touched
after
popping from freelist.
buffers_backend_clocksweep - count of buffer allocations not satisfied from
freelist

buffers_alloc 1531023 buffers_backend_clocksweep 0
buffers_touched_freelist 0

Scale Factor - 800
Shared_Buffers - 12166MB (Total db size is 12288MB)
Client and Thread Count = 64

buffers_alloc 1531010 buffers_backend_clocksweep 0
buffers_touched_freelist 0

In both the above cases, I have taken data multiple times to ensure
correctness. From the above data, it is evident that in both the above
configurations all the requests are satisfied from the initial freelist.
Basically the amount of shared buffers configured
(12286MB = 1572608 buffers and 12166MB = 1557248 buffers) are
sufficient to contain all the work load for pgbench run.

So now the question is why we are seeing small variation (<1%) in data
in case all the data fits in shared buffers and the reason could be that
we have added few extra instructions (due to increase in StrategyControl
structure size, additional function call, one or two new assignments) in the
Buffer Allocation path (the extra instructions will also be only till all
the data
pages gets associated with buffers, after that the control won't even reach
StrategyGetBuffer()) or it may be due to variation across different runs
with
different binaries.

I have went ahead to take the data in cases shared buffers are tiny bit
(0.1%
and .05%) less than workload (based on buffer allocations done in above
cases).

Performance Data
-------------------------------

Scale Factor - 800
Shared_Buffers - 11950MB

Client_Count/Patch_Ver 8 16 32 64 128 HEAD 68424 132540 195496 279511
283280 sbe_v9 68565 132709 194631 284351 289333

Scale Factor - 800
Shared_Buffers - 11955MB

Client_Count/Patch_Ver 8 16 32 64 128 HEAD 68331 127752 196385 274387
281753 sbe_v9 68922 131314 194452 284292 287221

The above data indicates that performance is better with patch
in almost all cases and especially at high concurrency (64 and
128 client count).

The overall conclusion is that with patch
a. when the data can fit in RAM and not completely in shared buffers,
the performance/scalability is quite good even if shared buffers are just
tiny bit less that all the data.
b. when shared buffers are sufficient to contain all the data, then there is
a slight difference (<1%) in performance.

d. Lets not do anything as if user does such a configuration, he should
be educated to configure shared buffers in a better way and or the
performance hit doesn't seem to be justified to do any further
work.

At least worth entertaining.

Based on further analysis, I think this is the way to go.

Attached find the patch for new stat (buffers_touched_freelist) just in
case you want to run the patch with it and detailed (individual run)
performance data.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v9_stats.patchapplication/octet-stream; name=scalable_buffer_eviction_v9_stats.patchDownload

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 97c23c9..2bc3e3a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -713,6 +713,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
+        pg_stat_get_buf_touched_freelist() AS buffers_touched_freelist,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7475e5a..6493f19 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5022,6 +5022,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
 	globalStats.buf_backend_clocksweep += msg->m_buf_backend_clocksweep;
+	globalStats.buf_touched_freelist += msg->m_buf_touched_freelist;
 }
 
 /* ----------
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e1d8445..fc21865 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "pgstat.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 
@@ -49,6 +50,8 @@ typedef struct
 	/* Buffers not statistied from freelist since last reset */
 	uint32		numBufferBackendClocksweep;
 
+	uint32		numBufferTouchedFreelist;
+
 	/*
 	 * protects freelist and related variables (firstFreeBuffer,
 	 * lastFreeBuffer, numBufferAllocs, numBufferBackendClocksweep,
@@ -344,6 +347,9 @@ StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
 	*num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
 	StrategyControl->numBufferBackendClocksweep = 0;
 
+	BgWriterStats.m_buf_touched_freelist = StrategyControl->numBufferTouchedFreelist;
+	StrategyControl->numBufferTouchedFreelist = 0;
+
 	SpinLockRelease(&StrategyControl->freelist_lck);
 
 	return;
@@ -476,6 +482,7 @@ StrategyInitialize(bool init)
 		StrategyControl->completePasses = 0;
 		StrategyControl->numBufferAllocs = 0;
 		StrategyControl->numBufferBackendClocksweep = 0;
+		StrategyControl->numBufferTouchedFreelist = 0;
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
@@ -667,6 +674,9 @@ GetBufferFromFreelist(BufferAccessStrategy strategy)
 
 			SpinLockAcquire(&StrategyControl->freelist_lck);
 
+			/* Buffer selected from freelist is already in use */
+			StrategyControl->numBufferTouchedFreelist++;
+
 			if (StrategyControl->firstFreeBuffer >= 0)
 			{
 				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 00d815f..06a9a6c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -101,6 +101,7 @@ extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_touched_freelist(PG_FUNCTION_ARGS);
 
 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_xact_tuples_returned(PG_FUNCTION_ARGS);
@@ -1503,6 +1504,12 @@ pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS)
 }
 
 Datum
+pg_stat_get_buf_touched_freelist(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_touched_freelist);
+}
+
+Datum
 pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index b7efb3d..194e888 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2785,6 +2785,8 @@ DATA(insert OID = 2859 ( pg_stat_get_buf_alloc			PGNSP PGUID 12 1 0 0 0 f f f f
 DESCR("statistics: number of buffer allocations");
 DATA(insert OID = 3218 ( pg_stat_get_buf_clocksweep_backend			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_clocksweep_backend _null_ _null_ _null_ ));
 DESCR("statistics: number of buffer allocations not satisfied from freelsit");
+DATA(insert OID = 3156 ( pg_stat_get_buf_touched_freelist			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_touched_freelist _null_ _null_ _null_ ));
+DESCR("statistics: number of buffers in freelist that are touched");
 
 DATA(insert OID = 2978 (  pg_stat_get_function_calls		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_function_calls _null_ _null_ _null_ ));
 DESCR("statistics: number of function calls");
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 51a2023..c1c5c73 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -398,6 +398,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
 	PgStat_Counter m_buf_backend_clocksweep;
+	PgStat_Counter m_buf_touched_freelist;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -672,6 +673,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
 	PgStat_Counter buf_backend_clocksweep;
+	PgStat_Counter buf_touched_freelist;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;

perf_read_scalability_data_v9.odsapplication/vnd.oasis.opendocument.spreadsheet; name=perf_read_scalability_data_v9.odsDownload

#70

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Amit Kapila (#64)

1 attachment(s)

Re: Scaling shared buffer eviction

On Sun, Sep 14, 2014 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Apart from this, I have changed kid of newly added function as
due to recent commit, the oid I was using is no longer available.

After sending last patch I have realized that oid used in patch for
new function is wrong; fixed the same in attached patch.

I have fixed all the comments raised till now for patch and
provided the latest performance data as well, so I will mark it
as "Needs Review" to proceed further.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

scalable_buffer_eviction_v10.patchapplication/octet-stream; name=scalable_buffer_eviction_v10.patchDownload

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 354165b..fb67cc9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -815,6 +815,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Number of buffers allocated</entry>
      </row>
      <row>
+      <entry><structfield>buffers_backend_clocksweep</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of buffer allocations that are not satisfied from
+      freelist</entry>
+     </row>
+     <row>
       <entry><structfield>stats_reset</></entry>
       <entry><type>timestamp with time zone</type></entry>
       <entry>Time at which these statistics were last reset</entry>
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..38698b0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "postmaster/bgreclaimer.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -179,7 +180,8 @@ static IndexList *ILHead = NULL;
  *	 AuxiliaryProcessMain
  *
  *	 The main entry point for auxiliary processes, such as the bgwriter,
- *	 walwriter, walreceiver, bootstrapper and the shared memory checker code.
+ *	 walwriter, walreceiver, bgreclaimer, bootstrapper and the shared
+ *	 memory checker code.
  *
  *	 This code is here just because of historical reasons.
  */
@@ -323,6 +325,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case WalReceiverProcess:
 				statmsg = "wal receiver process";
 				break;
+			case BgReclaimerProcess:
+				statmsg = "reclaimer process";
+				break;
 			default:
 				statmsg = "??? process";
 				break;
@@ -437,6 +442,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			WalReceiverMain();
 			proc_exit(1);		/* should never return */
 
+		case BgReclaimerProcess:
+			/* don't set signals, bgreclaimer has its own agenda */
+			BackgroundReclaimerMain();
+			proc_exit(1);		/* should never return */
+
 		default:
 			elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
 			proc_exit(1);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f62ed2e..095f221 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -742,6 +742,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_written_backend() AS buffers_backend,
         pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
         pg_stat_get_buf_alloc() AS buffers_alloc,
+        pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..168d0d8 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o bgreclaimer.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..1c64900
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c
@@ -0,0 +1,306 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they only need to do if they don't find the next candidate buffer from
+ * the freelist).  In the best scenario all requests for shared buffers will
+ * be fulfilled from freelist as the background reclaimer process always tries
+ * to maintain buffers on freelist.  However, regular backends are still
+ * empowered to run clock sweep to find a usable buffer if the bgreclaimer
+ * fails to maintain enough buffers on freelist.
+ *
+ * The bgreclaimer is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGTERM, which instructs the bgreclaimer to exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the bgreclaimer will
+ * simply abort and exit on SIGQUIT.
+ *
+ * If the bgreclaimer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/bgreclaimer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgreclaimer.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+
+static void ReclaimQuickDieHandler(SIGNAL_ARGS);
+static void ReclaimSigHupHandler(SIGNAL_ARGS);
+static void ReclaimShutdownHandler(SIGNAL_ARGS);
+static void ReclaimSigUsr1Handler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for bgreclaim process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+BackgroundReclaimerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext bgreclaim_context;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.  (bgreclaim probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us.
+	 *
+	 * bgreclaim doesn't participate in ProcSignal signalling, but a SIGUSR1
+	 * handler is still needed for latch wakeups.
+	 */
+	pqsignal(SIGHUP, ReclaimSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, ReclaimShutdownHandler);		/* shutdown */
+	pqsignal(SIGQUIT, ReclaimQuickDieHandler);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, ReclaimSigUsr1Handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  As of now, the memory allocation can be done
+	 * only during processing of SIGHUP signal.
+	 */
+	bgreclaim_context = AllocSetContextCreate(TopMemoryContext,
+											 "Background Reclaim",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(bgreclaim_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().  We don't have very many resources to worry
+		 * about in bgreclaim, but we do have buffers and file descriptors.
+		 * Currently we don't use LWLocks in bgreclaimer, however it can be
+		 * added in future in bgreclaimer or in config processing path and there
+		 * is no saving from not doing so.
+		 */
+		LWLockReleaseAll();
+		UnlockBuffers();
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(bgreclaim_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+
+		/*
+		 * Sleep at least 1 second after any error.  We don't want to be
+		 * filling the error logs as fast as we can.
+		 */
+		pg_usleep(1000000L);
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	StrategyInitBgReclaimerLatch(&MyProc->procLatch);
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		int			rc;
+
+		/* Clear any already-pending wakeups */
+		ResetLatch(&MyProc->procLatch);
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Normal exit from the bgreclaimer is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Backend will signal bgreclaimer when the number of buffers in
+		 * freelist falls below than low water mark of freelist.
+		 */
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+					   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgMoveBuffersToFreelist();
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			exit(1);
+	}
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * ReclaimQuickDieHandler() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+ReclaimQuickDieHandler(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+ReclaimSigHupHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_SIGHUP = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGTERM: set flag to shutdown and exit */
+static void
+ReclaimShutdownHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	shutdown_requested = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+ReclaimSigUsr1Handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	latch_sigusr1_handler();
+
+	errno = save_errno;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7f41a5..7475e5a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5021,6 +5021,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.buf_backend_clocksweep += msg->m_buf_backend_clocksweep;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 14535c8..565cf4b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -143,13 +143,13 @@
  * authorization phase).  This is used mainly to keep track of how many
  * children we have and send them appropriate signals when necessary.
  *
- * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list.  Autovacuum worker and walsender are in it.
- * Also, "dead_end" children are in it: these are children launched just for
- * the purpose of sending a friendly rejection message to a would-be client.
- * We must track them because they are attached to shared memory, but we know
- * they will never become live backends.  dead_end children are not assigned a
- * PMChildSlot.
+ * "Special" children such as the startup, bgwriter, bgreclaimer and
+ * autovacuum launcher tasks are not in this list.  Autovacuum worker and
+ * walsender are in it.  Also, "dead_end" children are in it: these are
+ * children launched just for the purpose of sending a friendly rejection
+ * message to a would-be client.  We must track them because they are attached
+ * to shared memory, but we know they will never become live backends.
+ * dead_end children are not assigned a PMChildSlot.
  *
  * Background workers that request shared memory access during registration are
  * in this list, too.
@@ -243,7 +243,8 @@ static pid_t StartupPID = 0,
 			AutoVacPID = 0,
 			PgArchPID = 0,
 			PgStatPID = 0,
-			SysLoggerPID = 0;
+			SysLoggerPID = 0,
+			BgReclaimerPID = 0;
 
 /* Startup/shutdown state */
 #define			NoShutdown		0
@@ -269,13 +270,13 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * hot standby during archive recovery.
  *
  * When the startup process is ready to start archive recovery, it signals the
- * postmaster, and we switch to PM_RECOVERY state. The background writer and
- * checkpointer are launched, while the startup process continues applying WAL.
- * If Hot Standby is enabled, then, after reaching a consistent point in WAL
- * redo, startup process signals us again, and we switch to PM_HOT_STANDBY
- * state and begin accepting connections to perform read-only queries.  When
- * archive recovery is finished, the startup process exits with exit code 0
- * and we switch to PM_RUN state.
+ * postmaster, and we switch to PM_RECOVERY state. The background writer,
+ * background reclaimer and checkpointer are launched, while the startup
+ * process continues applying WAL.  If Hot Standby is enabled, then, after
+ * reaching a consistent point in WAL redo, startup process signals us again,
+ * and we switch to PM_HOT_STANDBY state and begin accepting connections to
+ * perform read-only queries.  When archive recovery is finished, the startup
+ * process exits with exit code 0 and we switch to PM_RUN state.
  *
  * Normal child backends can only be launched when we are in PM_RUN or
  * PM_HOT_STANDBY state.  (We also allow launch of normal
@@ -505,6 +506,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
+#define StartBackgroundReclaimer() StartChildProcess(BgReclaimerProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -568,8 +570,8 @@ PostmasterMain(int argc, char *argv[])
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-	 * postmaster/syslogger.c, postmaster/bgworker.c and
-	 * postmaster/checkpointer.c.
+	 * postmaster/syslogger.c, postmaster/bgworker.c, postmaster/bgreclaimer.c
+	 * and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1583,7 +1585,8 @@ ServerLoop(void)
 		/*
 		 * If no background writer process is running, and we are not in a
 		 * state that prevents it, start one.  It doesn't matter if this
-		 * fails, we'll just try again later.  Likewise for the checkpointer.
+		 * fails, we'll just try again later.  Likewise for the checkpointer
+		 * and bgreclaimer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 			pmState == PM_HOT_STANDBY)
@@ -1592,6 +1595,8 @@ ServerLoop(void)
 				CheckpointerPID = StartCheckpointer();
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 		}
 
 		/*
@@ -2330,6 +2335,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(SysLoggerPID, SIGHUP);
 		if (PgStatPID != 0)
 			signal_child(PgStatPID, SIGHUP);
+		if (BgReclaimerPID != 0)
+			signal_child(BgReclaimerPID, SIGHUP);
 
 		/* Reload authentication config files too */
 		if (!load_hba())
@@ -2398,6 +2405,9 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
+				/* and the bgreclaimer too */
+				if (BgReclaimerPID != 0)
+					signal_child(BgReclaimerPID, SIGTERM);
 
 				/*
 				 * If we're in recovery, we can't kill the startup process
@@ -2440,14 +2450,16 @@ pmdie(SIGNAL_ARGS)
 				signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
+			if (BgReclaimerPID != 0)
+				signal_child(BgReclaimerPID, SIGTERM);
 			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
 				/*
-				 * Only startup, bgwriter, walreceiver, unconnected bgworkers,
-				 * and/or checkpointer should be active in this state; we just
-				 * signaled the first four, and we don't want to kill
-				 * checkpointer yet.
+				 * Only startup, bgwriter, walreceiver, bgreclaimer,
+				 * unconnected bgworkers, and/or checkpointer should be
+				 * active in this state; we just signaled the first five,
+				 * and we don't want to kill checkpointer yet.
 				 */
 				pmState = PM_WAIT_BACKENDS;
 			}
@@ -2600,6 +2612,8 @@ reaper(SIGNAL_ARGS)
 				BgWriterPID = StartBackgroundWriter();
 			if (WalWriterPID == 0)
 				WalWriterPID = StartWalWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2625,7 +2639,8 @@ reaper(SIGNAL_ARGS)
 		/*
 		 * Was it the bgwriter?  Normal exit can be ignored; we'll start a new
 		 * one at the next iteration of the postmaster's main loop, if
-		 * necessary.  Any other exit condition is treated as a crash.
+		 * necessary.  Any other exit condition is treated as a crash.  Likewise
+		 * for bgreclaimer.
 		 */
 		if (pid == BgWriterPID)
 		{
@@ -2636,6 +2651,17 @@ reaper(SIGNAL_ARGS)
 			continue;
 		}
 
+		if (pid == BgReclaimerPID)
+		{
+			BgReclaimerPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background reclaimer process"));
+			continue;
+		}
+
+
+
 		/*
 		 * Was it the checkpointer?
 		 */
@@ -2997,7 +3023,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, bgreclaimer or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3201,6 +3227,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the bgreclaimer too */
+	if (pid == BgReclaimerPID)
+		BgReclaimerPID = 0;
+	else if (BgReclaimerPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) BgReclaimerPID)));
+		signal_child(BgReclaimerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/*
 	 * Force a power-cycle of the pgarch process too.  (This isn't absolutely
 	 * necessary, but it seems like a good idea for robustness, and it
@@ -3371,14 +3409,14 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
 		 * (including autovac workers), no bgworkers (including unconnected
-		 * ones), and no walwriter, autovac launcher or bgwriter.  If we are
-		 * doing crash recovery or an immediate shutdown then we expect the
-		 * checkpointer to exit as well, otherwise not. The archiver, stats,
-		 * and syslogger processes are disregarded since they are not
-		 * connected to shared memory; we also disregard dead_end children
-		 * here. Walsenders are also disregarded, they will be terminated
-		 * later after writing the checkpoint record, like the archiver
-		 * process.
+		 * ones), and no walwriter, autovac launcher, bgwriter or bgreclaimer.
+		 * If we are doing crash recovery or an immediate shutdown then we
+		 * expect the checkpointer to exit as well, otherwise not. The
+		 * archiver, stats, and syslogger processes are disregarded since they
+		 * are not connected to shared memory; we also disregard dead_end
+		 * children here. Walsenders are also disregarded, they will be
+		 * terminated later after writing the checkpoint record, like the
+		 * archiver process.
 		 */
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
 			CountUnconnectedWorkers() == 0 &&
@@ -3388,7 +3426,8 @@ PostmasterStateMachine(void)
 			(CheckpointerPID == 0 ||
 			 (!FatalError && Shutdown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
-			AutoVacPID == 0)
+			AutoVacPID == 0 &&
+			BgReclaimerPID == 0)
 		{
 			if (Shutdown >= ImmediateShutdown || FatalError)
 			{
@@ -3486,6 +3525,7 @@ PostmasterStateMachine(void)
 			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
+			Assert(BgReclaimerPID == 0);
 			/* syslogger is not considered here */
 			pmState = PM_NO_CHILDREN;
 		}
@@ -3698,6 +3738,8 @@ TerminateChildren(int signal)
 		signal_child(WalReceiverPID, signal);
 	if (AutoVacPID != 0)
 		signal_child(AutoVacPID, signal);
+	if (BgReclaimerPID != 0)
+		signal_child(BgReclaimerPID, signal);
 	if (PgArchPID != 0)
 		signal_child(PgArchPID, signal);
 	if (PgStatPID != 0)
@@ -4778,6 +4820,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		CheckpointerPID = StartCheckpointer();
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(BgReclaimerPID == 0);
+		BgReclaimerPID = StartBackgroundReclaimer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -5122,6 +5166,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 						(errmsg("could not fork WAL receiver process: %m")));
 				break;
+			case BgReclaimerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork background writer process: %m")));
+				break;
 			default:
 				ereport(LOG,
 						(errmsg("could not fork process: %m")));
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..e671d75 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,10 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
-exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+* BufferStrategyControl contains a spinlock freelist_lck that provides mutual
+exclusion for operations that access the buffer freelist or select
+buffers for replacement.  It also contains victimbuf_lck that protects
+information related to the current clock sweep condition.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -160,16 +156,20 @@ Normal Buffer Replacement Strategy
 
 There is a "free list" of buffers that are prime candidates for replacement.
 In particular, buffers that are completely free (contain no valid page) are
-always in this list.  We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that.  The list is singly-linked using fields in the
+always in this list.  Allocating pages from this list is much cheaper than
+running the "clock sweep" algorithm, which may encounter many buffers
+that are poor candidates for eviction before finding a good candidate.
+Therefore, we have a background process called bgreclaimer which works
+to keep this list populated.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the freelist_lck, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations.  It works like
-this:
+need to take system-wide locks during common operations.  The background
+reclaimer attempts to keep regular backends from having to run clock sweep
+by maintaining buffers on freelist, however backends are also empowered
+to run clock sweep. Clock sweep works like this:
 
 Each buffer header contains a usage counter, which is incremented (up to a
 small limit value) whenever the buffer is pinned.  (This requires only the
@@ -178,25 +178,28 @@ buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+victimbuf_lck spinlock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain spinlock freelist_lck.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer and release
+the freelist_lck.  Now set the bgwriter or bgreclaimer latch if required.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
+3. If we get the buffer, check if it is neither pinned nor
+has nonzero usage count, pin the buffer, and return the buffer.
+Otherwise again try to get the buffer from freelist and return
+to the start of step 3.
+
+4. Otherwise, select the buffer pointed to by nextVictimBuffer, and
 circularly advance nextVictimBuffer for next time.
 
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
+5. If the selected buffer is pinned or has a nonzero usage count, it cannot
+be used.  Decrement its usage count (if nonzero) and return to step 4 to
 examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+6. Pin the selected buffer, and return the buffer.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +262,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take the spinlock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
@@ -281,3 +284,38 @@ As of 8.4, background writer starts during recovery mode when there is
 some form of potentially extended recovery to perform. It performs an
 identical service to normal processing, except that checkpoints it
 writes are technically restartpoints.
+
+
+Background Reclaimer's Processing
+---------------------------------
+
+The background reclaimer runs the clock sweep to identify buffers that
+are good candidates for eviction and puts them on the freelist.  This
+makes buffer allocation much faster, since removing a buffer from the
+head of a linked list is much cheaper than linearly scanning the whole
+buffer pool until a promising candidate is found.  It's possible that
+a buffer we add to the freelist may be accessed or even pinned before
+it's evicted; if that happens, the backend that would have evicted it
+will simply disregard it and take the next buffer instead (or run the
+clock sweep itself, if necessary).  However, to make sure that doesn't
+happen too often, we need to keep the freelist as short as possible,
+so that there won't be many other buffer accesses between when the
+time a buffer is added to the freelist and the time when it's actually
+evicted.
+
+We use two water marks to control the activity of the bgreclaimer
+process.  Each time bgreclaimer is awoken, it will move buffers to the
+freelist until the length of the free list reaches the high water
+mark.  It will then sleep.  When the number of buffers on the freelist
+reaches the low water mark, backends attempting to allocate new
+buffers will set the bgreclaimer's latch, waking it up again.  While
+it's important for the high water mark to be small (for the reasons
+described above), we also need to ensure adequate separation between
+the low and high water marks, so that the bgreclaimer isn't constantly
+being awoken to find just a handful of additional candidate buffers,
+and we need to ensure that the low watermark is adequate to keep the
+freelist from becoming completely empty before bgreclaimer has time to
+wake up and beginning filling it again.
+
+To execute clock sweep, bgreclaimer advances the strategy point
+(victim buffer) whereas bgwriter always scan ahead of strategy point.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3240432..559d393 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -889,15 +889,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		bool		lock_held;
-
 		/*
 		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!  Also (in most cases) the BufFreelistLock is
-		 * still held, since it would be bad to hold the spinlock while
-		 * possibly waking up other processes.
+		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &lock_held);
+		buf = StrategyGetBuffer(strategy);
 
 		Assert(buf->refcount == 0);
 
@@ -907,10 +903,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		/* Pin the buffer and then release the buffer spinlock */
 		PinBuffer_Locked(buf);
 
-		/* Now it's safe to release the freelist lock */
-		if (lock_held)
-			LWLockRelease(BufFreelistLock);
-
 		/*
 		 * If the buffer was dirty, try to write it out.  There is a race
 		 * condition here, in that someone might dirty it after we released it
@@ -1933,6 +1925,92 @@ BgBufferSync(void)
 }
 
 /*
+ * Move buffers with reference and a usage_count of zero to freelist. By
+ * maintaining enough buffers in the freelist (up to the list's high water
+ * mark), we drastically reduce the likelihood of individual backends
+ * having to perform the clock sweep themselves.
+ *
+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */
+void
+BgMoveBuffersToFreelist(void)
+{
+	uint32	num_needed_on_freelist = 0;
+	uint32	recent_alloc = 0;
+	uint32  recent_backend_clocksweep = 0;
+	volatile uint32	next_victim = 0;
+
+	/* Execute the clock sweep */
+	for (;;)
+	{
+		uint32	tmp_num_needed_on_freelist;
+		uint32	tmp_recent_alloc;
+		uint32  tmp_recent_backend_clocksweep;
+
+		StrategyGetFreelistAccessInfo(&tmp_num_needed_on_freelist,
+									  &tmp_recent_alloc,
+									  &tmp_recent_backend_clocksweep);
+
+		num_needed_on_freelist += tmp_num_needed_on_freelist;
+		recent_alloc += tmp_recent_alloc;
+		recent_backend_clocksweep += tmp_recent_backend_clocksweep;
+
+		if (tmp_num_needed_on_freelist == 0)
+			break;
+
+		while (tmp_num_needed_on_freelist > 0)
+		{
+			volatile BufferDesc *bufHdr;
+			bool	add_to_freelist = false;
+
+			/*
+			 * Choose next victim buffer to look if that can be moved
+			 * to freelist.
+			 */
+			StrategySyncNextVictimBuffer(&next_victim);
+
+			bufHdr = &BufferDescriptors[next_victim];
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot
+			 * move it to freelist; decrement the usage_count (unless pinned)
+			 * and keep scanning.
+			 */
+			LockBufHdr(bufHdr);
+			if (bufHdr->refcount == 0)
+			{
+				if (bufHdr->usage_count > 0)
+					bufHdr->usage_count--;
+				else
+					add_to_freelist = true;
+			}
+			UnlockBufHdr(bufHdr);
+
+			if (add_to_freelist && StrategyMoveBufferToFreeListTail(bufHdr))
+				tmp_num_needed_on_freelist--;
+		}
+
+		/*
+		 * Report buffer alloc and buffer request not satisfied
+		 * from freelist counts to pgstat.
+		 */
+		BgWriterStats.m_buf_alloc += recent_alloc;
+		BgWriterStats.m_buf_backend_clocksweep += recent_backend_clocksweep;
+
+		/*
+		 * Send off activity statistics to the stats collector
+		 */
+		pgstat_send_bgwriter();
+	}
+
+#ifdef BGW_DEBUG
+	elog(DEBUG1, "bgreclaimer: recent_alloc=%u recent_backend_clocksweep =%d next_victim=%d num_freed=%u",
+		 recent_alloc, recent_backend_clocksweep, next_victim, num_needed_on_freelist);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..e1d8445 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -37,19 +38,48 @@ typedef struct
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
-	 * overflow during a single bgwriter cycle.
+	 * overflow during a single bgwriter cycle.  completePasses is only
+	 * recorded by bgwriter, numBufferBackendClocksweep is only recorded
+	 * by bgreclaimer, however numBufferAllocs is recorded by both bgwriter
+	 * and bgreclaimer.
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
+	/* Buffers not statistied from freelist since last reset */
+	uint32		numBufferBackendClocksweep;
+
+	/*
+	 * protects freelist and related variables (firstFreeBuffer,
+	 * lastFreeBuffer, numBufferAllocs, numBufferBackendClocksweep,
+	 * numFreeListBuffers, BufferDesc->freeNext).
+	 */
+	slock_t	     freelist_lck;
+
 	/*
-	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
+	 * Protects nextVictimBuffer and completePasses. We need separate
+	 * lock to protect victim buffer and completePasses so that
+	 * clock sweep of one backend doesn't contend with another backend
+	 * which is evicting buffer from freelist.  We can consider having
+	 * victimbuf_lck and freelist_lck in separate cache lines by keeping
+	 * them apart in structure and by adding padding bytes, however at
+	 * the moment there is no proof that having them in same cache line
+	 * hits the performance in any scenario.
+	 */
+	slock_t	     victimbuf_lck;
+
+	/*
+	 * Latch to wake bgwriter.
 	 */
 	Latch	   *bgwriterLatch;
+	/*
+	 * Latch to wake bgreclaimer.
+	 */
+	Latch	   *bgreclaimerLatch;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -84,8 +114,28 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[1];		/* VARIABLE SIZE ARRAY */
 }	BufferAccessStrategyData;
 
+/*
+ * Water mark indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakes up and begins
+ * adding buffers to the freelist until it reaches the high water mark and
+ * then again goes back to sleep.
+ */
+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/*
+ * Percentage indicators for maintaining buffers on freelist.
+ * High water mark is percentage of total number of buffers (NBuffers).
+ * and Low water mark is percentage of the high water mark.
+*/
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.2
+#define MIN_HIGH_WATER_MARK	5
+#define MAX_HIGH_WATER_MARK	2000
 
 /* Prototypes for internal functions */
+static volatile BufferDesc *GetBufferFromFreelist(BufferAccessStrategy strategy);
 static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
 static void AddBufferToRing(BufferAccessStrategy strategy,
 				volatile BufferDesc *buf);
@@ -101,103 +151,40 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
  *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
  *
  *	To ensure that no one else can pin the buffer before we do, we must
- *	return the buffer with the buffer header spinlock still held.  If
- *	*lock_held is set on exit, we have returned with the BufFreelistLock
- *	still held, as well; the caller must release that lock once the spinlock
- *	is dropped.  We do it that way because releasing the BufFreelistLock
- *	might awaken other processes, and it would be bad to do the associated
- *	kernel calls while holding the buffer header spinlock.
+ *	return the buffer with the buffer header spinlock still held.
  */
 volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
 {
-	volatile BufferDesc *buf;
-	Latch	   *bgwriterLatch;
+	volatile BufferDesc *buf = NULL;
 	int			trycounter;
 
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need the BufFreelistLock.
+	 * assume strategy objects don't need the freelist_lck.
 	 */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy);
 		if (buf != NULL)
-		{
-			*lock_held = false;
 			return buf;
-		}
 	}
 
-	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
-	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
-	 */
-	StrategyControl->numBufferAllocs++;
-
-	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
-	 */
-	bgwriterLatch = StrategyControl->bgwriterLatch;
-	if (bgwriterLatch)
-	{
-		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
-		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-	}
-
-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
-	{
-		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
-		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
-
-		/* Unconditionally remove buffer from freelist */
-		StrategyControl->firstFreeBuffer = buf->freeNext;
-		buf->freeNext = FREENEXT_NOT_IN_LIST;
-
-		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
-		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
-		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
-		}
-		UnlockBufHdr(buf);
-	}
+	/* Nope, so get the buffer from freelist */
+	buf = GetBufferFromFreelist(strategy);
+	if (buf != NULL)
+		return buf;
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
+
 	for (;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
+		volatile uint32	next_victim;
 
-		if (++StrategyControl->nextVictimBuffer >= NBuffers)
-		{
-			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
-		}
+		StrategySyncNextVictimBuffer(&next_victim);
+
+		buf = &BufferDescriptors[next_victim];
 
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -241,7 +228,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -253,12 +240,51 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		if (buf->freeNext < 0)
 			StrategyControl->lastFreeBuffer = buf->buf_id;
 		StrategyControl->firstFreeBuffer = buf->buf_id;
+		++StrategyControl->numFreeListBuffers;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
 /*
+ * StrategyMoveBufferToFreeListTail: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListTail(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->freelist_lck);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return freed;
+}
+
+
+/*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
  * The result is the buffer index of the best buffer to sync first.
@@ -274,20 +300,73 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
 	result = StrategyControl->nextVictimBuffer;
+
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+
 	if (num_buf_alloc)
 	{
+		SpinLockAcquire(&StrategyControl->freelist_lck);
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
+		SpinLockRelease(&StrategyControl->freelist_lck);
 	}
-	LWLockRelease(BufFreelistLock);
 	return result;
 }
 
 /*
+ * StrategyGetFreelistAccessInfo -- get information required by bgreclaimer
+ * to move unused buffers to freelist.
+ *
+ * The result is the number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs and buffer allocs not
+ * satisfied from freelist.
+ */
+void
+StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
+							  uint32 *num_buf_backend_clocksweep)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighWaterMark)
+		*num_buf_to_free = freelistHighWaterMark - curfreebuffers;
+	else
+		*num_buf_to_free = 0;
+
+	*num_buf_alloc = StrategyControl->numBufferAllocs;
+	StrategyControl->numBufferAllocs = 0;
+
+	*num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
+	StrategyControl->numBufferBackendClocksweep = 0;
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	return;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell bgreclaimer where to start looking
+ * for next unused buffer.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	*next_victim_buffer = StrategyControl->nextVictimBuffer;
+	if (++StrategyControl->nextVictimBuffer >= NBuffers)
+	{
+		StrategyControl->nextVictimBuffer = 0;
+		StrategyControl->completePasses++;
+	}
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -299,15 +378,27 @@ void
 StrategyNotifyBgWriter(Latch *bgwriterLatch)
 {
 	/*
-	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * We acquire the freelist_lck just to ensure that the store appears
 	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
 	 * infrequently, so there's no performance penalty from being safe.
 	 */
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->freelist_lck);
 	StrategyControl->bgwriterLatch = bgwriterLatch;
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->freelist_lck);
 }
 
+/*
+ * StrategyInitBgReclaimerLatch -- Initialize bgreclaimer latch.
+ * This will be used by bgreclaimer to wake itself when backend
+ * sets this latch.
+ */
+void
+StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch)
+{
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+	StrategyControl->bgreclaimerLatch = bgreclaimerLatch;
+	SpinLockRelease(&StrategyControl->freelist_lck);
+}
 
 /*
  * StrategyShmemSize
@@ -376,6 +467,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -383,12 +475,33 @@ StrategyInitialize(bool init)
 		/* Clear statistics */
 		StrategyControl->completePasses = 0;
 		StrategyControl->numBufferAllocs = 0;
+		StrategyControl->numBufferBackendClocksweep = 0;
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		StrategyControl->bgreclaimerLatch = NULL;
+		SpinLockInit(&StrategyControl->freelist_lck);
+		SpinLockInit(&StrategyControl->victimbuf_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high water mark number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.  We need to maintain enough buffers so that
+	 * requests can be satisfied from freelist.  These numbers
+	 * are based on results of benchmarks at various workloads.
+	 */
+	freelistHighWaterMark = HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT * NBuffers;
+	if (freelistHighWaterMark < MIN_HIGH_WATER_MARK)
+		freelistHighWaterMark = MIN_HIGH_WATER_MARK;
+	else if (freelistHighWaterMark > MAX_HIGH_WATER_MARK)
+		freelistHighWaterMark = MAX_HIGH_WATER_MARK;
+
+	freelistLowWaterMark = LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT *
+						   freelistHighWaterMark;
 }
 
 
@@ -467,6 +580,118 @@ FreeAccessStrategy(BufferAccessStrategy strategy)
 }
 
 /*
+ * GetBufferFromFreelist -- returns a buffer from the freelist, or NULL if the
+ *		freelist is empty.
+ *
+ * The bufhdr spin lock is held on the returned buffer.
+ */
+static volatile BufferDesc *
+GetBufferFromFreelist(BufferAccessStrategy strategy)
+{
+	volatile BufferDesc *buf = NULL;
+	Latch	   *bgwriterLatch;
+	Latch	   *bgreclaimerLatch;
+	int			numFreeListBuffers;
+
+	SpinLockAcquire(&StrategyControl->freelist_lck);
+
+	/*
+	 * We count buffer allocation requests so that the bgwriter or bgreclaimer
+	 * can know the rate of buffer consumption and report it as stats.  Note
+	 * that buffers recycled by a strategy object are intentionally not counted
+	 * here.
+	 */
+	StrategyControl->numBufferAllocs++;
+
+	/*
+	 * Remember the values of bgwriter and bgreclaimer latch so that they can
+	 * be set outside spin lock and try to get a buffer from the freelist.
+	 */
+	bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
+	bgwriterLatch = StrategyControl->bgwriterLatch;
+	if (bgwriterLatch)
+		StrategyControl->bgwriterLatch = NULL;
+
+	numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+	if (StrategyControl->firstFreeBuffer >= 0)
+	{
+		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+		/* Unconditionally remove buffer from freelist */
+		StrategyControl->firstFreeBuffer = buf->freeNext;
+		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		--StrategyControl->numFreeListBuffers;
+	}
+	else
+		StrategyControl->numBufferBackendClocksweep++;
+
+	SpinLockRelease(&StrategyControl->freelist_lck);
+
+	/* If bgwriterLatch is set, we need to waken the bgwriter */
+	if (bgwriterLatch)
+		SetLatch(bgwriterLatch);
+
+	/*
+	 * If the number of free buffers has fallen below the low water mark,
+	 * awaken the bgreclaimer to repopulate it.  bgreclaimerLatch is initialized in
+	 * early phase of BgReclaimer startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
+	 */
+	if (numFreeListBuffers < freelistLowWaterMark  && bgreclaimerLatch)
+		SetLatch(StrategyControl->bgreclaimerLatch);
+
+	if (buf != NULL)
+	{
+		/*
+		 * Try to get a buffer from the freelist.  Note that the freeNext fields
+		 * are considered to be protected by the freelist_lck not the
+		 * individual buffer spinlocks, so it's OK to manipulate them without
+		 * holding the buffer spinlock.
+		 */
+		for(;;)
+		{
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
+
+			SpinLockAcquire(&StrategyControl->freelist_lck);
+
+			if (StrategyControl->firstFreeBuffer >= 0)
+			{
+				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+				Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+				/* Unconditionally remove buffer from freelist */
+				StrategyControl->firstFreeBuffer = buf->freeNext;
+				buf->freeNext = FREENEXT_NOT_IN_LIST;
+				--StrategyControl->numFreeListBuffers;
+
+				SpinLockRelease(&StrategyControl->freelist_lck);
+			}
+			else
+			{
+				StrategyControl->numBufferBackendClocksweep++;
+				SpinLockRelease(&StrategyControl->freelist_lck);
+				break;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+/*
  * GetBufferFromRing -- returns a buffer from the ring, or NULL if the
  *		ring is empty.
  *
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 44ccd37..00d815f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -100,6 +100,7 @@ extern Datum pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS);
 
 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_xact_tuples_returned(PG_FUNCTION_ARGS);
@@ -1496,6 +1497,12 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 }
 
 Datum
+pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_backend_clocksweep);
+}
+
+Datum
 pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index d30d21a..b83fe8e 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2783,6 +2783,8 @@ DATA(insert OID = 3063 ( pg_stat_get_buf_fsync_backend PGNSP PGUID 12 1 0 0 0 f
 DESCR("statistics: number of backend buffer writes that did their own fsync");
 DATA(insert OID = 2859 ( pg_stat_get_buf_alloc			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_alloc _null_ _null_ _null_ ));
 DESCR("statistics: number of buffer allocations");
+DATA(insert OID = 3154 ( pg_stat_get_buf_clocksweep_backend			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_clocksweep_backend _null_ _null_ _null_ ));
+DESCR("statistics: number of buffer allocations not satisfied from freelsit");
 
 DATA(insert OID = 2978 (  pg_stat_get_function_calls		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_function_calls _null_ _null_ _null_ ));
 DESCR("statistics: number of function calls");
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2ba9885..a09eab3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -367,6 +367,7 @@ typedef enum
 	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
+	BgReclaimerProcess,
 
 	NUM_AUXPROCTYPES			/* Must be last! */
 } AuxProcType;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0892533..51a2023 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -397,6 +397,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_backend;
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
+	PgStat_Counter m_buf_backend_clocksweep;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -545,7 +546,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9D
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -670,6 +671,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	PgStat_Counter buf_backend_clocksweep;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/postmaster/bgreclaimer.h b/src/include/postmaster/bgreclaimer.h
new file mode 100644
index 0000000..bbd6943
--- /dev/null
+++ b/src/include/postmaster/bgreclaimer.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.h
+ *	  POSTGRES buffer reclaimer definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgreclaimer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGRECLAIMER_H
+#define _BGRECLAIMER_H
+
+extern void BackgroundReclaimerMain(void) __attribute__((noreturn));
+
+
+#endif   /* _BGRECLAIMER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..b57d95a 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -115,9 +115,8 @@ typedef struct buftag
  * Note: buf_hdr_lock must be held to examine or change the tag, flags,
  * usage_count, refcount, or wait_backend_pid fields.  buf_id field never
  * changes after initialization, so does not need locking.  freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock.  The LWLocks can take
- * care of themselves.  The buf_hdr_lock is *not* used to control access to
- * the data in the buffer!
+ * protected by the freelist_lck not buf_hdr_lock.  The buf_hdr_lock is
+ * *not* used to control access to the data in the buffer!
  *
  * An exception is that if we have the buffer pinned, its tag can't change
  * underneath us, so we can examine the tag without locking the spinlock.
@@ -185,14 +184,19 @@ extern BufferDesc *LocalBufferDescriptors;
  */
 
 /* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-				  bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListTail(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free,
+										  uint32 *num_buf_alloc,
+										  uint32 *num_buf_backend_clocksweep);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgReclaimerLatch(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 42d9120..da4f837 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -200,6 +200,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..754a838 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -89,7 +89,6 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  * if you remove a lock, consider leaving a gap in the numbering sequence for
  * the benefit of DTrace and other external debugging scripts.
  */
-#define BufFreelistLock				(&MainLWLockArray[0].lock)
 #define ShmemIndexLock				(&MainLWLockArray[1].lock)
 #define OidGenLock					(&MainLWLockArray[2].lock)
 #define XidGenLock					(&MainLWLockArray[3].lock)
@@ -136,7 +135,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS  16
+#define NUM_BUFFER_PARTITIONS  128
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..b0688a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, Background reclaimer, checkpointer and WAL writer run
+ * during normal operation.  Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 889bcd2..26aa8ae 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1698,6 +1698,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
     pg_stat_get_buf_written_backend() AS buffers_backend,
     pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
     pg_stat_get_buf_alloc() AS buffers_alloc,
+    pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 pg_stat_database| SELECT d.oid AS datid,
     d.datname,

#71

Gregory Smith

gregsmithpgsql@gmail.com

over 11 years ago

In reply to: Amit Kapila (#67)

Re: Scaling shared buffer eviction

On 9/16/14, 8:18 AM, Amit Kapila wrote:

I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required

This is exactly what I was warning about, as the sort of lesson learned
from the last round of such tuning. There are going to be spots where
trying to tune the code to be aggressive on the hard cases will work
great. But you need to make that dynamic to some degree, such that the
code doesn't waste a lot of time sweeping buffers when the demand for
them is actually weak. That will make all sorts of cases that look like
this slower.

We should be able to tell these apart if there's enough instrumentation
and solid logic inside of the program itself though. The 8.3 era BGW
coped with a lot of these issues using a particular style of moving
average with fast reaction time, plus instrumenting the buffer
allocation rate as accurately as it could. So before getting into
high/low water note questions, are you comfortable that there's a clear,
accurate number that measures the activity level that's important here?
And have you considered ways it might be averaging over time or have a
history that's analyzed? The exact fast approach / slow decay weighted
moving average approach of the 8.3 BGW, the thing that tried to smooth
the erratic data set possible here, was a pretty critical part of
getting itself auto-tuning to workload size. It ended up being much
more important than the work of setting the arbitrary watermark levels.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#72

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Gregory Smith (#71)

Re: Scaling shared buffer eviction

On Mon, Sep 22, 2014 at 10:43 AM, Gregory Smith <gregsmithpgsql@gmail.com>
wrote:

On 9/16/14, 8:18 AM, Amit Kapila wrote:

I think the main reason for slight difference is that
when the size of shared buffers is almost same as data size, the number
of buffers it needs from clock sweep are very less, as an example in first
case (when size of shared buffers is 12286MB), it actually needs at most
256 additional buffers (2MB) via clock sweep, where as bgreclaimer
will put 2000 (high water mark) additional buffers (0.5% of shared buffers
is greater than 2000 ) in free list, so bgreclaimer does some extra work
when it is not required

This is exactly what I was warning about, as the sort of lesson learned
from the last round of such tuning. There are going to be spots where
trying to tune the code to be aggressive on the hard cases will work
great. But you need to make that dynamic to some degree, such that the
code doesn't waste a lot of time sweeping buffers when the demand for them
is actually weak. That will make all sorts of cases that look like this
slower.

To verify whether above can lead to any kind of regression, I have
checked the cases (workload is 0.05 or 0.1 percent larger than shared
buffers) where we need few extra buffers and bgreclaimer might put
some additional buffers and it turns out that in those cases also, there
is a win especially at high concurrency and results of the same are posted
upthread
(
/messages/by-id/CAA4eK1LFGcvzMdcD5NZx7B2gCbP1G7vWK7w32EZk=VOOLUds-A@mail.gmail.com).

We should be able to tell these apart if there's enough instrumentation
and solid logic inside of the program itself though. The 8.3 era BGW coped
with a lot of these issues using a particular style of moving average with
fast reaction time, plus instrumenting the buffer allocation rate as
accurately as it could. So before getting into high/low water note
questions, are you comfortable that there's a clear, accurate number that
measures the activity level that's important here?

Very Good Question. This was exactly the thing which was
missing in my initial versions (about 2 years back when I tried to
solve this problem) but based on Robert's and Andres's feedback
I realized that we need an accurate number to measure the activity
level (in this case it is consumption of buffers from freelist), so
I have introduced the logic to calculate the same (it is stored in new
variable numFreeListBuffers in BufferStrategyControl structure).

And have you considered ways it might be averaging over time or have a
history that's analyzed?

The current logic of bgreclaimer is such that even if it does
some extra activity (extra is very much controlled) in one cycle,
it will not start another cycle unless backends consume all the
buffers that were made available by bgreclaimer in one cycle.
I think the algorithm designed for bgreclaimer automatically
averages out based on activity. Do you see any cases where it
will not do so?

The exact fast approach / slow decay weighted moving average approach of
the 8.3 BGW, the thing that tried to smooth the erratic data set possible
here, was a pretty critical part of getting itself auto-tuning to workload
size. It ended up being much more important than the work of setting the
arbitrary watermark levels.

Agreed, but the logic with which bgwriter works is pretty different
and thats why it needs different kind of logic to handle auto-tuning.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#73

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#69)

Re: Scaling shared buffer eviction

On Fri, Sep 19, 2014 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Specific numbers of both the configurations for which I have
posted data in previous mail are as follows:

Scale Factor - 800
Shared_Buffers - 12286MB (Total db size is 12288MB)
Client and Thread Count = 64
buffers_touched_freelist - count of buffers that backends found touched
after
popping from freelist.
buffers_backend_clocksweep - count of buffer allocations not satisfied
from freelist

buffers_alloc 1531023 buffers_backend_clocksweep 0
buffers_touched_freelist 0

I didn't believe these numbers, so I did some testing. I used the same
configuration you mention here, scale factor = 800, shared_buffers = 12286
MB, and I also saw buffers_backend_clocksweep = 0. I didn't see
buffers_touched_freelist showing up anywhere, so I don't know whether that
would have been zero or not. Then I tried reducing the high watermark for
the freelist from 2000 buffers to 25 buffers, and
buffers_backend_clocksweep was *still* 0. At that point I started to smell
a rat. It turns out that, with this test configuration, there's no buffer
allocation going on at all. Everything fits in shared_buffers, or it did
on my test. I had to reduce shared_buffers down to 10491800kB before I got
any significant buffer eviction.

At that level, a 100-buffer high watermark wasn't sufficient to prevent the
freelist from occasionally going empty. A 2000-buffer high water mark was
by and large sufficient, although I was able to see small numbers of
buffers being allocated via clocksweep right at the very beginning of the
test, I guess before the reclaimer really got cranking. So the watermarks
seem to be broadly in the right ballpark, but I think the statistics
reporting needs improving. We need an easy way to measure the amount of
work that bgreclaimer is actually doing.

I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.
2. The number of times a backend has run the clocksweep.
3. The number of buffers past which the reclaimer has advanced the clock
sweep (i.e. the number of buffers it had to examine in order to reclaim the
number counted by #1).
4. The number of buffers past which a backend has advanced the clocksweep
(i.e. the number of buffers it had to examine in order to allocate the
number of buffers count by #3).
5. The number of buffers allocated from the freelist which the backend did
not use because they'd been touched (what you're calling
buffers_touched_freelist).

It's hard to come up with good names for all of these things that are
consistent with the somewhat wonky existing names. Here's an attempt:

1. bgreclaim_freelist
2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I
think we want to make it more parallel with buffers_alloc, which is the
number of buffers allocated, not buffers_backend, the number of buffers
*written* by a backend)
3. clocksweep_bgreclaim
4. clocksweep_backend
5. freelist_touched

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#74

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#73)

Re: Scaling shared buffer eviction

Hi,

On 2014-09-23 10:31:24 -0400, Robert Haas wrote:

I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.
2. The number of times a backend has run the clocksweep.
3. The number of buffers past which the reclaimer has advanced the clock
sweep (i.e. the number of buffers it had to examine in order to reclaim the
number counted by #1).

4. The number of buffers past which a backend has advanced the clocksweep
(i.e. the number of buffers it had to examine in order to allocate the
number of buffers count by #3).

5. The number of buffers allocated from the freelist which the backend did
not use because they'd been touched (what you're calling
buffers_touched_freelist).

Sounds good.

It's hard to come up with good names for all of these things that are
consistent with the somewhat wonky existing names. Here's an attempt:

1. bgreclaim_freelist

bgreclaim_alloc_clocksweep?

2. buffers_alloc_clocksweep (you've got buffers_backend_clocksweep, but I
think we want to make it more parallel with buffers_alloc, which is the
number of buffers allocated, not buffers_backend, the number of buffers
*written* by a backend)
3. clocksweep_bgreclaim
4. clocksweep_backend

I think bgreclaim/backend should always be either be a prefix or a
postfix. But not one in some variables and some in another.

5. freelist_touched

I wonder if we shouldn't move all this to a new view, instead of
stuffing it somewhere where it really doesn't belong. pg_stat_buffers or
something like it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Robert Haas (#73)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:

[ review ]

Oh, by the way, I noticed that this patch breaks pg_buffercache. If
we're going to have 128 lock partitions, we need to bump
MAX_SIMUL_LWLOCKS.

But this gets at another point: the way we're benchmarking this right
now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

I think it's pretty clear that #1 and #2 are a good idea. #3 is a
mixed bag, and it might account for the regressions you saw on some
test cases. Increasing the number of buffer mapping locks means that
those locks take up more cache lines, which could slow things down in
cases where there's no reduction in contention. It also means that
the chances of an allocated buffer ending up in the same buffer
mapping lock partition are 1/128 instead of 1/16, which means about
5.4 additional lwlock acquire/release cycles per 100 allocations.
That's not a ton, but it's not free either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#76

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Robert Haas (#75)

1 attachment(s)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:

But this gets at another point: the way we're benchmarking this right
now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

I did some more experimentation on this. Attached is a patch that
JUST does #1, and, as previously suggested, it uses a single spinlock
instead of using two of them that are probably in the same cacheline.
Without this patch, on a 32-client, read-only pgbench at scale factor
1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about
LWLock-inspired preemption:

         - LWLockAcquire
            + 68.41% ReadBuffer_common
            + 31.59% StrategyGetBuffer

With the patch, unsurprisingly, StrategyGetBuffer disappears and the
only lwlocks that are causing context-switches are the individual
buffer locks. But are we suffering spinlock contention instead as a
result? Yes, but not much. s_lock is at 0.41% in the corresponding
profile, and only 6.83% of those calls are from the patched
StrategyGetBuffer. In a similar profile of master, s_lock is at
0.31%. So there's a bit of additional s_lock contention, but I think
it's considerably less than the contention over the lwlock it's
replacing, because the spinlock is only held for the minimal amount of
time needed, whereas the lwlock could be held across taking and
releasing many individual buffer locks.

TPS results are a little higher with the patch - these are alternating
5 minute runs:

master tps = 176010.647944 (including connections establishing)
master tps = 176615.291149 (including connections establishing)
master tps = 175648.370487 (including connections establishing)
reduce-replacement-locking tps = 177888.734320 (including connections
establishing)
reduce-replacement-locking tps = 177797.842410 (including connections
establishing)
reduce-replacement-locking tps = 177894.822656 (including connections
establishing)

The picture is similar at 64 clients, but the benefit is a little more:

master tps = 179037.231597 (including connections establishing)
master tps = 180500.937068 (including connections establishing)
master tps = 181565.706514 (including connections establishing)
reduce-replacement-locking tps = 185741.503425 (including connections
establishing)
reduce-replacement-locking tps = 188598.728062 (including connections
establishing)
reduce-replacement-locking tps = 187340.977277 (including connections
establishing)

What's interesting is that I can't see in the perf output any real
sign that the buffer mapping locks are slowing things down, but they
clearly are, because when I take this patch and also boost
NUM_BUFFER_PARTITIONS to 128, the performance goes up:

reduce-replacement-locking-128 tps = 251001.812843 (including
connections establishing)
reduce-replacement-locking-128 tps = 247368.925927 (including
connections establishing)
reduce-replacement-locking-128 tps = 250775.304177 (including
connections establishing)

The performance also goes up if I do that on master, but the effect is
substantially less:

master-128 tps = 219301.492902 (including connections establishing)
master-128 tps = 219786.249076 (including connections establishing)
master-128 tps = 219821.220271 (including connections establishing)

I think this shows pretty clearly that, even without the bgreclaimer,
there's a lot of merit in getting rid of BufFreelistLock and using a
spinlock held for the absolutely minimal number of instructions
instead. There's already some benefit without doing anything about
the buffer mapping locks, and we'll get a lot more benefit once that
issue is addressed. I think we need to do some serious study to see
whether bgreclaimer is even necessary, because it looks like this
change alone, which is much simpler, takes a huge bite out of the
scalability problem.

I welcome further testing and comments, but my current inclination is
to go ahead and push the attached patch. To my knowledge, nobody has
at any point objected to this aspect of what's being proposed, and it
looks safe to me and seems to be a clear win. We can then deal with
the other issues on their own merits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

reduce-replacement-locking.patchtext/x-patch; charset=US-ASCII; name=reduce-replacement-locking.patchDownload

commit df4077cda2eae3eb4a5cf387da0c1e7616e73204
Author: Robert Haas <rhaas@postgresql.org>
Date:   Mon Sep 22 16:42:14 2014 -0400

    Remove volatile qualifiers from lwlock.c.
    
    Now that spinlocks (hopefully!) act as compiler barriers, as of commit
    0709b7ee72e4bc71ad07b7120acd117265ab51d0, this should be safe.  This
    serves as a demonstration of the new coding style, and may be optimized
    better on some machines as well.

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 7c96da5..66fb2e4 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -112,7 +112,7 @@ static lwlock_stats lwlock_stats_dummy;
 bool		Trace_lwlocks = false;
 
 inline static void
-PRINT_LWDEBUG(const char *where, const volatile LWLock *lock)
+PRINT_LWDEBUG(const char *where, const LWLock *lock)
 {
 	if (Trace_lwlocks)
 		elog(LOG, "%s(%s %d): excl %d shared %d head %p rOK %d",
@@ -406,9 +406,7 @@ LWLock *
 LWLockAssign(void)
 {
 	LWLock	   *result;
-
-	/* use volatile pointer to prevent code rearrangement */
-	volatile int *LWLockCounter;
+	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
 	SpinLockAcquire(ShmemLock);
@@ -429,9 +427,7 @@ int
 LWLockNewTrancheId(void)
 {
 	int			result;
-
-	/* use volatile pointer to prevent code rearrangement */
-	volatile int *LWLockCounter;
+	int		   *LWLockCounter;
 
 	LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
 	SpinLockAcquire(ShmemLock);
@@ -511,9 +507,8 @@ LWLockAcquireWithVar(LWLock *l, uint64 *valptr, uint64 val)
 
 /* internal function to implement LWLockAcquire and LWLockAcquireWithVar */
 static inline bool
-LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
+LWLockAcquireCommon(LWLock *lock, LWLockMode mode, uint64 *valptr, uint64 val)
 {
-	volatile LWLock *lock = l;
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	bool		result = true;
@@ -525,7 +520,7 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	PRINT_LWDEBUG("LWLockAcquire", lock);
 
 #ifdef LWLOCK_STATS
-	lwstats = get_lwlock_stats_entry(l);
+	lwstats = get_lwlock_stats_entry(lock);
 
 	/* Count lock acquisition attempts */
 	if (mode == LW_EXCLUSIVE)
@@ -642,13 +637,13 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 		 * so that the lock manager or signal manager will see the received
 		 * signal when it next waits.
 		 */
-		LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "waiting");
+		LOG_LWDEBUG("LWLockAcquire", T_NAME(lock), T_ID(lock), "waiting");
 
 #ifdef LWLOCK_STATS
 		lwstats->block_count++;
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock), mode);
 
 		for (;;)
 		{
@@ -659,9 +654,9 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 			extraWaits++;
 		}
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock), mode);
 
-		LOG_LWDEBUG("LWLockAcquire", T_NAME(l), T_ID(l), "awakened");
+		LOG_LWDEBUG("LWLockAcquire", T_NAME(lock), T_ID(lock), "awakened");
 
 		/* Now loop back and try to acquire lock again. */
 		retry = true;
@@ -675,10 +670,10 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), T_ID(lock), mode);
 
 	/* Add lock to list of locks held by this backend */
-	held_lwlocks[num_held_lwlocks++] = l;
+	held_lwlocks[num_held_lwlocks++] = lock;
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -697,9 +692,8 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
  * If successful, cancel/die interrupts are held off until lock release.
  */
 bool
-LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
+LWLockConditionalAcquire(LWLock *lock, LWLockMode mode)
 {
-	volatile LWLock *lock = l;
 	bool		mustwait;
 
 	PRINT_LWDEBUG("LWLockConditionalAcquire", lock);
@@ -747,14 +741,16 @@ LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
 	{
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
-		LOG_LWDEBUG("LWLockConditionalAcquire", T_NAME(l), T_ID(l), "failed");
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(l), T_ID(l), mode);
+		LOG_LWDEBUG("LWLockConditionalAcquire",
+					T_NAME(lock), T_ID(lock), "failed");
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE_FAIL(T_NAME(lock),
+												 T_ID(lock), mode);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = l;
-		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(l), T_ID(l), mode);
+		held_lwlocks[num_held_lwlocks++] = lock;
+		TRACE_POSTGRESQL_LWLOCK_CONDACQUIRE(T_NAME(lock), T_ID(lock), mode);
 	}
 
 	return !mustwait;
@@ -775,9 +771,8 @@ LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
  * wake up, observe that their records have already been flushed, and return.
  */
 bool
-LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
+LWLockAcquireOrWait(LWLock *lock, LWLockMode mode)
 {
-	volatile LWLock *lock = l;
 	PGPROC	   *proc = MyProc;
 	bool		mustwait;
 	int			extraWaits = 0;
@@ -788,7 +783,7 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 	PRINT_LWDEBUG("LWLockAcquireOrWait", lock);
 
 #ifdef LWLOCK_STATS
-	lwstats = get_lwlock_stats_entry(l);
+	lwstats = get_lwlock_stats_entry(lock);
 #endif
 
 	/* Ensure we will have room to remember the lock */
@@ -855,13 +850,14 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 		 * Wait until awakened.  Like in LWLockAcquire, be prepared for bogus
 		 * wakups, because we share the semaphore with ProcWaitForSignal.
 		 */
-		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "waiting");
+		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(lock), T_ID(lock),
+					"waiting");
 
 #ifdef LWLOCK_STATS
 		lwstats->block_count++;
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock), mode);
 
 		for (;;)
 		{
@@ -872,9 +868,10 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 			extraWaits++;
 		}
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock), mode);
 
-		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "awakened");
+		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(lock), T_ID(lock),
+					"awakened");
 	}
 	else
 	{
@@ -892,14 +889,16 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
 	{
 		/* Failed to get lock, so release interrupt holdoff */
 		RESUME_INTERRUPTS();
-		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(l), T_ID(l), "failed");
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(l), T_ID(l), mode);
+		LOG_LWDEBUG("LWLockAcquireOrWait", T_NAME(lock), T_ID(lock), "failed");
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT_FAIL(T_NAME(lock), T_ID(lock),
+													 mode);
 	}
 	else
 	{
 		/* Add lock to list of locks held by this backend */
-		held_lwlocks[num_held_lwlocks++] = l;
-		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(l), T_ID(l), mode);
+		held_lwlocks[num_held_lwlocks++] = lock;
+		TRACE_POSTGRESQL_LWLOCK_ACQUIRE_OR_WAIT(T_NAME(lock), T_ID(lock),
+												mode);
 	}
 
 	return !mustwait;
@@ -924,10 +923,8 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
  * in shared mode, returns 'true'.
  */
 bool
-LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
+LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
 {
-	volatile LWLock *lock = l;
-	volatile uint64 *valp = valptr;
 	PGPROC	   *proc = MyProc;
 	int			extraWaits = 0;
 	bool		result = false;
@@ -938,7 +935,7 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 	PRINT_LWDEBUG("LWLockWaitForVar", lock);
 
 #ifdef LWLOCK_STATS
-	lwstats = get_lwlock_stats_entry(l);
+	lwstats = get_lwlock_stats_entry(lock);
 #endif   /* LWLOCK_STATS */
 
 	/*
@@ -981,7 +978,7 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		}
 		else
 		{
-			value = *valp;
+			value = *valptr;
 			if (value != oldval)
 			{
 				result = false;
@@ -1023,13 +1020,14 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		 * so that the lock manager or signal manager will see the received
 		 * signal when it next waits.
 		 */
-		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(l), T_ID(l), "waiting");
+		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(lock), T_ID(lock), "waiting");
 
 #ifdef LWLOCK_STATS
 		lwstats->block_count++;
 #endif
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), T_ID(lock),
+										   LW_EXCLUSIVE);
 
 		for (;;)
 		{
@@ -1040,9 +1038,10 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 			extraWaits++;
 		}
 
-		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+		TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), T_ID(lock),
+										  LW_EXCLUSIVE);
 
-		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(l), T_ID(l), "awakened");
+		LOG_LWDEBUG("LWLockWaitForVar", T_NAME(lock), T_ID(lock), "awakened");
 
 		/* Now loop back and check the status of the lock again. */
 	}
@@ -1050,7 +1049,7 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+	TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), T_ID(lock), LW_EXCLUSIVE);
 
 	/*
 	 * Fix the process wait semaphore's count for any absorbed wakeups.
@@ -1078,10 +1077,8 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
  * The caller must be holding the lock in exclusive mode.
  */
 void
-LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
+LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
 {
-	volatile LWLock *lock = l;
-	volatile uint64 *valp = valptr;
 	PGPROC	   *head;
 	PGPROC	   *proc;
 	PGPROC	   *next;
@@ -1093,7 +1090,7 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
 	Assert(lock->exclusive == 1);
 
 	/* Update the lock's value */
-	*valp = val;
+	*valptr = val;
 
 	/*
 	 * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
@@ -1139,9 +1136,8 @@ LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
  * LWLockRelease - release a previously acquired lock
  */
 void
-LWLockRelease(LWLock *l)
+LWLockRelease(LWLock *lock)
 {
-	volatile LWLock *lock = l;
 	PGPROC	   *head;
 	PGPROC	   *proc;
 	int			i;
@@ -1154,11 +1150,11 @@ LWLockRelease(LWLock *l)
 	 */
 	for (i = num_held_lwlocks; --i >= 0;)
 	{
-		if (l == held_lwlocks[i])
+		if (lock == held_lwlocks[i])
 			break;
 	}
 	if (i < 0)
-		elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
+		elog(ERROR, "lock %s %d is not held", T_NAME(lock), T_ID(lock));
 	num_held_lwlocks--;
 	for (; i < num_held_lwlocks; i++)
 		held_lwlocks[i] = held_lwlocks[i + 1];
@@ -1238,14 +1234,15 @@ LWLockRelease(LWLock *l)
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
 
-	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(l), T_ID(l));
+	TRACE_POSTGRESQL_LWLOCK_RELEASE(T_NAME(lock), T_ID(lock));
 
 	/*
 	 * Awaken any waiters I removed from the queue.
 	 */
 	while (head != NULL)
 	{
-		LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
+		LOG_LWDEBUG("LWLockRelease", T_NAME(lock), T_ID(lock),
+					"release waiter");
 		proc = head;
 		head = proc->lwWaitLink;
 		proc->lwWaitLink = NULL;

#77

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Robert Haas (#76)

1 attachment(s)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I did some more experimentation on this. Attached is a patch that
JUST does #1, and, ...

...and that was the wrong patch. Thanks to Heikki for point that out.

Second try.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

reduce-replacement-locking.patchtext/x-patch; charset=US-ASCII; name=reduce-replacement-locking.patchDownload

diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..a4ebbcc 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,12 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
+* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
 exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+buffers for replacement.  A spinlock is used here rather than a lightweight
+lock for efficiency; no other locks of any sort should be acquired while
+buffer_strategy_lock is held.  This is essential to allow buffer replacement
+to happen in multiple backends with reasonable concurrency.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -165,7 +163,7 @@ consider their pages unlikely to be needed soon; however, the current
 algorithm never does that.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the buffer_strategy_lock, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
 need to take system-wide locks during common operations.  It works like
@@ -178,25 +176,26 @@ buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+buffer_strategy_lock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain buffer_strategy_lock.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer.  Release
+buffer_strategy_lock.  If the buffer is pinned or has a nonzero usage count,
+it cannot be used; ignore it go back to step 1.  Otherwise, pin the buffer,
+and return it.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
-circularly advance nextVictimBuffer for next time.
+3. Otherwise, the buffer free list is empty.  Select the buffer pointed to by
+nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
+Release buffer_strategy_lock.
 
 4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
-examine the next buffer.
+be used.  Decrement its usage count (if nonzero), reacquire
+buffer_strategy_lock, and return to step 3 to examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+5. Pin the selected buffer, and return.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +258,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take buffer_strategy_lock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3240432..45d1d61 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -889,15 +889,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		bool		lock_held;
-
 		/*
 		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!  Also (in most cases) the BufFreelistLock is
-		 * still held, since it would be bad to hold the spinlock while
-		 * possibly waking up other processes.
+		 * spinlock still held!
 		 */
-		buf = StrategyGetBuffer(strategy, &lock_held);
+		buf = StrategyGetBuffer(strategy);
 
 		Assert(buf->refcount == 0);
 
@@ -907,10 +903,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		/* Pin the buffer and then release the buffer spinlock */
 		PinBuffer_Locked(buf);
 
-		/* Now it's safe to release the freelist lock */
-		if (lock_held)
-			LWLockRelease(BufFreelistLock);
-
 		/*
 		 * If the buffer was dirty, try to write it out.  There is a race
 		 * condition here, in that someone might dirty it after we released it
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..5966beb 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -24,6 +24,9 @@
  */
 typedef struct
 {
+	/* Spinlock: protects the values below */
+	slock_t		buffer_strategy_lock;
+
 	/* Clock sweep hand: index of next buffer to consider grabbing */
 	int			nextVictimBuffer;
 
@@ -101,15 +104,10 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
  *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
  *
  *	To ensure that no one else can pin the buffer before we do, we must
- *	return the buffer with the buffer header spinlock still held.  If
- *	*lock_held is set on exit, we have returned with the BufFreelistLock
- *	still held, as well; the caller must release that lock once the spinlock
- *	is dropped.  We do it that way because releasing the BufFreelistLock
- *	might awaken other processes, and it would be bad to do the associated
- *	kernel calls while holding the buffer header spinlock.
+ *	return the buffer with the buffer header spinlock still held.
  */
 volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
 {
 	volatile BufferDesc *buf;
 	Latch	   *bgwriterLatch;
@@ -117,21 +115,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 
 	/*
 	 * If given a strategy object, see whether it can select a buffer. We
-	 * assume strategy objects don't need the BufFreelistLock.
+	 * assume strategy objects don't need buffer_strategy_lock.
 	 */
 	if (strategy != NULL)
 	{
 		buf = GetBufferFromRing(strategy);
 		if (buf != NULL)
-		{
-			*lock_held = false;
 			return buf;
-		}
 	}
 
 	/* Nope, so lock the freelist */
-	*lock_held = true;
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
@@ -142,22 +136,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 
 	/*
 	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding BufFreelistLock; so release and re-grab.  This
-	 * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-	 * so the performance hit is minimal.
+	 * not do so while holding buffer_strategy_lock; so release and re-grab.
+	 * This is annoyingly tedious, but it happens at most once per bgwriter
+	 * cycle, so the performance hit is minimal.
 	 */
 	bgwriterLatch = StrategyControl->bgwriterLatch;
 	if (bgwriterLatch)
 	{
 		StrategyControl->bgwriterLatch = NULL;
-		LWLockRelease(BufFreelistLock);
+		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 		SetLatch(bgwriterLatch);
-		LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	}
 
 	/*
 	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the BufFreelistLock not the
+	 * are considered to be protected by the buffer_strategy_lock not the
 	 * individual buffer spinlocks, so it's OK to manipulate them without
 	 * holding the spinlock.
 	 */
@@ -171,6 +165,12 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
 
 		/*
+		 * Release the lock so someone else can access the freelist (or run
+		 * the clocksweep) while we check out this buffer.
+		 */
+		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
+		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; discard it and retry.  (This can only happen if VACUUM put a
 		 * valid buffer in the freelist and then someone else used it before
@@ -185,6 +185,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 			return buf;
 		}
 		UnlockBufHdr(buf);
+
+		/* Reacquire the lock and go around for another pass. */
+		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
@@ -199,6 +202,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 			StrategyControl->completePasses++;
 		}
 
+		/* Release the lock before manipulating the candidate buffer. */
+		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
 		 * it; decrement the usage_count (unless pinned) and keep scanning.
@@ -232,6 +238,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 			elog(ERROR, "no unpinned buffers available");
 		}
 		UnlockBufHdr(buf);
+
+		/* Reacquire the lock and get a new candidate buffer. */
+		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	}
 }
 
@@ -241,7 +250,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
 	/*
 	 * It is possible that we are told to put something in the freelist that
@@ -255,7 +264,7 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 		StrategyControl->firstFreeBuffer = buf->buf_id;
 	}
 
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
 
 /*
@@ -274,7 +283,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	result = StrategyControl->nextVictimBuffer;
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
@@ -283,7 +292,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
 	}
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 	return result;
 }
 
@@ -299,13 +308,13 @@ void
 StrategyNotifyBgWriter(Latch *bgwriterLatch)
 {
 	/*
-	 * We acquire the BufFreelistLock just to ensure that the store appears
+	 * We acquire buffer_strategy_lock just to ensure that the store appears
 	 * atomic to StrategyGetBuffer.  The bgwriter should call this rather
 	 * infrequently, so there's no performance penalty from being safe.
 	 */
-	LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	StrategyControl->bgwriterLatch = bgwriterLatch;
-	LWLockRelease(BufFreelistLock);
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
 
 
@@ -370,6 +379,8 @@ StrategyInitialize(bool init)
 		 */
 		Assert(init);
 
+		SpinLockInit(&StrategyControl->buffer_strategy_lock);
+
 		/*
 		 * Grab the whole linked list of free buffers for our strategy. We
 		 * assume it was previously set up by InitBufferPool().
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..0e69b63 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -115,7 +115,7 @@ typedef struct buftag
  * Note: buf_hdr_lock must be held to examine or change the tag, flags,
  * usage_count, refcount, or wait_backend_pid fields.  buf_id field never
  * changes after initialization, so does not need locking.  freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock.  The LWLocks can take
+ * protected by the buffer_strategy_lock not buf_hdr_lock.  The LWLocks can take
  * care of themselves.  The buf_hdr_lock is *not* used to control access to
  * the data in the buffer!
  *
@@ -185,8 +185,7 @@ extern BufferDesc *LocalBufferDescriptors;
  */
 
 /* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-				  bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..90ae7d4 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -89,7 +89,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  * if you remove a lock, consider leaving a gap in the numbering sequence for
  * the benefit of DTrace and other external debugging scripts.
  */
-#define BufFreelistLock				(&MainLWLockArray[0].lock)
+/* 0 is available; was formerly BufFreelistLock */
 #define ShmemIndexLock				(&MainLWLockArray[1].lock)
 #define OidGenLock					(&MainLWLockArray[2].lock)
 #define XidGenLock					(&MainLWLockArray[3].lock)

#78

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Robert Haas (#77)

Re: Scaling shared buffer eviction

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I did some more experimentation on this. Attached is a patch that
JUST does #1, and, ...

...and that was the wrong patch. Thanks to Heikki for point that out.
Second try.

But the results you gave in the previous message were correctly
attributed?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Tom Lane (#78)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Sep 23, 2014 at 4:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I did some more experimentation on this. Attached is a patch that
JUST does #1, and, ...

...and that was the wrong patch. Thanks to Heikki for point that out.
Second try.

But the results you gave in the previous message were correctly
attributed?

The patch I attached the first time was just the last commit in the
git repository where I wrote the patch, rather than the changes that I
made on top of that commit. So, yes, the results from the previous
message are with the patch attached to the follow-up. I just typed
the wrong git command when attempting to extract that patch to attach
it to the email.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80

Gregory Smith

gregsmithpgsql@gmail.com

over 11 years ago

In reply to: Robert Haas (#73)

Re: Scaling shared buffer eviction

On 9/23/14, 10:31 AM, Robert Haas wrote:

I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.
2. The number of times a backend has run the clocksweep.
3. The number of buffers past which the reclaimer has advanced the
clock sweep (i.e. the number of buffers it had to examine in order to
reclaim the number counted by #1).
4. The number of buffers past which a backend has advanced the
clocksweep (i.e. the number of buffers it had to examine in order to
allocate the number of buffers count by #3).
5. The number of buffers allocated from the freelist which the backend
did not use because they'd been touched (what you're calling
buffers_touched_freelist).

All sound reasonable. To avoid wasting time here, I think it's only
worth doing all of these as DEBUG level messages for now. Then only go
through the overhead of exposing the ones that actually seem relevant.
That's what I did for the 8.3 work, and most of that data at this level
was barely relevant to anyone but me then or since. We don't want the
system views to include so much irrelevant trivia that finding the
important parts becomes overwhelming.

I'd like to see that level of instrumentation--just the debug level
messages--used to quantify the benchmarks that people are running
already, to prove they are testing what they think they are. That would
have caught the test error you already stumbled on for example. Simple
paranoia says there may be more issues like that hidden in here
somewhere, and this set you've identified should find any more of them
around.

If all that matches up so the numbers for the new counters seem sane, I
think that's enough to commit the first basic improvement here. Then
make a second pass over exposing the useful internals that let everyone
quantify either the improvement or things that may need to be tunable.

--
Greg Smith greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#81

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#76)

Re: Scaling shared buffer eviction

On 2014-09-23 16:29:16 -0400, Robert Haas wrote:

On Tue, Sep 23, 2014 at 10:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:

But this gets at another point: the way we're benchmarking this right
now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

I did some more experimentation on this. Attached is a patch that
JUST does #1, and, as previously suggested, it uses a single spinlock
instead of using two of them that are probably in the same cacheline.
Without this patch, on a 32-client, read-only pgbench at scale factor
1000 and shared_buffers=8GB, perf -e cs -g -a has this to say about
LWLock-inspired preemption:
- LWLockAcquire
+ 68.41% ReadBuffer_common
+ 31.59% StrategyGetBuffer
With the patch, unsurprisingly, StrategyGetBuffer disappears and the
only lwlocks that are causing context-switches are the individual
buffer locks. But are we suffering spinlock contention instead as a
result? Yes, but not much. s_lock is at 0.41% in the corresponding
profile, and only 6.83% of those calls are from the patched
StrategyGetBuffer. In a similar profile of master, s_lock is at
0.31%. So there's a bit of additional s_lock contention, but I think
it's considerably less than the contention over the lwlock it's
replacing, because the spinlock is only held for the minimal amount of
time needed, whereas the lwlock could be held across taking and
releasing many individual buffer locks.

Am I understanding you correctly that you also measured context switches
for spinlocks? If so, I don't think that's a valid comparison. LWLocks
explicitly yield the CPU as soon as there's any contention while
spinlocks will, well, spin. Sure they also go to sleep, but it'll take
longer.

It's also worthwile to remember in such comparisons that lots of the
cost of spinlocks will be in the calling routines, not s_lock() - the
first TAS() is inlined into it. And that's the one that'll incur cache
misses and such...

Note that I'm explicitly *not* doubting the use of a spinlock
itself. Given the short acquiration times and the exclusive use of
exlusive acquiration a spinlock makes more sense. The lwlock's spinlock
alone will have about as much contention.

I think it might be possible to construct some cases where the spinlock
performs worse than the lwlock. But I think those will be clearly in the
minority. And at least some of those will be fixed by bgwriter. As an
example consider a single backend COPY ... FROM of large files with a
relatively large s_b. That's a seriously bad workload for the current
victim buffer selection because frequently most of the needs to be
searched for a buffer with usagecount 0. And this patch will double the
amount of atomic ops during that.

Let me try to quantify that.

What's interesting is that I can't see in the perf output any real
sign that the buffer mapping locks are slowing things down, but they
clearly are, because when I take this patch and also boost
NUM_BUFFER_PARTITIONS to 128, the performance goes up:

What did events did you profile?

I welcome further testing and comments, but my current inclination is
to go ahead and push the attached patch. To my knowledge, nobody has
at any point objected to this aspect of what's being proposed, and it
looks safe to me and seems to be a clear win. We can then deal with
the other issues on their own merits.

I've took a look at this, and all the stuff I saw that I disliked were
there before this patch. So +1 for going ahead.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#82

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Gregory Smith (#80)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 6:02 PM, Gregory Smith <gregsmithpgsql@gmail.com> wrote:

On 9/23/14, 10:31 AM, Robert Haas wrote:

I suggest we count these things:

1. The number of buffers the reclaimer has put back on the free list.
2. The number of times a backend has run the clocksweep.
3. The number of buffers past which the reclaimer has advanced the clock
sweep (i.e. the number of buffers it had to examine in order to reclaim the
number counted by #1).
4. The number of buffers past which a backend has advanced the clocksweep
(i.e. the number of buffers it had to examine in order to allocate the
number of buffers count by #3).
5. The number of buffers allocated from the freelist which the backend did
not use because they'd been touched (what you're calling
buffers_touched_freelist).

All sound reasonable. To avoid wasting time here, I think it's only worth
doing all of these as DEBUG level messages for now. Then only go through
the overhead of exposing the ones that actually seem relevant. That's what
I did for the 8.3 work, and most of that data at this level was barely
relevant to anyone but me then or since. We don't want the system views to
include so much irrelevant trivia that finding the important parts becomes
overwhelming.

I think we expose far too little information in our system views.
Just to take one example, we expose no useful information about lwlock
acquire or release, but a lot of real-world performance problems are
caused by lwlock contention. There are of course difficulties in
exposing huge numbers of counters, but we're not talking about many
here, so I'd lean toward exposing them in the final patch if they seem
at all useful.

I'd like to see that level of instrumentation--just the debug level
messages--used to quantify the benchmarks that people are running already,
to prove they are testing what they think they are. That would have caught
the test error you already stumbled on for example. Simple paranoia says
there may be more issues like that hidden in here somewhere, and this set
you've identified should find any more of them around.

Right.

If all that matches up so the numbers for the new counters seem sane, I
think that's enough to commit the first basic improvement here. Then make a
second pass over exposing the useful internals that let everyone quantify
either the improvement or things that may need to be tunable.

Well, I posted a patch a bit ago that I think is the first basic
improvement - and none of these counters are relevant to that. It
doesn't add a new background process or anything; it just does pretty
much the same thing we do now with less-painful locking. There are no
water marks to worry about, or tunable thresholds, or anything; and
because it's so much simpler, it's far easier to reason about than the
full patch, which is why I feel quite confident pressing on toward a
commit.

Once that is in, I think we should revisit the idea of a bgreclaimer
process, and see how much more that improves things - if at all - on
top of what that basic patch already does. For that we'll need these
counters, and maybe others. But let's make that phase two.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#83

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#81)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Am I understanding you correctly that you also measured context switches
for spinlocks? If so, I don't think that's a valid comparison. LWLocks
explicitly yield the CPU as soon as there's any contention while
spinlocks will, well, spin. Sure they also go to sleep, but it'll take
longer.

No, I measured CPU consumption attributable to s_lock.

(I checked context-switches, too.)

It's also worthwile to remember in such comparisons that lots of the
cost of spinlocks will be in the calling routines, not s_lock() - the
first TAS() is inlined into it. And that's the one that'll incur cache
misses and such...

True. I can check that - I did not.

Note that I'm explicitly *not* doubting the use of a spinlock
itself. Given the short acquiration times and the exclusive use of
exlusive acquiration a spinlock makes more sense. The lwlock's spinlock
alone will have about as much contention.

Right.

I think it might be possible to construct some cases where the spinlock
performs worse than the lwlock. But I think those will be clearly in the
minority. And at least some of those will be fixed by bgwriter. As an
example consider a single backend COPY ... FROM of large files with a
relatively large s_b. That's a seriously bad workload for the current
victim buffer selection because frequently most of the needs to be
searched for a buffer with usagecount 0. And this patch will double the
amount of atomic ops during that.

It will actually be far worse than that, because we'll acquire and
release the spinlock for every buffer over which we advance the clock
sweep, instead of just once for the whole thing. That's reason to
hope that a smart bgreclaimer process may be a good improvement on top
of this. It can do things like advance the clock sweep hand 16
buffers at a time and then sweep them all after-the-fact, reducing
contention on the mutex by an order-of-magnitude, if that turns out to
be an important consideration.

But I think it's right to view that as something we need to test vs.
the baseline established by this patch. What's clear today is that
workloads which stress buffer-eviction fall to pieces, because the
entire buffer-eviction process is essentially single-threaded. One
process can't begin evicting a buffer until another has finished doing
so. This lets multiple backends do that at the same time. We may
find cases where that leads to an unpleasant amount of contention, but
since we have several ideas for how to mitigate that as needs be, I
think it's OK to go ahead. The testing we're doing on the combined
patch is conflating the effects the new locking regimen with however
the bgreclaimer affects things, and it's very clear to me now that we
need to make sure those are clearly separate.

Let me try to quantify that.

Please do.

What's interesting is that I can't see in the perf output any real
sign that the buffer mapping locks are slowing things down, but they
clearly are, because when I take this patch and also boost
NUM_BUFFER_PARTITIONS to 128, the performance goes up:

What did events did you profile?

cs.

I welcome further testing and comments, but my current inclination is
to go ahead and push the attached patch. To my knowledge, nobody has
at any point objected to this aspect of what's being proposed, and it
looks safe to me and seems to be a clear win. We can then deal with
the other issues on their own merits.

I've took a look at this, and all the stuff I saw that I disliked were
there before this patch. So +1 for going ahead.

Cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#84

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#83)

Re: Scaling shared buffer eviction

On 2014-09-23 19:21:10 -0400, Robert Haas wrote:

On Tue, Sep 23, 2014 at 6:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I think it might be possible to construct some cases where the spinlock
performs worse than the lwlock. But I think those will be clearly in the
minority. And at least some of those will be fixed by bgwriter.

Err, this should read bgreclaimer, not bgwriter.

As an
example consider a single backend COPY ... FROM of large files with a
relatively large s_b. That's a seriously bad workload for the current
victim buffer selection because frequently most of the needs to be
searched for a buffer with usagecount 0. And this patch will double the
amount of atomic ops during that.

It will actually be far worse than that, because we'll acquire and
release the spinlock for every buffer over which we advance the clock
sweep, instead of just once for the whole thing.

I said double, because we already acquire the buffer header's spinlock
every tick.

That's reason to hope that a smart bgreclaimer process may be a good
improvement on top of this.

Right. That's what I was trying to say with bgreclaimer above.

But I think it's right to view that as something we need to test vs.
the baseline established by this patch.

Agreed. I think the possible downsides are at the very fringe of already
bad cases. That's why I agreed that you should go ahead.

Let me try to quantify that.

Please do.

I've managed to find a ~1.5% performance regression. But the setup was
plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into
a fillfactor 10 table with the column set to PLAIN storage. With the
resulting table size chosen so it's considerably bigger than s_b, but
smaller than the dirty writeback limit of the kernel.

That's perfectly reasonable.

I can think of a couple other cases, but they're all similarly absurd.

What's interesting is that I can't see in the perf output any real
sign that the buffer mapping locks are slowing things down, but they
clearly are, because when I take this patch and also boost
NUM_BUFFER_PARTITIONS to 128, the performance goes up:

What did events did you profile?

cs.

Ah. My guess is that most of the time will probably actually be spent in
the lwlock's spinlock, not the the lwlock putting itself to sleep.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#85

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#84)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 7:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:

It will actually be far worse than that, because we'll acquire and
release the spinlock for every buffer over which we advance the clock
sweep, instead of just once for the whole thing.

I said double, because we already acquire the buffer header's spinlock
every tick.

Oh, good point.

Let me try to quantify that.

Please do.

I've managed to find a ~1.5% performance regression. But the setup was
plain absurd. COPY ... FROM /tmp/... BINARY; of large bytea datums into
a fillfactor 10 table with the column set to PLAIN storage. With the
resulting table size chosen so it's considerably bigger than s_b, but
smaller than the dirty writeback limit of the kernel.

That's perfectly reasonable.

I can think of a couple other cases, but they're all similarly absurd.

Well, it's not insane to worry about such things, but if you can only
manage 1.5% on such an extreme case, I'm encouraged. This is killing
us on OLTP workloads, and fixing that is a lot more important than a
couple percent on an extreme case.

Ah. My guess is that most of the time will probably actually be spent in
the lwlock's spinlock, not the the lwlock putting itself to sleep.

Ah, OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#86

Gregory Smith

gregsmithpgsql@gmail.com

over 11 years ago

In reply to: Robert Haas (#82)

Re: Scaling shared buffer eviction

On 9/23/14, 7:13 PM, Robert Haas wrote:

I think we expose far too little information in our system views. Just
to take one example, we expose no useful information about lwlock
acquire or release, but a lot of real-world performance problems are
caused by lwlock contention.

I sent over a proposal for what I was calling Performance Events about a
year ago. The idea was to provide a place to save data about lock
contention, weird checkpoint sync events, that sort of thing. Replacing
log parsing to get at log_lock_waits data was my top priority. Once
that's there, lwlocks was an obvious next target. Presumably we just
needed collection to be low enough overhead, and then we can go down to
whatever shorter locks we want; lower the overhead, faster the event we
can measure.

Sometimes the database will never be able to instrument some of its
fastest events without blowing away the event itself. We'll still have
perf / dtrace / systemtap / etc. for those jobs. But those are not the
problems of the average Postgres DBA's typical day.

The data people need to solve this sort of thing in production can't
always show up in counters. You'll get evidence the problem is there,
but you need more details to actually find the culprit. Some info about
the type of lock, tables and processes involved, maybe the query that's
running, that sort of thing. You can kind of half-ass the job if you
make per-tables counter for everything, but we really need more, both to
serve our users and to compare well against what other databases provide
for tools. That's why I was trying to get the infrastructure to capture
all that lock detail, without going through the existing logging system
first.

Actually building Performance Events fell apart on the storage side:
figuring out where to put it all without waiting for a log file to hit
disk. I wanted in-memory storage so clients don't wait for anything,
then a potentially lossy persistence writer. I thought I could get away
with a fixed size buffer like pg_stat_statements uses. That was
optimistic. Trying to do better got me lost in memory management land
without making much progress.

I think the work you've now done on dynamic shared memory gives the
right shape of infrastructure that I could pull this off now. I even
have funding to work on it again, and it's actually the #2 thing I'd
like to take on as I get energy for new feature development. (#1 is the
simple but time consuming job of adding block write counters, the lack
of which which is just killing me on some fast growing installs)

I have a lot of unread messages on this list to sort through right now.
I know I saw someone try to revive the idea of saving new sorts of
performance log data again recently; can't seem to find it again right
now. That didn't seem like it went any farther than thinking about the
specifications though. The last time I jumped right over that and hit a
wall with this one hard part of the implementation instead, low overhead
memory management for saving everything.

--
Greg Smith greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#87

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Robert Haas (#79)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The patch I attached the first time was just the last commit in the
git repository where I wrote the patch, rather than the changes that I
made on top of that commit. So, yes, the results from the previous
message are with the patch attached to the follow-up. I just typed
the wrong git command when attempting to extract that patch to attach
it to the email.

Here are some more results. TL;DR: The patch still looks good, but we
should raise the number of buffer mapping partitions as well.

On the IBM POWER7 machine, I ran read-only pgbench tests with 1
client, 8 clients, and all multiples of 16 up to 96. I ran these
tests at scale factor 1000 and scale factor 3000. I tested four
builds: master as of commit df4077cda2eae3eb4a5cf387da0c1e7616e73204,
that same commit with the number of buffer mapping partitions raised
to 128, that commit with reduce-replacement-locking.patch applied, and
that commit with reduce-replacement-locking.patch applied AND the
number of buffer mapping partitions raised to 128. The results from
each configuration are reported in that order on each of the lines
below; each is the median of three results. shared_buffers=8GB for
all tests.

scale factor 1000
1 8119.907618 8230.853237 8153.515217 8145.045004
8 65457.006762 65826.439701 65851.010116 65703.168020
16 125263.858855 125723.441853 125020.598728 129506.037997
32 176696.288187 182376.232631 178278.917581 186440.340283
48 193251.602743 214243.417591 197958.562641 226782.327868
64 182264.276909 218655.105894 190364.759052 256863.652885
80 171719.210488 203104.673512 179861.241080 274065.020956
96 162525.883898 190960.622943 169759.271356 277820.128782

scale factor 3000
1 7690.357053 7723.925932 7772.207513 7684.079850
8 60789.325087 61688.547446 61485.398967 62546.166411
16 112509.423777 115138.385501 115606.858594 120015.350112
32 147881.211900 161359.994902 153302.501020 173063.463752
48 129748.929652 153986.160920 136164.387103 204935.207578
64 114364.542340 132174.970721 116705.371890 224636.957891
80 101375.265389 117279.931095 102374.794412 232966.076908
96 93144.724830 106676.309224 92787.650325 233862.872939

Analysis:

1. To see the effect of reduce-replacement-locking.patch, compare the
first TPS number in each line to the third, or the second to the
fourth. At scale factor 1000, the patch wins in all of the cases with
32 or more clients and exactly half of the cases with 1, 8, or 16
clients. The variations at low client counts are quite small, and the
patch isn't expected to do much at low concurrency levels, so that's
probably just random variation. At scale factor 3000, the situation
is more complicated. With only 16 bufmappinglocks, the patch gets its
biggest win at 48 clients, and by 96 clients it's actually losing to
unpatched master. But with 128 bufmappinglocks, it wins - often
massively - on everything but the single-client test, which is a small
loss, hopefully within experimental variation.

2. To see the effect of increasing the number of buffer mapping locks
to 128, compare the first TPS number in each line to the second, or
the third to the fourth. Without reduce-replacement-locking.patch,
that's a win at every concurrency level at both scale factors. With
that patch, the 1 and 8 client tests are small losses at scale factor
1000, and the 1 client test is a small loss at scale factor 3000.

The single-client results, which are often a concern for scalability
patches, bear a bit of further comment. In this round of testing,
either patch alone improved things slightly, and both patches together
made them slightly worse. Even if that is reproducible, I don't think
it should be cause for concern, because it tends to indicate (at least
to me) that the shifting around is just the result of slightly
different placement of code across cache lines, or other minor factors
we can't really pin down. So I'm inclined to (a) push
reduce-replacement-locking.patch and then also (b) bump up the number
of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS
accordingly so that pg_buffercache doesn't get unhappy).

Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#88

Merlin Moncure

mmoncure@gmail.com

over 11 years ago

In reply to: Robert Haas (#87)

Re: Scaling shared buffer eviction

On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

1. To see the effect of reduce-replacement-locking.patch, compare the
first TPS number in each line to the third, or the second to the
fourth. At scale factor 1000, the patch wins in all of the cases with
32 or more clients and exactly half of the cases with 1, 8, or 16
clients. The variations at low client counts are quite small, and the
patch isn't expected to do much at low concurrency levels, so that's
probably just random variation. At scale factor 3000, the situation
is more complicated. With only 16 bufmappinglocks, the patch gets its
biggest win at 48 clients, and by 96 clients it's actually losing to
unpatched master. But with 128 bufmappinglocks, it wins - often
massively - on everything but the single-client test, which is a small
loss, hopefully within experimental variation.

Comments?

Why stop at 128 mapping locks? Theoretical downsides to having more
mapping locks have been mentioned a few times but has this ever been
measured? I'm starting to wonder if the # mapping locks should be
dependent on some other value, perhaps the # of shared bufffers...

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#89

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Merlin Moncure (#88)

Re: Scaling shared buffer eviction

On Thu, Sep 25, 2014 at 10:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

1. To see the effect of reduce-replacement-locking.patch, compare the
first TPS number in each line to the third, or the second to the
fourth. At scale factor 1000, the patch wins in all of the cases with
32 or more clients and exactly half of the cases with 1, 8, or 16
clients. The variations at low client counts are quite small, and the
patch isn't expected to do much at low concurrency levels, so that's
probably just random variation. At scale factor 3000, the situation
is more complicated. With only 16 bufmappinglocks, the patch gets its
biggest win at 48 clients, and by 96 clients it's actually losing to
unpatched master. But with 128 bufmappinglocks, it wins - often
massively - on everything but the single-client test, which is a small
loss, hopefully within experimental variation.

Comments?

Why stop at 128 mapping locks? Theoretical downsides to having more
mapping locks have been mentioned a few times but has this ever been
measured? I'm starting to wonder if the # mapping locks should be
dependent on some other value, perhaps the # of shared bufffers...

Good question. My belief is that the number of buffer mapping locks
required to avoid serious contention will be roughly proportional to
the number of hardware threads. At the time the value 16 was chosen,
there were probably not more than 8-core CPUs in common use; but now
we've got a machine with 64 hardware threads and, what do you know but
it wants 128 locks.

I think the long-term solution here is that we need a lock-free hash
table implementation for our buffer mapping tables, because I'm pretty
sure that just cranking the number of locks up and up is going to
start to have unpleasant side effects at some point. We may be able
to buy a few more years by just cranking it up, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#90

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#87)

Re: Scaling shared buffer eviction

On 2014-09-25 09:51:17 -0400, Robert Haas wrote:

On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The patch I attached the first time was just the last commit in the
git repository where I wrote the patch, rather than the changes that I
made on top of that commit. So, yes, the results from the previous
message are with the patch attached to the follow-up. I just typed
the wrong git command when attempting to extract that patch to attach
it to the email.

Here are some more results. TL;DR: The patch still looks good, but we
should raise the number of buffer mapping partitions as well.

So I'm inclined to (a) push
reduce-replacement-locking.patch and then also (b) bump up the number
of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS
accordingly so that pg_buffercache doesn't get unhappy).

I'm happy with that. I don't think it's likely that a moderate increase
in the number of mapping lwlocks will be noticeably bad for any
workload.

One difference is that the total number of lwlock acquirations will be a
bit higher because currently it's more likely for the old and new to
fall into different partitions. But that's not really significant.

The other difference is the number of cachelines touched. Currently, in
concurrent workloads, there's already lots of L1 cache misses around the
buffer mapping locks because they're exclusively owned by a different
core/socket. So, to be effectively worse, the increase would need to
lead to lower overall cache hit rates by them not being in *any* cache
or displacing other content.

That leads me to wonder: Have you measured different, lower, number of
buffer mapping locks? 128 locks is, if we'd as we should align them
properly, 8KB of memory. Common L1 cache sizes are around 32k...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#91

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Merlin Moncure (#88)

Re: Scaling shared buffer eviction

On 2014-09-25 09:02:25 -0500, Merlin Moncure wrote:

On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

1. To see the effect of reduce-replacement-locking.patch, compare the
first TPS number in each line to the third, or the second to the
fourth. At scale factor 1000, the patch wins in all of the cases with
32 or more clients and exactly half of the cases with 1, 8, or 16
clients. The variations at low client counts are quite small, and the
patch isn't expected to do much at low concurrency levels, so that's
probably just random variation. At scale factor 3000, the situation
is more complicated. With only 16 bufmappinglocks, the patch gets its
biggest win at 48 clients, and by 96 clients it's actually losing to
unpatched master. But with 128 bufmappinglocks, it wins - often
massively - on everything but the single-client test, which is a small
loss, hopefully within experimental variation.

Comments?

Why stop at 128 mapping locks? Theoretical downsides to having more
mapping locks have been mentioned a few times but has this ever been
measured? I'm starting to wonder if the # mapping locks should be
dependent on some other value, perhaps the # of shared bufffers...

Wrong way round. You need to prove the upside of increasing it further,
not the contrary. The primary downside is cache hit ratio and displacing
other cache entries...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#92

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#90)

Re: Scaling shared buffer eviction

On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

That leads me to wonder: Have you measured different, lower, number of
buffer mapping locks? 128 locks is, if we'd as we should align them
properly, 8KB of memory. Common L1 cache sizes are around 32k...

Amit has some results upthread showing 64 being good, but not as good
as 128. I haven't verified that myself, but have no reason to doubt
it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#93

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#92)

Re: Scaling shared buffer eviction

On 2014-09-25 10:22:47 -0400, Robert Haas wrote:

On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

That leads me to wonder: Have you measured different, lower, number of
buffer mapping locks? 128 locks is, if we'd as we should align them
properly, 8KB of memory. Common L1 cache sizes are around 32k...

Amit has some results upthread showing 64 being good, but not as good
as 128. I haven't verified that myself, but have no reason to doubt
it.

How about you push the spinlock change and I crosscheck the partition
number on a multi socket x86 machine? Seems worthwile to make sure that
it doesn't cause problems on x86. I seriously doubt it'll, but ...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#94

Merlin Moncure

mmoncure@gmail.com

over 11 years ago

In reply to: Andres Freund (#91)

Re: Scaling shared buffer eviction

On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Why stop at 128 mapping locks? Theoretical downsides to having more
mapping locks have been mentioned a few times but has this ever been
measured? I'm starting to wonder if the # mapping locks should be
dependent on some other value, perhaps the # of shared bufffers...

Wrong way round. You need to prove the upside of increasing it further,
not the contrary. The primary downside is cache hit ratio and displacing
other cache entries...

I can't do that because I don't have the hardware. I wasn't
suggesting to just set it but to measure the affects of setting it.
But the benefits from going from 16 to 128 are pretty significant at
least on this hardware; I'm curious how much further it can be
pushed...what's wrong with trying it out?

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#95

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#89)

Re: Scaling shared buffer eviction

On 2014-09-25 10:09:30 -0400, Robert Haas wrote:

I think the long-term solution here is that we need a lock-free hash
table implementation for our buffer mapping tables, because I'm pretty
sure that just cranking the number of locks up and up is going to
start to have unpleasant side effects at some point. We may be able
to buy a few more years by just cranking it up, though.

I think mid to long term we actually need something else than a
hashtable. Capable of efficiently looking for the existance of
'neighboring' buffers so we can intelligently prefetch far enough that
the read actually completes when we get there. Also I'm pretty sure that
we'll need a way to efficiently remove all buffers for a relfilenode
from shared buffers - linearly scanning for that isn't a good
solution. So I think we need a different data structure.

I've played a bit around with just replacing buf_table.c with a custom
handrolled hashtable because I've seen more than one production workload
where hash_search_with_hash_value() is both cpu and cache miss wise
top#1 of profiles. With most calls coming from the buffer mapping and
then from the lock manager.

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is *far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.

The aforementioned replacement hash table was a good bit faster for
fully cached workloads - but at the time I wrote I could still make it
crash in very high cache pressure workloads, so that should be taken
with a fair bit of salt.

I think we can comparatively easily get rid of the tablespace in buffer
tags. Getting rid of the database already would be a fair bit harder. I
haven't really managed to get an idea how to remove the fork number
without making the catalog much more complicated. I don't think we can
go too long without at least some of these steps :(.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#96

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#93)

Re: Scaling shared buffer eviction

On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-25 10:22:47 -0400, Robert Haas wrote:

On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

That leads me to wonder: Have you measured different, lower, number of
buffer mapping locks? 128 locks is, if we'd as we should align them
properly, 8KB of memory. Common L1 cache sizes are around 32k...

Amit has some results upthread showing 64 being good, but not as good
as 128. I haven't verified that myself, but have no reason to doubt
it.

How about you push the spinlock change and I crosscheck the partition
number on a multi socket x86 machine? Seems worthwile to make sure that
it doesn't cause problems on x86. I seriously doubt it'll, but ...

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#97

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Merlin Moncure (#94)

Re: Scaling shared buffer eviction

On 2014-09-25 09:34:57 -0500, Merlin Moncure wrote:

On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Why stop at 128 mapping locks? Theoretical downsides to having more
mapping locks have been mentioned a few times but has this ever been
measured? I'm starting to wonder if the # mapping locks should be
dependent on some other value, perhaps the # of shared bufffers...

Wrong way round. You need to prove the upside of increasing it further,
not the contrary. The primary downside is cache hit ratio and displacing
other cache entries...

I can't do that because I don't have the hardware.

One interesting part of this is making sure it doesn't regress
older/smaller machines. So at least that side you could check...

what's wrong with trying it out?

If somebody is willing to do it: nothing. I'd just much rather do the,
by now proven, simple change before starting with more complex
solutions.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#98

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#75)

Re: Scaling shared buffer eviction

On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

But this gets at another point: the way we're benchmarking this right
now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today
I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count 16 32 64 128 reduce-replacement-locking.patch +
128 Buf Partitions 157732 229547 271536 245295
scalable_buffer_eviction_v9.patch 163762 230753 275147 248309

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count 16 32 64 128 reduce-replacement-locking.patch +
128 Buf Partitions 157781 212134 202209 171176
scalable_buffer_eviction_v9.patch 160301 213922 208680 172720

The results indicates that in all cases there is benefit by doing
part-2 (bgreclaimer). Though the benefit at this configuration is
not high, but might be at some higher configurations
(scale factor - 10000) there is more benefit. Do you see any merit
in pursuing further to accomplish part-2 as well?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#99

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Andres Freund (#95)

Re: Scaling shared buffer eviction

On 09/25/2014 05:40 PM, Andres Freund wrote:

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is*far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.

Hmm. Is it the comparing, or calculating the hash? We could precalculate
the hash for RelFileNode+ForkNumber, and store it RelationData. At a
lookup, you'd only need to mix in the block number.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#100

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Heikki Linnakangas (#99)

Re: Scaling shared buffer eviction

On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:

On 09/25/2014 05:40 PM, Andres Freund wrote:

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is*far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.

Hmm. Is it the comparing, or calculating the hash?

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

Besides making the hashtable more efficent, a smaller key (say, 4 byte
relfilenode, 4 byte blocknumber) would also make using a radix tree or
similar more realistic. I've prototyped that once and it has nice
properties, but the tree is too deep. Obviousy it'd also help making
buffer descriptors smaller, which is also good from a cache efficiency
perspective...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#101

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Amit Kapila (#98)

Re: Scaling shared buffer eviction

On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today
I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count 16 32 64 128 reduce-replacement-locking.patch
+ 128 Buf Partitions 157732 229547 271536 245295
scalable_buffer_eviction_v9.patch 163762 230753 275147 248309

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count 16 32 64 128 reduce-replacement-locking.patch
+ 128 Buf Partitions 157781 212134 202209 171176
scalable_buffer_eviction_v9.patch 160301 213922 208680 172720

The results indicates that in all cases there is benefit by doing
part-2 (bgreclaimer). Though the benefit at this configuration is
not high, but might be at some higher configurations
(scale factor - 10000) there is more benefit. Do you see any merit
in pursuing further to accomplish part-2 as well?

Interesting results. Thanks for gathering this data.

If this is the best we can do with the bgreclaimer, I think the case for
pursuing it is somewhat marginal. The biggest jump you've got seems to be
at scale factor 3000 with 64 clients, where you picked up about 4%. 4%
isn't nothing, but it's not a lot, either. On the other hand, this might
not be the best we can do. There may be further improvements to
bgreclaimer that make the benefit larger.

Backing up a it, to what extent have we actually solved the problem here?
If we had perfectly removed all of the scalability bottlenecks, what would
we expect to see? You didn't say which machine this testing was done on,
or how many cores it had, but for example on the IBM POWER7 machine, we
probably wouldn't expect the throughput at 64 clients to be 4 times the
throughput at 16 cores because up to 16 clients each one can have a full
CPU core, whereas after that and out to 64 each is getting a hardware
thread, which is not quite as good. Still, we'd expect performance to go
up, or at least not go down. Your data shows a characteristic performance
knee: between 16 and 32 clients we go up, but then between 32 and 64 we go
down, and between 64 and 128 we go down more. You haven't got enough data
points there to show very precisely where the knee is, but unless you
tested this on a smaller box than what you have been using, we're certainly
hitting the knee sometime before we run out of physical cores. That
implies a remaining contention bottleneck.

My results from yesterday were a bit different. I tested 1 client, 8
clients, and multiples of 16 clients out to 96. With
reduce-replacement-locking.patch + 128 buffer mapping partitions,
performance continued to rise all the way out to 96 clients. It definitely
wasn't linearly, but it went up, not down. I don't know why this is
different from what you are seeing. Anyway there's a little more ambiguity
there about how much contention remains, but my bet is that there is at
least some contention that we could still hope to remove. We need to
understand where that contention is. Are the buffer mapping locks still
contended? Is the new spinlock contended? Are there other contention
points? I won't be surprised if it turns out that the contention is on the
new spinlock and that a proper design for bgreclaimer is the best way to
remove that contention .... but I also won't be surprised if it turns out
that there are bigger wins elsewhere. So I think you should try to figure
out where the remaining contention is first, and then we can discuss what
to do about it.

On another point, I think it would be a good idea to rebase the bgreclaimer
patch over what I committed, so that we have a clean patch against master
to test with.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#102

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Andres Freund (#100)

Re: Scaling shared buffer eviction

On 09/26/2014 03:26 PM, Andres Freund wrote:

On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:

On 09/25/2014 05:40 PM, Andres Freund wrote:

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is*far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.

Hmm. Is it the comparing, or calculating the hash?

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

Hmm. We could provide a custom compare function instead of relying on
memcmp. We can do somewhat better than generic memcmo when we know that
the BufferTag is MAXALIGNed (is it? at least it's 4 bytes aligned), and
it's always exactly 20 bytes. I wonder if you're actually just seeing a
cache miss showing up in the profile, though.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#103

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#100)

Re: Scaling shared buffer eviction

On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

Would it be faster to test the relfilenode first, and then test the
rest only if that matches?

One idea for improving this is to get rid of relation forks. Like all
true PostgreSQL patriots, I love the free space map and the visibility
map passionately, and I believe that those features are one of the
biggest contributors to the success of the project over the last
half-decade. But I'd love them even more if they didn't triple our
rate of inode consumption and bloat our buffer tags. More, it's just
not an extensible mechanism: too many things have to loop over all
forks, and it just doesn't scale to keep adding more of them. If we
added a metapage to each heap, we could have the FSM and VM have their
own relfilenode and just have the heap point at them. Or (maybe
better still) we could store the data in the heap itself.

It would be a lot of work, though. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#104

Ants Aasma

ants@cybertec.at

over 11 years ago

In reply to: Andres Freund (#100)

1 attachment(s)

Re: Scaling shared buffer eviction

On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

I'm not arguing against a more efficient hash table, but one simple
win would be to have a buffer tag specific comparison function. I mean
something like the attached patch. This way we avoid the loop counter
overhead, can check the blocknum first and possibly have better branch
prediction.

Do you have a workload where I could test if this helps alleviate the
comparison overhead?

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Attachments:

optimized-buffer-tag-cmp.patchtext/x-patch; charset=US-ASCII; name=optimized-buffer-tag-cmp.patchDownload

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 7a38f2f..ba37b5f 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -34,6 +34,7 @@ typedef struct
 
 static HTAB *SharedBufHash;
 
+static int BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n);
 
 /*
  * Estimate space needed for mapping hashtable
@@ -46,6 +47,19 @@ BufTableShmemSize(int size)
 }
 
 /*
+ * Compare contents of two buffer tags. We use this instead of the default
+ * memcmp to minimize comparison overhead.
+ */
+static int
+BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n)
+{
+	Assert(offsetof(BufferTag, blockNum) == 2*sizeof(uint64));
+	return (a->blockNum == b->blockNum
+			&& ((uint64*)a)[0] == ((uint64*)b)[0]
+			&& ((uint64*)a)[1] == ((uint64*)b)[1]) ? 0 : 1;
+}
+
+/*
  * Initialize shmem hash table for mapping buffers
  *		size is the desired hash table size (possibly more than NBuffers)
  */
@@ -61,6 +75,7 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.hash = tag_hash;
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
+	info.match = BufTableCmpBufferTag;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
 								  size, size,

#105

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Heikki Linnakangas (#102)

Re: Scaling shared buffer eviction

On 2014-09-26 16:47:55 +0300, Heikki Linnakangas wrote:

On 09/26/2014 03:26 PM, Andres Freund wrote:

On 2014-09-26 15:04:54 +0300, Heikki Linnakangas wrote:

On 09/25/2014 05:40 PM, Andres Freund wrote:

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is*far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.

Hmm. Is it the comparing, or calculating the hash?

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

Hmm. We could provide a custom compare function instead of relying on
memcmp. We can do somewhat better than generic memcmo when we know that the
BufferTag is MAXALIGNed (is it? at least it's 4 bytes aligned), and it's
always exactly 20 bytes.

That might give a little benefit. I haven't experimented with that with
dynahash.c. I've compared memcmp() and custom comparison with my own
hashtable and there were some differences, but neglegible. The biggest
was using 64bit compares. Either way, it all ends up being rather branch
heavy with high misprediction rates.

I wonder if you're actually just seeing a cache miss showing up in the
profile, though.

I don't think so. I hacked (by moving it to the end of
RelfileNode/BufferTag and comparing only the front port) the tablespace
out of buffer tags and that produced measurable benefits.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#106

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#101)

Re: Scaling shared buffer eviction

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 26, 2014 at 7:40 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today
I have tried some tests, the results of which are as follows:

Scale Factor - 3000, Shared_buffer - 8GB

Patch_Ver/Client_Count 16 32 64 128 reduce-replacement-locking.patch
+ 128 Buf Partitions 157732 229547 271536 245295
scalable_buffer_eviction_v9.patch 163762 230753 275147 248309

Scale Factor - 3000, Shared_buffer - 8GB

Typo here Scale Factor - 3000, Shared_buffer - *2*GB

Patch_Ver/Client_Count 16 32 64 128 reduce-replacement-locking.patch

+ 128 Buf Partitions 157781 212134 202209 171176
scalable_buffer_eviction_v9.patch 160301 213922 208680 172720

The results indicates that in all cases there is benefit by doing
part-2 (bgreclaimer). Though the benefit at this configuration is
not high, but might be at some higher configurations
(scale factor - 10000) there is more benefit. Do you see any merit
in pursuing further to accomplish part-2 as well?

Interesting results. Thanks for gathering this data.

One more point I have missed is that above data is using
"-M prepared" option of pgbench.

If this is the best we can do with the bgreclaimer, I think the case for
pursuing it is somewhat marginal. The biggest jump you've got seems to be
at scale factor 3000 with 64 clients, where you picked up about 4%. 4%
isn't nothing, but it's not a lot, either. On the other hand, this might
not be the best we can do. There may be further improvements to
bgreclaimer that make the benefit larger.

Backing up a it, to what extent have we actually solved the problem here?
If we had perfectly removed all of the scalability bottlenecks, what would
we expect to see? You didn't say which machine this testing was done on

It was IBM POWER7
Sorry, I should have mentioned it.

, or how many cores it had, but for example on the IBM POWER7 machine, we
probably wouldn't expect the throughput at 64 clients to be 4 times the
throughput at 16 cores because up to 16 clients each one can have a full
CPU core, whereas after that and out to 64 each is getting a hardware
thread, which is not quite as good. Still, we'd expect performance to go
up, or at least not go down. Your data shows a characteristic performance
knee: between 16 and 32 clients we go up, but then between 32 and 64 we go
down,

Here another point worth noting is that it goes down between
32 and 64 when shared_buffers is 2GB, however when
shared_buffers are 8GB it doesn't go down between 32 and 64.

and between 64 and 128 we go down more. You haven't got enough data
points there to show very precisely where the knee is, but unless you
tested this on a smaller box than what you have been using, we're certainly
hitting the knee sometime before we run out of physical cores. That
implies a remaining contention bottleneck.

My results from yesterday were a bit different. I tested 1 client, 8
clients, and multiples of 16 clients out to 96. With
reduce-replacement-locking.patch + 128 buffer mapping partitions,
performance continued to rise all the way out to 96 clients. It definitely
wasn't linearly, but it went up, not down. I don't know why this is
different from what you are seeing.

I think it is almost same if we consider same configuration
(scale_factor - 3000, shared_buffer - 8GB).

Anyway there's a little more ambiguity there about how much contention
remains, but my bet is that there is at least some contention that we could
still hope to remove. We need to understand where that contention is. Are
the buffer mapping locks still contended? Is the new spinlock contended?
Are there other contention points? I won't be surprised if it turns out
that the contention is on the new spinlock and that a proper design for
bgreclaimer is the best way to remove that contention .... but I also won't
be surprised if it turns out that there are bigger wins elsewhere. So I
think you should try to figure out where the remaining contention is first,
and then we can discuss what to do about it.

Make sense.

On another point, I think it would be a good idea to rebase the
bgreclaimer patch over what I committed, so that we have a clean patch
against master to test with.

I think this also makes sense, however I think it is better to first
see where is the bottleneck.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#107

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#103)

Re: Scaling shared buffer eviction

On 2014-09-26 09:59:41 -0400, Robert Haas wrote:

On Fri, Sep 26, 2014 at 8:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

Would it be faster to test the relfilenode first, and then test the
rest only if that matches?

I tried that and I couldn't see that much benefit. Check my message to
Heikki.

One idea for improving this is to get rid of relation forks.

I think that's the hardest end to start from. A cool goal, but
hard. Getting rid of the tablespace sound comparatively simple. And even
getting rid of the database in the buffer tag seems to be simpler
although already pretty hard.

Like all
true PostgreSQL patriots, I love the free space map and the visibility
map passionately, and I believe that those features are one of the
biggest contributors to the success of the project over the last
half-decade. But I'd love them even more if they didn't triple our
rate of inode consumption and bloat our buffer tags. More, it's just
not an extensible mechanism: too many things have to loop over all
forks, and it just doesn't scale to keep adding more of them. If we
added a metapage to each heap, we could have the FSM and VM have their
own relfilenode and just have the heap point at them. Or (maybe
better still) we could store the data in the heap itself.

It would be a lot of work, though. :-(

Yea, it's really hard. And nearly impossible to do without breaking
binary compatibility.

What I've been wondering about was to give individual forks their own
relfilenodes and manage them via columns in pg_class. But that's also a
heck of a lot of work and gets complicated for unlogged relations
because we need to access those during recovery when we don't have
catalog access yet and can't access all databases anyway.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#108

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Ants Aasma (#104)

Re: Scaling shared buffer eviction

On 2014-09-26 17:01:52 +0300, Ants Aasma wrote:

On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Neither, really. The hash calculation is visible in the profile, but not
that pronounced yet. The primary thing noticeable in profiles (besides
cache efficiency) is the comparison of the full tag after locating a
possible match in a bucket. 20 byte memcmp's aren't free.

I'm not arguing against a more efficient hash table, but one simple
win would be to have a buffer tag specific comparison function. I mean
something like the attached patch. This way we avoid the loop counter
overhead, can check the blocknum first and possibly have better branch
prediction.

Heh. Yea. As I wrote to Heikki, 64bit compares were the thing showing
most benefits - at least with my own hashtable implementation.

Do you have a workload where I could test if this helps alleviate the
comparison overhead?

Fully cached readonly pgbench workloads with -M prepared already show
it. But it gets more pronounced for workload that access buffers at a
higher frequency.

At two customers I've seen this really badly in the profile because they
have OLTP statements that some index nested loops. Often looking the
same buffer up *over and over*. There often isn't a better plan (with
current pg join support at least) for joining a somewhat small number of
rows out of a large table to small/mid sized table.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#109

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Andres Freund (#108)

Re: Scaling shared buffer eviction

Part of this patch was already committed, and the overall patch has had
its fair share of review for this commitfest, so I'm marking this as
"Returned with feedback". The benchmark results for the bgreclaimer
showed a fairly small improvement, so it doesn't seem like anyone's
going to commit the rest of the patch soon, not without more discussion
and testing anyway. Let's not hold up the commitfest for that.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#110

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#77)

Re: Scaling shared buffer eviction

On 2014-09-25 16:50:44 +0200, Andres Freund wrote:

On 2014-09-25 10:44:40 -0400, Robert Haas wrote:

On Thu, Sep 25, 2014 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-25 10:22:47 -0400, Robert Haas wrote:

On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

That leads me to wonder: Have you measured different, lower, number of
buffer mapping locks? 128 locks is, if we'd as we should align them
properly, 8KB of memory. Common L1 cache sizes are around 32k...

Amit has some results upthread showing 64 being good, but not as good
as 128. I haven't verified that myself, but have no reason to doubt
it.

How about you push the spinlock change and I crosscheck the partition
number on a multi socket x86 machine? Seems worthwile to make sure that
it doesn't cause problems on x86. I seriously doubt it'll, but ...

OK.

Another thought is that we should test what impact your atomics-based
lwlocks have on this.

Yes, I'd planned to test that as well. I think that it will noticeably
reduce the need to increase the number of partitions for workloads that
fit into shared_buffers. But it won't do much about exclusive
acquirations of the buffer mapping locks. So I think there's independent
benefit of increasing the number.

Here we go.

Postgres was configured with.
-c shared_buffers=8GB \
-c log_line_prefix="[%m %p] " \
-c log_min_messages=debug1 \
-p 5440 \
-c checkpoint_segments=600
-c max_connections=200

Each individual measurement (#TPS) is the result of a
pgbench -h /tmp/ -p 5440 postgres -n -M prepared -c $clients -j $clients -S -T 10
run.

Master is as of ef8863844bb0b0dab7b92c5f278302a42b4bf05a.

First, a scale 200 run. That fits entirely into shared_buffers:

#scale #client #partitions #TPS
200 1 16 8353.547724 8145.296655 8263.295459
200 16 16 171014.763118 193971.091518 133992.128348
200 32 16 259119.988034 234619.421322 201879.618322
200 64 16 178909.038670 179425.091562 181391.354613
200 96 16 141402.895201 138392.705402 137216.416951
200 128 16 125643.089677 124465.288860 122527.209125

(other runs here stricken, they were contorted due some concurrent
activity. But nothing interesting).

So, there's quite some variation in here. Not very surprising given the
short runtimes, but still.

Looking at a profile nearly all the contention is around
GetSnapshotData(). That might hide the interesting scalability effects
of the partition number. So I next tried my rwlock-contention branch.

#scale #client #partitions #TPS
200 1 1 8540.390223 8285.628397 8497.022656
200 16 1 136875.484896 164302.769380 172053.413980
200 32 1 308624.650724 240502.019046 260825.231470
200 64 1 453004.188676 406226.943046 406973.325822
200 96 1 442608.459701 450185.431848 445549.710907
200 128 1 487138.077973 496233.594356 457877.992783

200 1 16 9477.217454 8181.098317 8457.276961
200 16 16 154224.573476 170238.637315 182941.035416
200 32 16 302230.215403 285124.708236 265917.729628
200 64 16 405151.647136 443473.797835 456072.782722
200 96 16 443360.377281 457164.981119 474049.685940
200 128 16 490616.257063 458273.380238 466429.948417

200 1 64 8410.981874 11554.708966 8359.294710
200 16 64 139378.312883 168398.919590 166184.744944
200 32 64 288657.701012 283588.901083 302241.706222
200 64 64 424838.919754 416926.779367 436848.292520
200 96 64 462352.017671 446384.114441 483332.592663
200 128 64 471578.594596 488862.395621 466692.726385

200 1 128 8350.274549 8140.699687 8305.975703
200 16 128 144553.966808 154711.927715 202437.837908
200 32 128 290193.349170 213242.292597 261016.779185
200 64 128 413792.389493 431267.716855 456587.450294
200 96 128 490459.212833 456375.442210 496430.996055
200 128 128 470067.179360 464513.801884 483485.000502

Not much there either.

So, on to the next scale, 1000. That doesn't fit into s_b anymore.

master:
#scale #client #partitions #TPS
1000 1 1 7378.370717 7110.988121 7164.977746
1000 16 1 66439.037413 85151.814130 85047.296626
1000 32 1 71505.487093 75687.291060 69803.895496
1000 64 1 42148.071099 41934.631603 43253.528849
1000 96 1 33760.812746 33969.800564 33598.640121
1000 128 1 30382.414165 30047.284982 30144.576494

1000 1 16 7228.883843 9479.793813 7217.657145
1000 16 16 105203.710528 112375.187471 110919.986283
1000 32 16 146294.286762 145391.938025 144620.709764
1000 64 16 134411.772164 134536.943367 136196.793573
1000 96 16 107626.878208 105289.783922 96480.468107
1000 128 16 92597.909379 86128.040557 92417.727720

1000 1 64 7130.392436 12801.641683 7019.999330
1000 16 64 120180.196384 125319.373819 126137.930478
1000 32 64 181876.697461 190578.106760 189412.973015
1000 64 64 216233.590299 222561.774501 225802.194056
1000 96 64 171928.358031 165922.395721 168283.712990
1000 128 64 139303.139631 137564.877450 141534.449640

1000 1 128 8215.702354 7209.520152 7026.888706
1000 16 128 116196.740200 123018.284948 127045.761518
1000 32 128 183391.488566 185428.757458 185732.926794
1000 64 128 218547.133675 218096.002473 208679.436158
1000 96 128 155209.830821 156327.200412 157542.582637
1000 128 128 131127.769076 132084.933955 124706.336737

rwlock:
#scale #client #partitions #TPS
1000 1 1 7377.270393 7494.260136 7207.898866
1000 16 1 79289.755569 88032.480145 86810.772569
1000 32 1 83006.336151 88961.964680 88508.832253
1000 64 1 44135.036648 46582.727314 45119.421278
1000 96 1 35036.174438 35687.025568 35469.127697
1000 128 1 30597.870830 30782.335225 30342.454439

1000 1 16 7114.602838 7265.863826 7205.225737
1000 16 16 128507.292054 131868.678603 124507.097065
1000 32 16 212779.122153 185666.608338 210714.373254
1000 64 16 239776.079534 239923.393293 242476.922423
1000 96 16 169240.934839 166021.430680 169187.643644
1000 128 16 136601.409985 139340.961857 141731.068752

1000 1 64 13271.722885 11348.028311 12531.188689
1000 16 64 129074.053482 125334.720264 125140.499619
1000 32 64 198405.463848 196605.923684 198354.818005
1000 64 64 250463.474112 249543.622897 251517.159399
1000 96 64 251715.751133 254168.028451 251502.783058
1000 128 64 243596.368933 234671.592026 239123.259642

1000 1 128 7376.371403 7301.077478 7240.526379
1000 16 128 127992.070372 133537.637394 123382.418747
1000 32 128 185807.703422 194303.674428 184919.586634
1000 64 128 270233.496350 271576.483715 262281.662510
1000 96 128 266023.529574 272484.352878 271921.597420
1000 128 128 260004.301457 266710.469926 263713.245868

Based on this I think we can fairly conclude that increasing the number
of partitions is quite the win on larger x86 machines too. Independent
of the rwlock patch, although it moves the contention points to some
degree.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Import Notes

Reply to msg id not found: 20140925145044.GH9633@alap3.anarazel.de

#111

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Andres Freund (#110)

Re: Scaling shared buffer eviction

On 2014-10-01 20:54:39 +0200, Andres Freund wrote:

Here we go.

Postgres was configured with.
-c shared_buffers=8GB \
-c log_line_prefix="[%m %p] " \
-c log_min_messages=debug1 \
-p 5440 \
-c checkpoint_segments=600
-c max_connections=200

Robert reminded me that I missed to report the hardware aspect of
this...

4x E5-4620 @ 2.20GHz: 32 cores, 64 threads
256GB RAM

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#112

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#96)

Re: Scaling shared buffer eviction

On 2014-09-25 10:42:29 -0400, Robert Haas wrote:

On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-25 10:22:47 -0400, Robert Haas wrote:

On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:

That leads me to wonder: Have you measured different, lower, number of
buffer mapping locks? 128 locks is, if we'd as we should align them
properly, 8KB of memory. Common L1 cache sizes are around 32k...

Amit has some results upthread showing 64 being good, but not as good
as 128. I haven't verified that myself, but have no reason to doubt
it.

How about you push the spinlock change and I crosscheck the partition
number on a multi socket x86 machine? Seems worthwile to make sure that
it doesn't cause problems on x86. I seriously doubt it'll, but ...

OK.

Given that the results look good, do you plan to push this?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#113

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#112)

Re: Scaling shared buffer eviction

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:

OK.

Given that the results look good, do you plan to push this?

By "this", you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

If so, and if you don't have any reservations, yeah I'll go change it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#114

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#113)

Re: Scaling shared buffer eviction

On 2014-10-02 10:40:30 -0400, Robert Haas wrote:

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:

OK.

Given that the results look good, do you plan to push this?

By "this", you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like
#define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64)
or something like that?

If so, and if you don't have any reservations, yeah I'll go change it.

Yes, I'm happy with going forward.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#115

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#114)

Re: Scaling shared buffer eviction

On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-10-02 10:40:30 -0400, Robert Haas wrote:

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:

OK.

Given that the results look good, do you plan to push this?

By "this", you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like
#define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64)
or something like that?

Nah. That assumes NUM_BUFFER_PARTITIONS will always be the biggest
thing, and I don't see any reason to assume that, even if we're making
it true for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#116

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Robert Haas (#115)

Re: Scaling shared buffer eviction

On 2014-10-02 10:56:05 -0400, Robert Haas wrote:

On Thu, Oct 2, 2014 at 10:44 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-10-02 10:40:30 -0400, Robert Haas wrote:

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:

OK.

Given that the results look good, do you plan to push this?

By "this", you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Yes. Now that I think about it I wonder if we shouldn't define MAX_SIMUL_LWLOCKS like
#define MAX_SIMUL_LWLOCKS (NUM_BUFFER_PARTITIONS + 64)
or something like that?

Nah. That assumes NUM_BUFFER_PARTITIONS will always be the biggest
thing, and I don't see any reason to assume that, even if we're making
it true for now.

The reason I'm suggesting is that NUM_BUFFER_PARTITIONS (and
NUM_LOCK_PARTITIONS) are the cases where we can expect many lwlocks to
be held at the same time. It doesn't seem friendly to users
experimenting with changing this to know about a define that's private
to lwlock.c.
But I'm fine with doing this not at all or separately - there's no need
to actually do it together.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#117

Heikki Linnakangas

hlinnakangas@vmware.com

over 11 years ago

In reply to: Robert Haas (#113)

Re: Scaling shared buffer eviction

On 10/02/2014 05:40 PM, Robert Haas wrote:

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:

OK.

Given that the results look good, do you plan to push this?

By "this", you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Hmm, do we actually ever need to hold all the buffer partition locks at
the same time? At a quick search for NUM_BUFFER_PARTITIONS in the code,
I couldn't find any place where we'd do that. I bumped up
NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did
"make check". It passed.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#118

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Heikki Linnakangas (#117)

Re: Scaling shared buffer eviction

On 2014-10-02 20:04:58 +0300, Heikki Linnakangas wrote:

On 10/02/2014 05:40 PM, Robert Haas wrote:

On Thu, Oct 2, 2014 at 10:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:

OK.

Given that the results look good, do you plan to push this?

By "this", you mean the increase in the number of buffer mapping
partitions to 128, and a corresponding increase in MAX_SIMUL_LWLOCKS?

Hmm, do we actually ever need to hold all the buffer partition locks at the
same time? At a quick search for NUM_BUFFER_PARTITIONS in the code, I
couldn't find any place where we'd do that. I bumped up
NUM_BUFFER_PARTITIONS to 128, but left MAX_SIMUL_LWLOCKS at 100, and did
"make check". It passed.

Do a make check-world and it'll hopefully fail ;). Check
pg_buffercache_pages.c.

I'd actually quite like to have a pg_buffercache version that, at least
optionally, doesn't do this, but that's a separate thing.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#119

Robert Haas

robertmhaas@gmail.com

over 11 years ago

In reply to: Andres Freund (#118)

Re: Scaling shared buffer eviction

On Thu, Oct 2, 2014 at 1:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Do a make check-world and it'll hopefully fail ;). Check
pg_buffercache_pages.c.

Yep. Committed, with an update to the comments in lwlock.c to allude
to the pg_buffercache issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#120

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Robert Haas (#101)

1 attachment(s)

Re: Scaling shared buffer eviction

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On another point, I think it would be a good idea to rebase the
bgreclaimer patch over what I committed, so that we have a
clean patch against master to test with.

Please find the rebased patch attached with this mail. I have taken
some performance data as well and done some analysis based on
the same.

Performance Data
----------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
max_connections =300
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 5000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

Below data is median of 3 runs.

patch_ver/client_count 1 8 32 64 128 256 HEAD 18884 118628 251093 216294
186625 177505 PATCH 18743 122578 247243 205521 179712 175031

Here we can see that the performance dips at higher client
count(>=32) which was quite surprising for me, as I was expecting
it to improve, because bgreclaimer reduces the contention by making
buffers available on free list. So I tried to analyze the situation by
using perf and found that in above configuration, there is a contention
around freelist spinlock with HEAD and the same is removed by Patch,
but still the performance goes down with Patch. On further analysis, I
observed that actually after Patch there is an increase in contention
around ProcArrayLock (shared LWlock) via GetSnapshotData which
sounds bit odd, but that's what I can see in profiles. Based on analysis,
few ideas which I would like to further investigate are:
a. As there is an increase in spinlock contention, I would like to check
with Andres's latest patch which reduces contention around shared
lwlocks.
b. Reduce some instructions added by patch in StrategyGetBuffer(),
like instead of awakening bgreclaimer at low threshold, awaken when
it tries to do clock sweep.

Thoughts?

Below is the profile data for 64 and 128 client count:

Head - 64 client count
-------------------------------------
+   8.93%         postgres  postgres                      [.] s_lock

    7.83%          swapper  [unknown]                     [H]
0x00000000011bc5ac
+   3.09%         postgres  postgres                      [.]
GetSnapshotData
+   3.06%         postgres  postgres                      [.] tas

+ 2.49% postgres postgres [.] AllocSetAlloc

+   2.43%         postgres  postgres                      [.]
hash_search_with_hash_value
+   2.13%         postgres  postgres                      [.] _bt_compare

Detailed Data
------------------------
- 8.93% postgres postgres [.] s_lock

- s_lock

- 4.97% s_lock

- 1.63% StrategyGetBuffer

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- ReadBuffer

- 1.63% ReleaseAndReadBuffer

- 0.93% index_fetch_heap

- index_getnext

- 0.93% IndexNext

ExecScanFetch

- 0.69% _bt_relandgetbuf

_bt_search

_bt_first

btgettuple

- index_getnext

- 0.69% IndexNext

ExecScanFetch

- 1.39% LWLockAcquireCommon

- LWLockAcquire

- 1.38% GetSnapshotData

- GetTransactionSnapshot

- 0.70% exec_bind_message

- 0.68% PortalStart

exec_bind_message

- 1.37% LWLockRelease

- 1.37% GetSnapshotData

- GetTransactionSnapshot

- 0.69% exec_bind_message

- 0.68% PortalStart

exec_bind_message

PostgresMain

- 1.07% StrategyGetBuffer

- 1.06% StrategyGetBuffer

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- ReadBuffer

- 1.06% ReleaseAndReadBuffer

- 0.62% index_fetch_heap

index_getnext

- 0.95% LWLockAcquireCommon

- LWLockAcquire

- 0.90% GetSnapshotData

GetTransactionSnapshot

- 0.94% LWLockRelease

- 0.90% GetSnapshotData

GetTransactionSnapshot

7.83% swapper [unknown] [H]
0x00000000011bc5ac
- 3.09% postgres postgres [.]
GetSnapshotData
- GetSnapshotData

- 3.06% GetSnapshotData

- 3.06% GetTransactionSnapshot

- 1.54% PortalStart

exec_bind_message

move_buffers_to_freelist_by_bgreclaimer_v1 - 64 Client count
----------------------------------------------------------------------------------------------

+  11.52%         postgres  postgres             [.] s_lock
    7.57%          swapper  [unknown]            [H] 0x00000000011d9034
+   3.54%         postgres  postgres             [.] tas
+   3.02%         postgres  postgres             [.] GetSnapshotData
+   2.47%         postgres  postgres             [.]
hash_search_with_hash_value
+   2.33%         postgres  postgres             [.] AllocSetAlloc
+   2.03%         postgres  postgres             [.] _bt_compare
+   1.89%         postgres  postgres             [.] calc_bucket

Detailed Data
---------------------

- 11.52% postgres postgres [.] s_lock

- s_lock

- 6.57% s_lock

- 2.72% LWLockAcquireCommon

- LWLockAcquire

- 2.71% GetSnapshotData

- GetTransactionSnapshot

- 1.38% exec_bind_message

- 1.33% PortalStart

exec_bind_message

- 2.69% LWLockRelease

- 2.69% GetSnapshotData

- GetTransactionSnapshot

- 1.35% exec_bind_message

PostgresMain

- 1.34% PortalStart

exec_bind_message

- 1.65% LWLockAcquireCommon

- LWLockAcquire

- 1.59% GetSnapshotData

- GetTransactionSnapshot

- 0.80% exec_bind_message

PostgresMain

- 0.79% PortalStart

exec_bind_message

- 0.79% PortalStart

exec_bind_message

- 1.62% LWLockRelease

- 1.58% GetSnapshotData

- GetTransactionSnapshot

- 0.79% exec_bind_message

PostgresMain

- 0.79% PortalStart

exec_bind_message

PostgresMain

- 0.63% hash_search_with_hash_value

BufTableDelete

BufferAlloc

- 0.59% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

Head - 128 Client count
---------------------------------------
+  18.39%         postgres  postgres            [.] s_lock

6.72% swapper [unknown] [H] 0x00000000011bc390

+ 3.37% postgres postgres [.] GetSnapshotData

+ 2.11% postgres postgres [.] tas

+ 2.05% postgres postgres [.] tas

+   1.82%         postgres  postgres            [.]
hash_search_with_hash_value
+   1.77%         postgres  postgres            [.] AllocSetAlloc

1.52% postgres [unknown] [H] 0x00000000012fdc00

+ 1.45% postgres postgres [.] tas

+ 1.42% postgres postgres [.] _bt_compare

- 18.39% postgres postgres [.] s_lock

- s_lock

- 12.35% s_lock

- 7.52% StrategyGetBuffer

BufferAlloc

ReadBuffer_common

- 1.86% LWLockAcquireCommon

- LWLockAcquire

- 1.83% GetSnapshotData

- GetTransactionSnapshot

- 0.95% exec_bind_message

- 0.88% PortalStart

exec_bind_message

- 1.78% LWLockRelease

- 1.76% GetSnapshotData

- GetTransactionSnapshot

- 0.91% exec_bind_message

- 0.86% PortalStart

exec_bind_message

- 0.60% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

- 0.58% hash_search_with_hash_value

- 0.58% BufTableDelete

BufferAlloc

- 3.18% StrategyGetBuffer

BufferAlloc

- 0.88% LWLockAcquireCommon

- 0.87% LWLockAcquireCommon

- LWLockAcquire

- 0.81% GetSnapshotData

GetTransactionSnapshot

- 0.84% LWLockRelease

- 0.83% LWLockRelease

- 0.79% GetSnapshotData

GetTransactionSnapshot

- 0.55% hash_search_with_hash_value

BufTableDelete

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- ReadBuffer

0.55% ReleaseAndReadBuffer

- 0.54% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

ReadBuffer_common

ReadBufferExtended

- ReadBuffer

0.54% ReleaseAndReadBuffer

move_buffers_to_freelist_by_bgreclaimer_v1 - 128 Client count
----------------------------------------------------------------------------------------------

+ 13.64% postgres postgres [.] s_lock

8.19% swapper [unknown] [H] 0x0000000000000c04

+ 3.62% postgres postgres [.] GetSnapshotData

+ 2.40% postgres postgres [.] calc_bucket

+ 2.38% postgres postgres [.] tas

+ 2.38% postgres postgres [.]
hash_search_with_hash_value
2.02% postgres [unknown] [H] 0x0000000000000f80

+ 1.73% postgres postgres [.] AllocSetAlloc

+ 1.68% postgres postgres [.] tas

Detailed Data
-----------------------
- 13.64% postgres postgres [.] s_lock
- s_lock
- 8.76% s_lock
- 3.03% LWLockAcquireCommon
- LWLockAcquire
- 2.97% GetSnapshotData
- GetTransactionSnapshot
- 1.55% exec_bind_message
0
- 1.42% PortalStart
exec_bind_message
- 2.87% LWLockRelease

- 2.82% GetSnapshotData

- GetTransactionSnapshot

- 1.46% exec_bind_message

- 1.36% PortalStart

exec_bind_message

- 1.35% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

- 1.29% hash_search_with_hash_value

- 1.29% BufTableDelete

BufferAlloc

- 1.19% LWLockAcquireCommon

- LWLockAcquire

- 1.11% GetSnapshotData

- GetTransactionSnapshot

- 0.56% exec_bind_message

PostgresMain

- 0.55% PortalStart

exec_bind_message

- 1.15% LWLockRelease

- 1.08% GetSnapshotData

- GetTransactionSnapshot

- 0.55% exec_bind_message

- 0.53% PortalStart

exec_bind_message

- 1.12% hash_search_with_hash_value

BufTableDelete

BufferAlloc

- 1.10% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

move_buffers_to_freelist_by_bgreclaimer_v1.patchapplication/octet-stream; name=move_buffers_to_freelist_by_bgreclaimer_v1.patchDownload

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 354165b..fb67cc9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -815,6 +815,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry>Number of buffers allocated</entry>
      </row>
      <row>
+      <entry><structfield>buffers_backend_clocksweep</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of buffer allocations that are not satisfied from
+      freelist</entry>
+     </row>
+     <row>
       <entry><structfield>stats_reset</></entry>
       <entry><type>timestamp with time zone</type></entry>
       <entry>Time at which these statistics were last reset</entry>
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..38698b0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -27,6 +27,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
+#include "postmaster/bgreclaimer.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -179,7 +180,8 @@ static IndexList *ILHead = NULL;
  *	 AuxiliaryProcessMain
  *
  *	 The main entry point for auxiliary processes, such as the bgwriter,
- *	 walwriter, walreceiver, bootstrapper and the shared memory checker code.
+ *	 walwriter, walreceiver, bgreclaimer, bootstrapper and the shared
+ *	 memory checker code.
  *
  *	 This code is here just because of historical reasons.
  */
@@ -323,6 +325,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case WalReceiverProcess:
 				statmsg = "wal receiver process";
 				break;
+			case BgReclaimerProcess:
+				statmsg = "reclaimer process";
+				break;
 			default:
 				statmsg = "??? process";
 				break;
@@ -437,6 +442,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			WalReceiverMain();
 			proc_exit(1);		/* should never return */
 
+		case BgReclaimerProcess:
+			/* don't set signals, bgreclaimer has its own agenda */
+			BackgroundReclaimerMain();
+			proc_exit(1);		/* should never return */
+
 		default:
 			elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
 			proc_exit(1);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a819952..5b3d46b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -744,6 +744,7 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_written_backend() AS buffers_backend,
         pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
         pg_stat_get_buf_alloc() AS buffers_alloc,
+        pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_user_mappings AS
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..168d0d8 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-	pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o bgreclaimer.o bgworker.o bgwriter.o checkpointer.o \
+	fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+	walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..1c64900
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c
@@ -0,0 +1,306 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they only need to do if they don't find the next candidate buffer from
+ * the freelist).  In the best scenario all requests for shared buffers will
+ * be fulfilled from freelist as the background reclaimer process always tries
+ * to maintain buffers on freelist.  However, regular backends are still
+ * empowered to run clock sweep to find a usable buffer if the bgreclaimer
+ * fails to maintain enough buffers on freelist.
+ *
+ * The bgreclaimer is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGTERM, which instructs the bgreclaimer to exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the bgreclaimer will
+ * simply abort and exit on SIGQUIT.
+ *
+ * If the bgreclaimer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/bgreclaimer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "postmaster/bgreclaimer.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+
+static void ReclaimQuickDieHandler(SIGNAL_ARGS);
+static void ReclaimSigHupHandler(SIGNAL_ARGS);
+static void ReclaimShutdownHandler(SIGNAL_ARGS);
+static void ReclaimSigUsr1Handler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for bgreclaim process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+BackgroundReclaimerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext bgreclaim_context;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.  (bgreclaim probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us.
+	 *
+	 * bgreclaim doesn't participate in ProcSignal signalling, but a SIGUSR1
+	 * handler is still needed for latch wakeups.
+	 */
+	pqsignal(SIGHUP, ReclaimSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, ReclaimShutdownHandler);		/* shutdown */
+	pqsignal(SIGQUIT, ReclaimQuickDieHandler);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, ReclaimSigUsr1Handler);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  As of now, the memory allocation can be done
+	 * only during processing of SIGHUP signal.
+	 */
+	bgreclaim_context = AllocSetContextCreate(TopMemoryContext,
+											 "Background Reclaim",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(bgreclaim_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().  We don't have very many resources to worry
+		 * about in bgreclaim, but we do have buffers and file descriptors.
+		 * Currently we don't use LWLocks in bgreclaimer, however it can be
+		 * added in future in bgreclaimer or in config processing path and there
+		 * is no saving from not doing so.
+		 */
+		LWLockReleaseAll();
+		UnlockBuffers();
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(bgreclaim_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+
+		/*
+		 * Sleep at least 1 second after any error.  We don't want to be
+		 * filling the error logs as fast as we can.
+		 */
+		pg_usleep(1000000L);
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	StrategyInitBgReclaimerLatch(&MyProc->procLatch);
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		int			rc;
+
+		/* Clear any already-pending wakeups */
+		ResetLatch(&MyProc->procLatch);
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Normal exit from the bgreclaimer is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Backend will signal bgreclaimer when the number of buffers in
+		 * freelist falls below than low water mark of freelist.
+		 */
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_POSTMASTER_DEATH,
+					   -1);
+
+		if (rc & WL_LATCH_SET)
+			BgMoveBuffersToFreelist();
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (rc & WL_POSTMASTER_DEATH)
+			exit(1);
+	}
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * ReclaimQuickDieHandler() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+ReclaimQuickDieHandler(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+ReclaimSigHupHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_SIGHUP = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGTERM: set flag to shutdown and exit */
+static void
+ReclaimShutdownHandler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	shutdown_requested = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+
+	errno = save_errno;
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+ReclaimSigUsr1Handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	latch_sigusr1_handler();
+
+	errno = save_errno;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7f41a5..7475e5a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5021,6 +5021,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 	globalStats.buf_written_backend += msg->m_buf_written_backend;
 	globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
 	globalStats.buf_alloc += msg->m_buf_alloc;
+	globalStats.buf_backend_clocksweep += msg->m_buf_backend_clocksweep;
 }
 
 /* ----------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6220a8e..c130fa2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -143,13 +143,13 @@
  * authorization phase).  This is used mainly to keep track of how many
  * children we have and send them appropriate signals when necessary.
  *
- * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list.  Autovacuum worker and walsender are in it.
- * Also, "dead_end" children are in it: these are children launched just for
- * the purpose of sending a friendly rejection message to a would-be client.
- * We must track them because they are attached to shared memory, but we know
- * they will never become live backends.  dead_end children are not assigned a
- * PMChildSlot.
+ * "Special" children such as the startup, bgwriter, bgreclaimer and
+ * autovacuum launcher tasks are not in this list.  Autovacuum worker and
+ * walsender are in it.  Also, "dead_end" children are in it: these are
+ * children launched just for the purpose of sending a friendly rejection
+ * message to a would-be client.  We must track them because they are attached
+ * to shared memory, but we know they will never become live backends.
+ * dead_end children are not assigned a PMChildSlot.
  *
  * Background workers that request shared memory access during registration are
  * in this list, too.
@@ -243,7 +243,8 @@ static pid_t StartupPID = 0,
 			AutoVacPID = 0,
 			PgArchPID = 0,
 			PgStatPID = 0,
-			SysLoggerPID = 0;
+			SysLoggerPID = 0,
+			BgReclaimerPID = 0;
 
 /* Startup/shutdown state */
 #define			NoShutdown		0
@@ -269,13 +270,13 @@ static bool RecoveryError = false;		/* T if WAL recovery failed */
  * hot standby during archive recovery.
  *
  * When the startup process is ready to start archive recovery, it signals the
- * postmaster, and we switch to PM_RECOVERY state. The background writer and
- * checkpointer are launched, while the startup process continues applying WAL.
- * If Hot Standby is enabled, then, after reaching a consistent point in WAL
- * redo, startup process signals us again, and we switch to PM_HOT_STANDBY
- * state and begin accepting connections to perform read-only queries.  When
- * archive recovery is finished, the startup process exits with exit code 0
- * and we switch to PM_RUN state.
+ * postmaster, and we switch to PM_RECOVERY state. The background writer,
+ * background reclaimer and checkpointer are launched, while the startup
+ * process continues applying WAL.  If Hot Standby is enabled, then, after
+ * reaching a consistent point in WAL redo, startup process signals us again,
+ * and we switch to PM_HOT_STANDBY state and begin accepting connections to
+ * perform read-only queries.  When archive recovery is finished, the startup
+ * process exits with exit code 0 and we switch to PM_RUN state.
  *
  * Normal child backends can only be launched when we are in PM_RUN or
  * PM_HOT_STANDBY state.  (We also allow launch of normal
@@ -505,6 +506,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
+#define StartBackgroundReclaimer() StartChildProcess(BgReclaimerProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -568,8 +570,8 @@ PostmasterMain(int argc, char *argv[])
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
-	 * postmaster/syslogger.c, postmaster/bgworker.c and
-	 * postmaster/checkpointer.c.
+	 * postmaster/syslogger.c, postmaster/bgworker.c, postmaster/bgreclaimer.c
+	 * and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1589,7 +1591,8 @@ ServerLoop(void)
 		/*
 		 * If no background writer process is running, and we are not in a
 		 * state that prevents it, start one.  It doesn't matter if this
-		 * fails, we'll just try again later.  Likewise for the checkpointer.
+		 * fails, we'll just try again later.  Likewise for the checkpointer
+		 * and bgreclaimer.
 		 */
 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 			pmState == PM_HOT_STANDBY)
@@ -1598,6 +1601,8 @@ ServerLoop(void)
 				CheckpointerPID = StartCheckpointer();
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 		}
 
 		/*
@@ -2336,6 +2341,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(SysLoggerPID, SIGHUP);
 		if (PgStatPID != 0)
 			signal_child(PgStatPID, SIGHUP);
+		if (BgReclaimerPID != 0)
+			signal_child(BgReclaimerPID, SIGHUP);
 
 		/* Reload authentication config files too */
 		if (!load_hba())
@@ -2404,6 +2411,9 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
+				/* and the bgreclaimer too */
+				if (BgReclaimerPID != 0)
+					signal_child(BgReclaimerPID, SIGTERM);
 
 				/*
 				 * If we're in recovery, we can't kill the startup process
@@ -2446,14 +2456,16 @@ pmdie(SIGNAL_ARGS)
 				signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
+			if (BgReclaimerPID != 0)
+				signal_child(BgReclaimerPID, SIGTERM);
 			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
 				/*
-				 * Only startup, bgwriter, walreceiver, unconnected bgworkers,
-				 * and/or checkpointer should be active in this state; we just
-				 * signaled the first four, and we don't want to kill
-				 * checkpointer yet.
+				 * Only startup, bgwriter, walreceiver, bgreclaimer,
+				 * unconnected bgworkers, and/or checkpointer should be
+				 * active in this state; we just signaled the first five,
+				 * and we don't want to kill checkpointer yet.
 				 */
 				pmState = PM_WAIT_BACKENDS;
 			}
@@ -2606,6 +2618,8 @@ reaper(SIGNAL_ARGS)
 				BgWriterPID = StartBackgroundWriter();
 			if (WalWriterPID == 0)
 				WalWriterPID = StartWalWriter();
+			if (BgReclaimerPID == 0)
+				BgReclaimerPID = StartBackgroundReclaimer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2631,7 +2645,8 @@ reaper(SIGNAL_ARGS)
 		/*
 		 * Was it the bgwriter?  Normal exit can be ignored; we'll start a new
 		 * one at the next iteration of the postmaster's main loop, if
-		 * necessary.  Any other exit condition is treated as a crash.
+		 * necessary.  Any other exit condition is treated as a crash.  Likewise
+		 * for bgreclaimer.
 		 */
 		if (pid == BgWriterPID)
 		{
@@ -2642,6 +2657,17 @@ reaper(SIGNAL_ARGS)
 			continue;
 		}
 
+		if (pid == BgReclaimerPID)
+		{
+			BgReclaimerPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background reclaimer process"));
+			continue;
+		}
+
+
+
 		/*
 		 * Was it the checkpointer?
 		 */
@@ -3003,7 +3029,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, bgreclaimer or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3207,6 +3233,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the bgreclaimer too */
+	if (pid == BgReclaimerPID)
+		BgReclaimerPID = 0;
+	else if (BgReclaimerPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) BgReclaimerPID)));
+		signal_child(BgReclaimerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/*
 	 * Force a power-cycle of the pgarch process too.  (This isn't absolutely
 	 * necessary, but it seems like a good idea for robustness, and it
@@ -3377,14 +3415,14 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
 		 * (including autovac workers), no bgworkers (including unconnected
-		 * ones), and no walwriter, autovac launcher or bgwriter.  If we are
-		 * doing crash recovery or an immediate shutdown then we expect the
-		 * checkpointer to exit as well, otherwise not. The archiver, stats,
-		 * and syslogger processes are disregarded since they are not
-		 * connected to shared memory; we also disregard dead_end children
-		 * here. Walsenders are also disregarded, they will be terminated
-		 * later after writing the checkpoint record, like the archiver
-		 * process.
+		 * ones), and no walwriter, autovac launcher, bgwriter or bgreclaimer.
+		 * If we are doing crash recovery or an immediate shutdown then we
+		 * expect the checkpointer to exit as well, otherwise not. The
+		 * archiver, stats, and syslogger processes are disregarded since they
+		 * are not connected to shared memory; we also disregard dead_end
+		 * children here. Walsenders are also disregarded, they will be
+		 * terminated later after writing the checkpoint record, like the
+		 * archiver process.
 		 */
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
 			CountUnconnectedWorkers() == 0 &&
@@ -3394,7 +3432,8 @@ PostmasterStateMachine(void)
 			(CheckpointerPID == 0 ||
 			 (!FatalError && Shutdown < ImmediateShutdown)) &&
 			WalWriterPID == 0 &&
-			AutoVacPID == 0)
+			AutoVacPID == 0 &&
+			BgReclaimerPID == 0)
 		{
 			if (Shutdown >= ImmediateShutdown || FatalError)
 			{
@@ -3492,6 +3531,7 @@ PostmasterStateMachine(void)
 			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
+			Assert(BgReclaimerPID == 0);
 			/* syslogger is not considered here */
 			pmState = PM_NO_CHILDREN;
 		}
@@ -3704,6 +3744,8 @@ TerminateChildren(int signal)
 		signal_child(WalReceiverPID, signal);
 	if (AutoVacPID != 0)
 		signal_child(AutoVacPID, signal);
+	if (BgReclaimerPID != 0)
+		signal_child(BgReclaimerPID, signal);
 	if (PgArchPID != 0)
 		signal_child(PgArchPID, signal);
 	if (PgStatPID != 0)
@@ -4783,6 +4825,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		CheckpointerPID = StartCheckpointer();
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(BgReclaimerPID == 0);
+		BgReclaimerPID = StartBackgroundReclaimer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -5127,6 +5171,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 						(errmsg("could not fork WAL receiver process: %m")));
 				break;
+			case BgReclaimerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork background writer process: %m")));
+				break;
 			default:
 				ereport(LOG,
 						(errmsg("could not fork process: %m")));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 45d1d61..559d393 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1925,6 +1925,92 @@ BgBufferSync(void)
 }
 
 /*
+ * Move buffers with reference and a usage_count of zero to freelist. By
+ * maintaining enough buffers in the freelist (up to the list's high water
+ * mark), we drastically reduce the likelihood of individual backends
+ * having to perform the clock sweep themselves.
+ *
+ * This is called by the background reclaim process when the number
+ * of buffers in freelist falls below low water mark of freelist.
+ */
+void
+BgMoveBuffersToFreelist(void)
+{
+	uint32	num_needed_on_freelist = 0;
+	uint32	recent_alloc = 0;
+	uint32  recent_backend_clocksweep = 0;
+	volatile uint32	next_victim = 0;
+
+	/* Execute the clock sweep */
+	for (;;)
+	{
+		uint32	tmp_num_needed_on_freelist;
+		uint32	tmp_recent_alloc;
+		uint32  tmp_recent_backend_clocksweep;
+
+		StrategyGetFreelistAccessInfo(&tmp_num_needed_on_freelist,
+									  &tmp_recent_alloc,
+									  &tmp_recent_backend_clocksweep);
+
+		num_needed_on_freelist += tmp_num_needed_on_freelist;
+		recent_alloc += tmp_recent_alloc;
+		recent_backend_clocksweep += tmp_recent_backend_clocksweep;
+
+		if (tmp_num_needed_on_freelist == 0)
+			break;
+
+		while (tmp_num_needed_on_freelist > 0)
+		{
+			volatile BufferDesc *bufHdr;
+			bool	add_to_freelist = false;
+
+			/*
+			 * Choose next victim buffer to look if that can be moved
+			 * to freelist.
+			 */
+			StrategySyncNextVictimBuffer(&next_victim);
+
+			bufHdr = &BufferDescriptors[next_victim];
+
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot
+			 * move it to freelist; decrement the usage_count (unless pinned)
+			 * and keep scanning.
+			 */
+			LockBufHdr(bufHdr);
+			if (bufHdr->refcount == 0)
+			{
+				if (bufHdr->usage_count > 0)
+					bufHdr->usage_count--;
+				else
+					add_to_freelist = true;
+			}
+			UnlockBufHdr(bufHdr);
+
+			if (add_to_freelist && StrategyMoveBufferToFreeListTail(bufHdr))
+				tmp_num_needed_on_freelist--;
+		}
+
+		/*
+		 * Report buffer alloc and buffer request not satisfied
+		 * from freelist counts to pgstat.
+		 */
+		BgWriterStats.m_buf_alloc += recent_alloc;
+		BgWriterStats.m_buf_backend_clocksweep += recent_backend_clocksweep;
+
+		/*
+		 * Send off activity statistics to the stats collector
+		 */
+		pgstat_send_bgwriter();
+	}
+
+#ifdef BGW_DEBUG
+	elog(DEBUG1, "bgreclaimer: recent_alloc=%u recent_backend_clocksweep =%d next_victim=%d num_freed=%u",
+		 recent_alloc, recent_backend_clocksweep, next_victim, num_needed_on_freelist);
+#endif
+}
+
+/*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
  * If skip_recently_used is true, we don't write currently-pinned buffers, nor
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 5966beb..ec1caf9 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -32,6 +32,7 @@ typedef struct
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
+	int			numFreeListBuffers; /* number of buffers on freelist */
 
 	/*
 	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -40,19 +41,42 @@ typedef struct
 
 	/*
 	 * Statistics.  These counters should be wide enough that they can't
-	 * overflow during a single bgwriter cycle.
+	 * overflow during a single bgwriter cycle.  completePasses is only
+	 * recorded by bgwriter, numBufferBackendClocksweep is only recorded
+	 * by bgreclaimer, however numBufferAllocs is recorded by both bgwriter
+	 * and bgreclaimer.
 	 */
 	uint32		completePasses; /* Complete cycles of the clock sweep */
 	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
 
+	/* Buffers not statistied from freelist since last reset */
+	uint32		numBufferBackendClocksweep;
+
+	/*
+	 * Protects nextVictimBuffer and completePasses. We need separate
+	 * lock to protect victim buffer and completePasses so that
+	 * clock sweep of one backend doesn't contend with another backend
+	 * which is evicting buffer from freelist.  We can consider having
+	 * victimbuf_lck and freelist_lck in separate cache lines by keeping
+	 * them apart in structure and by adding padding bytes, however at
+	 * the moment there is no proof that having them in same cache line
+	 * hits the performance in any scenario.
+	 */
+	slock_t	     victimbuf_lck;
+
 	/*
 	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
 	 */
 	Latch	   *bgwriterLatch;
+
+	/*
+	 * Latch to wake bgreclaimer.
+	 */
+	Latch	   *bgreclaimerLatch;
 } BufferStrategyControl;
 
 /* Pointers to shared state */
-static BufferStrategyControl *StrategyControl = NULL;
+static volatile BufferStrategyControl *StrategyControl = NULL;
 
 /*
  * Private (non-shared) state for managing a ring of shared buffers to re-use.
@@ -87,6 +111,26 @@ typedef struct BufferAccessStrategyData
 	Buffer		buffers[1];		/* VARIABLE SIZE ARRAY */
 }	BufferAccessStrategyData;
 
+/*
+ * Water mark indicators for maintaining buffers on freelist.  When the
+ * number of buffers on freelist drops below the low water mark, the
+ * allocating backend sets the latch and bgreclaimer wakes up and begins
+ * adding buffers to the freelist until it reaches the high water mark and
+ * then again goes back to sleep.
+ */
+int freelistLowWaterMark;
+int freelistHighWaterMark;
+
+/*
+ * Percentage indicators for maintaining buffers on freelist.
+ * High water mark is percentage of total number of buffers (NBuffers).
+ * and Low water mark is percentage of the high water mark.
+ */
+#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.005
+#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT	0.2
+#define MIN_HIGH_WATER_MARK	5
+#define MAX_HIGH_WATER_MARK	2000
+
 
 /* Prototypes for internal functions */
 static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
@@ -109,8 +153,10 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
 volatile BufferDesc *
 StrategyGetBuffer(BufferAccessStrategy strategy)
 {
-	volatile BufferDesc *buf;
+	volatile BufferDesc *buf = NULL;
 	Latch	   *bgwriterLatch;
+	Latch	   *bgreclaimerLatch;
+	int			numFreeListBuffers;
 	int			trycounter;
 
 	/*
@@ -128,34 +174,24 @@ StrategyGetBuffer(BufferAccessStrategy strategy)
 	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
 	/*
-	 * We count buffer allocation requests so that the bgwriter can estimate
-	 * the rate of buffer consumption.  Note that buffers recycled by a
-	 * strategy object are intentionally not counted here.
+	 * We count buffer allocation requests so that the bgwriter or bgreclaimer
+	 * can estimate the rate of buffer consumption.  Note that buffers recycled
+	 * by a strategy object are intentionally not counted here.
 	 */
 	StrategyControl->numBufferAllocs++;
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding buffer_strategy_lock; so release and re-grab.
-	 * This is annoyingly tedious, but it happens at most once per bgwriter
-	 * cycle, so the performance hit is minimal.
+	 * Remember the values of bgwriter and bgreclaimer latch so that they can
+	 * be set outside spin lock and try to get a buffer from the freelist.
 	 */
+	bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
 	bgwriterLatch = StrategyControl->bgwriterLatch;
 	if (bgwriterLatch)
-	{
 		StrategyControl->bgwriterLatch = NULL;
-		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
-		SetLatch(bgwriterLatch);
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
-	}
 
-	/*
-	 * Try to get a buffer from the freelist.  Note that the freeNext fields
-	 * are considered to be protected by the buffer_strategy_lock not the
-	 * individual buffer spinlocks, so it's OK to manipulate them without
-	 * holding the spinlock.
-	 */
-	while (StrategyControl->firstFreeBuffer >= 0)
+	numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+	if (StrategyControl->firstFreeBuffer >= 0)
 	{
 		buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
 		Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
@@ -163,47 +199,81 @@ StrategyGetBuffer(BufferAccessStrategy strategy)
 		/* Unconditionally remove buffer from freelist */
 		StrategyControl->firstFreeBuffer = buf->freeNext;
 		buf->freeNext = FREENEXT_NOT_IN_LIST;
+		--StrategyControl->numFreeListBuffers;
+	}
+	else
+		StrategyControl->numBufferBackendClocksweep++;
 
-		/*
-		 * Release the lock so someone else can access the freelist (or run
-		 * the clocksweep) while we check out this buffer.
-		 */
-		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 
+	/* If bgwriterLatch is set, we need to waken the bgwriter */
+	if (bgwriterLatch)
+		SetLatch(bgwriterLatch);
+
+	/*
+	 * If the number of free buffers has fallen below the low water mark,
+	 * awaken the bgreclaimer to repopulate it.  bgreclaimerLatch is initialized in
+	 * early phase of BgReclaimer startup, however we still check before using
+	 * it to avoid any problem incase we reach here before its initializion.
+	 */
+	if (numFreeListBuffers < freelistLowWaterMark  && bgreclaimerLatch)
+		SetLatch(StrategyControl->bgreclaimerLatch);
+
+	if (buf != NULL)
+	{
 		/*
-		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
-		 * it; discard it and retry.  (This can only happen if VACUUM put a
-		 * valid buffer in the freelist and then someone else used it before
-		 * we got to it.  It's probably impossible altogether as of 8.3, but
-		 * we'd better check anyway.)
+		 * Try to get a buffer from the freelist.  Note that the freeNext fields
+		 * are considered to be protected by the freelist_lck not the
+		 * individual buffer spinlocks, so it's OK to manipulate them without
+		 * holding the buffer spinlock.
 		 */
-		LockBufHdr(buf);
-		if (buf->refcount == 0 && buf->usage_count == 0)
+		for(;;)
 		{
-			if (strategy != NULL)
-				AddBufferToRing(strategy, buf);
-			return buf;
-		}
-		UnlockBufHdr(buf);
+			/*
+			 * If the buffer is pinned or has a nonzero usage_count, we cannot use
+			 * it; discard it and retry.
+			 */
+			LockBufHdr(buf);
+			if (buf->refcount == 0 && buf->usage_count == 0)
+			{
+				if (strategy != NULL)
+					AddBufferToRing(strategy, buf);
+				return buf;
+			}
+			UnlockBufHdr(buf);
 
-		/* Reacquire the lock and go around for another pass. */
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+			SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+			if (StrategyControl->firstFreeBuffer >= 0)
+			{
+				buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+				Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+				/* Unconditionally remove buffer from freelist */
+				StrategyControl->firstFreeBuffer = buf->freeNext;
+				buf->freeNext = FREENEXT_NOT_IN_LIST;
+				--StrategyControl->numFreeListBuffers;
+
+				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+			}
+			else
+			{
+				StrategyControl->numBufferBackendClocksweep++;
+				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+				break;
+			}
+		}
 	}
 
 	/* Nothing on the freelist, so run the "clock sweep" algorithm */
 	trycounter = NBuffers;
 	for (;;)
 	{
-		buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
+		volatile uint32	next_victim;
 
-		if (++StrategyControl->nextVictimBuffer >= NBuffers)
-		{
-			StrategyControl->nextVictimBuffer = 0;
-			StrategyControl->completePasses++;
-		}
+		StrategySyncNextVictimBuffer(&next_victim);
 
-		/* Release the lock before manipulating the candidate buffer. */
-		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+		buf = &BufferDescriptors[next_victim];
 
 		/*
 		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -238,9 +308,6 @@ StrategyGetBuffer(BufferAccessStrategy strategy)
 			elog(ERROR, "no unpinned buffers available");
 		}
 		UnlockBufHdr(buf);
-
-		/* Reacquire the lock and get a new candidate buffer. */
-		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 	}
 }
 
@@ -268,6 +335,44 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
 }
 
 /*
+ * StrategyMoveBufferToFreeListTail: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListTail(volatile BufferDesc *buf)
+{
+	bool		freed = false;
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+	/*
+	 * It is possible that we are told to put something in the freelist that
+	 * is already in it; don't screw up the list if so.
+	 */
+	if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+	{
+		++StrategyControl->numFreeListBuffers;
+		freed = true;
+		/*
+		 * put the buffer on end of list and if list is empty then
+		 * assign first and last freebuffer with this buffer id.
+		 */
+		buf->freeNext = FREENEXT_END_OF_LIST;
+		if (StrategyControl->firstFreeBuffer < 0)
+		{
+			StrategyControl->firstFreeBuffer = buf->buf_id;
+			StrategyControl->lastFreeBuffer = buf->buf_id;
+			SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+			return freed;
+		}
+		BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+		StrategyControl->lastFreeBuffer = buf->buf_id;
+	}
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
+	return freed;
+}
+
+
+/*
  * StrategySyncStart -- tell BufferSync where to start syncing
  *
  * The result is the buffer index of the best buffer to sync first.
@@ -283,20 +388,84 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
 	int			result;
 
-	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
 	result = StrategyControl->nextVictimBuffer;
 	if (complete_passes)
 		*complete_passes = StrategyControl->completePasses;
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
 	if (num_buf_alloc)
 	{
+		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 		*num_buf_alloc = StrategyControl->numBufferAllocs;
 		StrategyControl->numBufferAllocs = 0;
+		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 	}
-	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 	return result;
 }
 
 /*
+ * StrategyGetFreelistAccessInfo -- get information required by bgreclaimer
+ * to move unused buffers to freelist.
+ *
+ * The result is the number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs and buffer allocs not
+ * satisfied from freelist.
+ */
+void
+StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
+							  uint32 *num_buf_backend_clocksweep)
+{
+	int			curfreebuffers;
+
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+	curfreebuffers = StrategyControl->numFreeListBuffers;
+	if (curfreebuffers < freelistHighWaterMark)
+		*num_buf_to_free = freelistHighWaterMark - curfreebuffers;
+	else
+		*num_buf_to_free = 0;
+
+	*num_buf_alloc = StrategyControl->numBufferAllocs;
+	StrategyControl->numBufferAllocs = 0;
+
+	*num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
+	StrategyControl->numBufferBackendClocksweep = 0;
+
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
+	return;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell bgreclaimer where to start looking
+ * for next unused buffer.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+	SpinLockAcquire(&StrategyControl->victimbuf_lck);
+	*next_victim_buffer = StrategyControl->nextVictimBuffer;
+	if (++StrategyControl->nextVictimBuffer >= NBuffers)
+	{
+		StrategyControl->nextVictimBuffer = 0;
+		StrategyControl->completePasses++;
+	}
+	SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
+ * StrategyInitBgReclaimerLatch -- Initialize bgreclaimer latch.
+ * This will be used by bgreclaimer to wake itself when backend
+ * sets this latch.
+ */
+void
+StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch)
+{
+	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+	StrategyControl->bgreclaimerLatch = bgreclaimerLatch;
+	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+}
+
+/*
  * StrategyNotifyBgWriter -- set or clear allocation notification latch
  *
  * If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -387,6 +556,7 @@ StrategyInitialize(bool init)
 		 */
 		StrategyControl->firstFreeBuffer = 0;
 		StrategyControl->lastFreeBuffer = NBuffers - 1;
+		StrategyControl->numFreeListBuffers = NBuffers;
 
 		/* Initialize the clock sweep pointer */
 		StrategyControl->nextVictimBuffer = 0;
@@ -394,12 +564,32 @@ StrategyInitialize(bool init)
 		/* Clear statistics */
 		StrategyControl->completePasses = 0;
 		StrategyControl->numBufferAllocs = 0;
+		StrategyControl->numBufferBackendClocksweep = 0;
 
 		/* No pending notification */
 		StrategyControl->bgwriterLatch = NULL;
+		SpinLockInit(&StrategyControl->victimbuf_lck);
 	}
 	else
 		Assert(!init);
+
+	/*
+	 * Initialize the low and high water mark number of buffer's
+	 * for freelist.  This is used to maintain buffer's on freelist
+	 * so that backend doesn't often need to perform clock sweep to
+	 * find the buffer.  We need to maintain enough buffers so that
+	 * requests can be satisfied from freelist.  These numbers
+	 * are based on results of benchmarks at various workloads.
+	 */
+	freelistHighWaterMark = HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT *
+							NBuffers;
+	if (freelistHighWaterMark < MIN_HIGH_WATER_MARK)
+		freelistHighWaterMark = MIN_HIGH_WATER_MARK;
+	else if (freelistHighWaterMark > MAX_HIGH_WATER_MARK)
+		freelistHighWaterMark = MAX_HIGH_WATER_MARK;
+
+	freelistLowWaterMark = LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT *
+						   freelistHighWaterMark;
 }
 
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 44ccd37..00d815f 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -100,6 +100,7 @@ extern Datum pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_buf_alloc(PG_FUNCTION_ARGS);
+extern Datum pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS);
 
 extern Datum pg_stat_get_xact_numscans(PG_FUNCTION_ARGS);
 extern Datum pg_stat_get_xact_tuples_returned(PG_FUNCTION_ARGS);
@@ -1496,6 +1497,12 @@ pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 }
 
 Datum
+pg_stat_get_buf_clocksweep_backend(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(pgstat_fetch_global()->buf_backend_clocksweep);
+}
+
+Datum
 pg_stat_get_xact_numscans(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 4736532..2a1f9c2 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2783,6 +2783,8 @@ DATA(insert OID = 3063 ( pg_stat_get_buf_fsync_backend PGNSP PGUID 12 1 0 0 0 f
 DESCR("statistics: number of backend buffer writes that did their own fsync");
 DATA(insert OID = 2859 ( pg_stat_get_buf_alloc			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_alloc _null_ _null_ _null_ ));
 DESCR("statistics: number of buffer allocations");
+DATA(insert OID = 4149 ( pg_stat_get_buf_clocksweep_backend			PGNSP PGUID 12 1 0 0 0 f f f f t f s 0 0 20 "" _null_ _null_ _null_ _null_ pg_stat_get_buf_clocksweep_backend _null_ _null_ _null_ ));
+DESCR("statistics: number of buffer allocations not satisfied from freelsit");
 
 DATA(insert OID = 2978 (  pg_stat_get_function_calls		PGNSP PGUID 12 1 0 0 0 f f f f t f s 1 0 20 "26" _null_ _null_ _null_ _null_ pg_stat_get_function_calls _null_ _null_ _null_ ));
 DESCR("statistics: number of function calls");
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2ba9885..a09eab3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -367,6 +367,7 @@ typedef enum
 	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
+	BgReclaimerProcess,
 
 	NUM_AUXPROCTYPES			/* Must be last! */
 } AuxProcType;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0892533..51a2023 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -397,6 +397,7 @@ typedef struct PgStat_MsgBgWriter
 	PgStat_Counter m_buf_written_backend;
 	PgStat_Counter m_buf_fsync_backend;
 	PgStat_Counter m_buf_alloc;
+	PgStat_Counter m_buf_backend_clocksweep;
 	PgStat_Counter m_checkpoint_write_time;		/* times in milliseconds */
 	PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
@@ -545,7 +546,7 @@ typedef union PgStat_Msg
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BC9D
 
 /* ----------
  * PgStat_StatDBEntry			The collector's data per database
@@ -670,6 +671,7 @@ typedef struct PgStat_GlobalStats
 	PgStat_Counter buf_written_backend;
 	PgStat_Counter buf_fsync_backend;
 	PgStat_Counter buf_alloc;
+	PgStat_Counter buf_backend_clocksweep;
 	TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
diff --git a/src/include/postmaster/bgreclaimer.h b/src/include/postmaster/bgreclaimer.h
new file mode 100644
index 0000000..bbd6943
--- /dev/null
+++ b/src/include/postmaster/bgreclaimer.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.h
+ *	  POSTGRES buffer reclaimer definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgreclaimer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGRECLAIMER_H
+#define _BGRECLAIMER_H
+
+extern void BackgroundReclaimerMain(void) __attribute__((noreturn));
+
+
+#endif   /* _BGRECLAIMER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0e69b63..3823fc4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -187,10 +187,16 @@ extern BufferDesc *LocalBufferDescriptors;
 /* freelist.c */
 extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListTail(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
 					 volatile BufferDesc *buf);
 
 extern int	StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free,
+										  uint32 *num_buf_alloc,
+										  uint32 *num_buf_backend_clocksweep);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
+extern void StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch);
 extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
 
 extern Size StrategyShmemSize(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 42d9120..da4f837 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -200,6 +200,7 @@ extern void AbortBufferIO(void);
 
 extern void BufmgrCommit(void);
 extern bool BgBufferSync(void);
+extern void BgMoveBuffersToFreelist(void);
 
 extern void AtProcExit_LocalBuffers(void);
 
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..b0688a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, Background reclaimer, checkpointer and WAL writer run
+ * during normal operation.  Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c79b45c..8748d70 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1698,6 +1698,7 @@ pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints
     pg_stat_get_buf_written_backend() AS buffers_backend,
     pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
     pg_stat_get_buf_alloc() AS buffers_alloc,
+    pg_stat_get_buf_clocksweep_backend() AS buffers_backend_clocksweep,
     pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 pg_stat_database| SELECT d.oid AS datid,
     d.datname,

#121

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Amit Kapila (#120)

Re: Scaling shared buffer eviction

On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On another point, I think it would be a good idea to rebase the
bgreclaimer patch over what I committed, so that we have a
clean patch against master to test with.

Please find the rebased patch attached with this mail. I have taken
some performance data as well and done some analysis based on
the same.

Performance Data
----------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
max_connections =300
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 5000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

I don't think OLTP really is the best test case for this. Especially not
pgbench with relatilvely small rows *and* a uniform distribution of
access.

Try parallel COPY TO. Batch write loads is where I've seen this hurt
badly.

patch_ver/client_count 1 8 32 64 128 256
HEAD 18884 118628 251093 216294 186625 177505
PATCH 18743 122578 247243 205521 179712 175031

So, pretty much no benefits on any scale, right?

Here we can see that the performance dips at higher client
count(>=32) which was quite surprising for me, as I was expecting
it to improve, because bgreclaimer reduces the contention by making
buffers available on free list. So I tried to analyze the situation by
using perf and found that in above configuration, there is a contention
around freelist spinlock with HEAD and the same is removed by Patch,
but still the performance goes down with Patch. On further analysis, I
observed that actually after Patch there is an increase in contention
around ProcArrayLock (shared LWlock) via GetSnapshotData which
sounds bit odd, but that's what I can see in profiles. Based on analysis,
few ideas which I would like to further investigate are:
a. As there is an increase in spinlock contention, I would like to check
with Andres's latest patch which reduces contention around shared
lwlocks.
b. Reduce some instructions added by patch in StrategyGetBuffer(),
like instead of awakening bgreclaimer at low threshold, awaken when
it tries to do clock sweep.

Are you sure you didn't mix up the profiles here? The head vs. patched
look more like profiles from different client counts than different
versions of the code.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#122

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Andres Freund (#121)

Re: Scaling shared buffer eviction

On Thu, Oct 9, 2014 at 7:31 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

On another point, I think it would be a good idea to rebase the
bgreclaimer patch over what I committed, so that we have a
clean patch against master to test with.

Please find the rebased patch attached with this mail. I have taken
some performance data as well and done some analysis based on
the same.

Performance Data
----------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
max_connections =300
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 5000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

I don't think OLTP really is the best test case for this. Especially not
pgbench with relatilvely small rows *and* a uniform distribution of
access.

Try parallel COPY TO. Batch write loads is where I've seen this hurt
badly.

patch_ver/client_count 1 8 32 64 128 256
HEAD 18884 118628 251093 216294 186625 177505
PATCH 18743 122578 247243 205521 179712 175031

So, pretty much no benefits on any scale, right?

Almost Right, there seem to be slight benefit at client count
8, however that can be due to variation as well.

Here we can see that the performance dips at higher client
count(>=32) which was quite surprising for me, as I was expecting
it to improve, because bgreclaimer reduces the contention by making
buffers available on free list. So I tried to analyze the situation by
using perf and found that in above configuration, there is a contention
around freelist spinlock with HEAD and the same is removed by Patch,
but still the performance goes down with Patch. On further analysis, I
observed that actually after Patch there is an increase in contention
around ProcArrayLock (shared LWlock) via GetSnapshotData which
sounds bit odd, but that's what I can see in profiles. Based on

analysis,

few ideas which I would like to further investigate are:
a. As there is an increase in spinlock contention, I would like to

check

with Andres's latest patch which reduces contention around shared
lwlocks.
b. Reduce some instructions added by patch in StrategyGetBuffer(),
like instead of awakening bgreclaimer at low threshold, awaken when
it tries to do clock sweep.

Are you sure you didn't mix up the profiles here?

I have tried this 2 times, basically I am quite confident from myside,
but human errors can't be ruled out. I have used below statements:

Steps used for profiling
during configure, use CFLAS="-fno-omit-frame-pointer"
Terminal -1
Start Server
Terminal -2
./pgbench -c 64 -j 64 -T 300 -S -M prepared postgres
Terminal-3
perf record -a -g sleep 60 --This command is run after a minute or so of
-- test start

After test is finished - perf report -g graph,0.5,callee

Do you see any problem in the way I am collecting perf reports?

In any case, I can try once more if you still doubt the profiles.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#123

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Andres Freund (#121)

2 attachment(s)

Re: Scaling shared buffer eviction

On 2014-10-09 16:01:55 +0200, Andres Freund wrote:

On 2014-10-09 18:17:09 +0530, Amit Kapila wrote:

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On another point, I think it would be a good idea to rebase the
bgreclaimer patch over what I committed, so that we have a
clean patch against master to test with.

Please find the rebased patch attached with this mail. I have taken
some performance data as well and done some analysis based on
the same.

Performance Data
----------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
max_connections =300
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 5000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins

I don't think OLTP really is the best test case for this. Especially not
pgbench with relatilvely small rows *and* a uniform distribution of
access.

Try parallel COPY TO. Batch write loads is where I've seen this hurt
badly.

As an example. The attached scripts go from:
progress: 5.3 s, 20.9 tps, lat 368.917 ms stddev 49.655
progress: 10.1 s, 21.0 tps, lat 380.326 ms stddev 64.525
progress: 15.1 s, 14.1 tps, lat 568.108 ms stddev 226.040
progress: 20.4 s, 12.0 tps, lat 634.557 ms stddev 300.519
progress: 25.2 s, 17.5 tps, lat 461.738 ms stddev 136.257
progress: 30.2 s, 9.8 tps, lat 850.766 ms stddev 305.454
progress: 35.3 s, 12.2 tps, lat 670.473 ms stddev 271.075
progress: 40.2 s, 7.9 tps, lat 972.617 ms stddev 313.152
progress: 45.3 s, 14.9 tps, lat 546.056 ms stddev 211.987
progress: 50.2 s, 13.2 tps, lat 610.608 ms stddev 271.780
progress: 55.5 s, 16.9 tps, lat 468.757 ms stddev 156.516
progress: 60.5 s, 14.3 tps, lat 548.913 ms stddev 190.414
progress: 65.7 s, 9.3 tps, lat 821.293 ms stddev 353.665
progress: 70.1 s, 16.0 tps, lat 524.240 ms stddev 174.903
progress: 75.2 s, 17.0 tps, lat 485.692 ms stddev 194.273
progress: 80.2 s, 19.9 tps, lat 396.295 ms stddev 78.891
progress: 85.3 s, 18.3 tps, lat 423.744 ms stddev 105.798
progress: 90.1 s, 14.5 tps, lat 577.373 ms stddev 270.914
progress: 95.3 s, 12.0 tps, lat 649.434 ms stddev 247.001
progress: 100.3 s, 14.6 tps, lat 563.693 ms stddev 275.236
tps = 14.812222 (including connections establishing)

to:
progress: 5.1 s, 18.9 tps, lat 409.766 ms stddev 75.032
progress: 10.3 s, 20.2 tps, lat 396.781 ms stddev 67.593
progress: 15.1 s, 19.1 tps, lat 418.545 ms stddev 109.431
progress: 20.3 s, 20.6 tps, lat 388.606 ms stddev 74.259
progress: 25.1 s, 19.5 tps, lat 406.591 ms stddev 109.050
progress: 30.0 s, 19.1 tps, lat 420.199 ms stddev 157.005
progress: 35.0 s, 18.4 tps, lat 421.102 ms stddev 124.019
progress: 40.3 s, 12.3 tps, lat 640.640 ms stddev 88.409
progress: 45.2 s, 12.8 tps, lat 586.471 ms stddev 145.543
progress: 50.5 s, 6.9 tps, lat 1116.603 ms stddev 285.479
progress: 56.2 s, 6.3 tps, lat 1349.055 ms stddev 381.095
progress: 60.6 s, 7.9 tps, lat 1083.745 ms stddev 452.386
progress: 65.0 s, 9.6 tps, lat 805.981 ms stddev 273.845
progress: 71.1 s, 9.6 tps, lat 798.273 ms stddev 184.108
progress: 75.2 s, 9.3 tps, lat 950.131 ms stddev 150.870
progress: 80.8 s, 8.6 tps, lat 899.389 ms stddev 135.090
progress: 85.3 s, 8.8 tps, lat 928.183 ms stddev 152.056
progress: 90.9 s, 8.0 tps, lat 929.737 ms stddev 71.155
progress: 95.7 s, 9.0 tps, lat 968.070 ms stddev 127.824
progress: 100.3 s, 8.7 tps, lat 911.767 ms stddev 130.697

just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
hope that with an approach like your's this might become better.

psql -f /tmp/prepare.sql
pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#124

Amit Kapila

amit.kapila16@gmail.com

over 11 years ago

In reply to: Andres Freund (#123)

Re: Scaling shared buffer eviction

On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-10-09 16:01:55 +0200, Andres Freund wrote:

I don't think OLTP really is the best test case for this. Especially not
pgbench with relatilvely small rows *and* a uniform distribution of
access.

Try parallel COPY TO. Batch write loads is where I've seen this hurt
badly.

just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
hope that with an approach like your's this might become better.

psql -f /tmp/prepare.sql
pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Thanks for providing the scripts. You haven't specified how much data
is present in 'large' file used in tests. I have tried with different set
of
rows, but I could not see the dip that is present in your data when you
increased shared buffers from 1GB to 8GB, also I don't see any difference
with patch. BTW, why do you think that for such worklaods this patch can
be helpful, according to my understanding it can be helpful mainly for
read mostly workloads when all the data doesn't fit in shared buffers.

Performance Data
-----------------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB

For 500000 rows
----------------------------
Data populated using below statement:
insert into largedata_64
('aaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbb',generate_series(1,500000));
copy largedata_64 to '/tmp/large' binary;

pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

shared_buffers - 1GB
---------------------------------
progress: 7.0 s, 2.7 tps, lat 2326.645 ms stddev 173.506
progress: 11.5 s, 3.5 tps, lat 2295.577 ms stddev 78.949
progress: 15.8 s, 3.7 tps, lat 2298.217 ms stddev 223.346
progress: 20.4 s, 3.5 tps, lat 2350.187 ms stddev 192.312
progress: 25.1 s, 3.4 tps, lat 2280.206 ms stddev 54.580
progress: 31.9 s, 3.4 tps, lat 2408.593 ms stddev 243.230
progress: 45.2 s, 1.1 tps, lat 5120.151 ms stddev 3913.561
progress: 50.5 s, 1.3 tps, lat 8967.954 ms stddev 3384.229
progress: 52.7 s, 2.7 tps, lat 3883.788 ms stddev 1733.293
progress: 55.6 s, 3.2 tps, lat 2684.282 ms stddev 348.615
progress: 58.2 s, 3.4 tps, lat 2602.355 ms stddev 268.718
progress: 60.8 s, 3.1 tps, lat 2361.937 ms stddev 302.643
progress: 65.3 s, 3.5 tps, lat 2341.903 ms stddev 162.338
progress: 74.1 s, 2.6 tps, lat 2720.182 ms stddev 716.425
progress: 76.4 s, 3.5 tps, lat 3023.234 ms stddev 670.473
progress: 80.4 s, 2.0 tps, lat 2795.323 ms stddev 820.429
progress: 85.6 s, 1.9 tps, lat 4756.217 ms stddev 844.284
progress: 91.9 s, 2.2 tps, lat 3996.001 ms stddev 1301.143
progress: 96.6 s, 3.5 tps, lat 2284.419 ms stddev 85.013
progress: 101.1 s, 3.5 tps, lat 2282.848 ms stddev 71.388
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 100 s
number of transactions actually processed: 275
latency average: 2939.784 ms
latency stddev: 1739.974 ms
tps = 2.710138 (including connections establishing)
tps = 2.710208 (excluding connections establishing)

shared_buffers - 8GB
------------------------------------
progress: 6.7 s, 2.7 tps, lat 2349.816 ms stddev 212.889
progress: 11.0 s, 3.5 tps, lat 2257.364 ms stddev 141.148
progress: 15.2 s, 3.8 tps, lat 2209.669 ms stddev 127.101
progress: 21.7 s, 3.7 tps, lat 2159.838 ms stddev 92.205
progress: 25.8 s, 3.9 tps, lat 2221.072 ms stddev 283.362
progress: 30.1 s, 3.5 tps, lat 2179.611 ms stddev 152.741
progress: 39.3 s, 2.1 tps, lat 2768.609 ms stddev 1265.508
progress: 50.9 s, 1.1 tps, lat 9361.388 ms stddev 2657.885
progress: 52.9 s, 1.0 tps, lat 2036.098 ms stddev 3.599
progress: 55.2 s, 4.3 tps, lat 2167.688 ms stddev 91.183
progress: 57.6 s, 3.0 tps, lat 2399.219 ms stddev 173.535
progress: 60.2 s, 4.1 tps, lat 2427.273 ms stddev 198.698
progress: 65.2 s, 3.4 tps, lat 2441.630 ms stddev 123.773
progress: 72.4 s, 2.9 tps, lat 2534.631 ms stddev 254.162
progress: 75.0 s, 3.9 tps, lat 2468.266 ms stddev 221.969
progress: 82.3 s, 3.0 tps, lat 2548.690 ms stddev 404.852
progress: 86.7 s, 1.4 tps, lat 3980.576 ms stddev 1205.743
progress: 92.5 s, 1.4 tps, lat 5174.340 ms stddev 643.415
progress: 97.1 s, 3.7 tps, lat 3252.847 ms stddev 1689.268
progress: 101.8 s, 3.4 tps, lat 2346.690 ms stddev 138.251
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 100 s
number of transactions actually processed: 284
latency average: 2856.195 ms
latency stddev: 1740.699 ms
tps = 2.781603 (including connections establishing)
tps = 2.781682 (excluding connections establishing)

For 5000 rows
------------------------
shared_buffers - 1GB
-----------------------------------
progress: 5.0 s, 357.7 tps, lat 22.295 ms stddev 3.511
progress: 10.0 s, 339.0 tps, lat 23.606 ms stddev 4.388
progress: 15.0 s, 323.4 tps, lat 24.733 ms stddev 5.001
progress: 20.0 s, 329.6 tps, lat 24.258 ms stddev 4.407
progress: 25.0 s, 334.3 tps, lat 23.963 ms stddev 4.126
progress: 30.0 s, 337.5 tps, lat 23.699 ms stddev 3.492
progress: 35.2 s, 158.6 tps, lat 37.182 ms stddev 189.946
progress: 40.3 s, 3.9 tps, lat 2587.129 ms stddev 762.231
progress: 45.4 s, 2.4 tps, lat 2525.946 ms stddev 942.428
progress: 50.0 s, 303.7 tps, lat 33.719 ms stddev 137.524
progress: 55.0 s, 331.5 tps, lat 24.122 ms stddev 3.806
progress: 60.0 s, 333.2 tps, lat 24.028 ms stddev 3.340
progress: 65.0 s, 336.1 tps, lat 23.802 ms stddev 3.601
progress: 70.0 s, 209.0 tps, lat 38.263 ms stddev 120.198
progress: 75.2 s, 141.2 tps, lat 54.350 ms stddev 168.274
progress: 80.0 s, 331.0 tps, lat 25.262 ms stddev 31.637
progress: 86.0 s, 10.9 tps, lat 721.991 ms stddev 750.484
progress: 90.0 s, 85.3 tps, lat 95.531 ms stddev 411.560
progress: 95.0 s, 318.0 tps, lat 25.152 ms stddev 3.985
progress: 100.5 s, 241.1 tps, lat 33.061 ms stddev 83.705
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 100 s
number of transactions actually processed: 24068
latency average: 33.400 ms
latency stddev: 138.159 ms
tps = 239.444193 (including connections establishing)
tps = 239.450511 (excluding connections establishing)

shared_buffers-8GB
---------------------------------
progress: 5.0 s, 339.3 tps, lat 23.514 ms stddev 3.853
progress: 10.0 s, 332.7 tps, lat 24.033 ms stddev 3.850
progress: 15.0 s, 329.7 tps, lat 24.290 ms stddev 3.236
progress: 20.0 s, 323.7 tps, lat 24.718 ms stddev 3.639
progress: 25.0 s, 338.0 tps, lat 23.650 ms stddev 2.916
progress: 30.0 s, 324.0 tps, lat 24.721 ms stddev 3.365
progress: 36.1 s, 56.7 tps, lat 127.433 ms stddev 530.344
progress: 41.0 s, 3.2 tps, lat 2393.639 ms stddev 469.533
progress: 45.0 s, 91.8 tps, lat 104.049 ms stddev 418.744
progress: 50.0 s, 331.4 tps, lat 24.143 ms stddev 2.398
progress: 55.0 s, 332.7 tps, lat 24.067 ms stddev 2.810
progress: 60.0 s, 331.1 tps, lat 24.136 ms stddev 3.449
progress: 65.0 s, 304.0 tps, lat 26.332 ms stddev 33.693
progress: 70.9 s, 227.6 tps, lat 34.153 ms stddev 121.504
progress: 75.0 s, 295.1 tps, lat 28.236 ms stddev 52.897
progress: 82.2 s, 44.0 tps, lat 160.993 ms stddev 632.587
progress: 85.0 s, 85.3 tps, lat 121.065 ms stddev 432.011
progress: 90.0 s, 325.9 tps, lat 24.581 ms stddev 2.545
progress: 95.0 s, 333.4 tps, lat 23.989 ms stddev 1.709
progress: 100.0 s, 292.2 tps, lat 27.330 ms stddev 41.678
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 8
duration: 100 s
number of transactions actually processed: 25039
latency average: 31.955 ms
latency stddev: 136.304 ms
tps = 250.328882 (including connections establishing)
tps = 250.335432 (excluding connections establishing)

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#125

Andres Freund

andres@2ndquadrant.com

over 11 years ago

In reply to: Amit Kapila (#124)

Re: Scaling shared buffer eviction

On 2014-10-10 12:28:13 +0530, Amit Kapila wrote:

On Fri, Oct 10, 2014 at 1:08 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-10-09 16:01:55 +0200, Andres Freund wrote:

I don't think OLTP really is the best test case for this. Especially not
pgbench with relatilvely small rows *and* a uniform distribution of
access.

Try parallel COPY TO. Batch write loads is where I've seen this hurt
badly.

just by switching shared_buffers from 1 to 8GB. I haven't tried, but I
hope that with an approach like your's this might become better.

psql -f /tmp/prepare.sql
pgbench -P5 -n -f /tmp/copy.sql -c 8 -j 8 -T 100

Thanks for providing the scripts. You haven't specified how much data
is present in 'large' file used in tests.

I don't think it matters much. It should be small enough that you get a
couple TPS over all backends.

I have tried with different set of
rows, but I could not see the dip that is present in your data when you
increased shared buffers from 1GB to 8GB, also I don't see any difference
with patch.

Interesting. I wonder whether that's because the concurrency wasn't high
enough for that machine to show the problem. I ran the test on my
workstation which has 8 actual cores.

BTW, why do you think that for such worklaods this patch can
be helpful, according to my understanding it can be helpful mainly for
read mostly workloads when all the data doesn't fit in shared buffers.

The performance dip comes from all the backends performing the clock
sweep. As the access is pretty uniform all the buffers start with some
usage count (IIRC 3 in this example. Much worse if 5). Due to the
uniform usagecount the clock sweep frequently has to go several times
through *all* the buffers. That leads to quite horrible performance in
some cases.
I had hoped that bgreclaimer can make that workload les horrible by
funneling most of the accesses through the freelist.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#126

Amit Kapila

amit.kapila16@gmail.com

about 11 years ago

In reply to: Amit Kapila (#120)

1 attachment(s)

Re: Scaling shared buffer eviction

On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

On another point, I think it would be a good idea to rebase the
bgreclaimer patch over what I committed, so that we have a
clean patch against master to test with.

Please find the rebased patch attached with this mail. I have taken
some performance data as well and done some analysis based on
the same.

Below data is median of 3 runs.

patch_ver/client_count 1 8 32 64 128 256
HEAD 18884 118628 251093 216294 186625 177505
PATCH 18743 122578 247243 205521 179712 175031

Here we can see that the performance dips at higher client
count(>=32) which was quite surprising for me, as I was expecting
it to improve, because bgreclaimer reduces the contention by making
buffers available on free list. So I tried to analyze the situation by
using perf and found that in above configuration, there is a contention
around freelist spinlock with HEAD and the same is removed by Patch,
but still the performance goes down with Patch. On further analysis, I
observed that actually after Patch there is an increase in contention
around ProcArrayLock (shared LWlock) via GetSnapshotData which
sounds bit odd, but that's what I can see in profiles. Based on analysis,
few ideas which I would like to further investigate are:
a. As there is an increase in spinlock contention, I would like to check
with Andres's latest patch which reduces contention around shared
lwlocks.

I have tried by applying Andres's Wait free LW_SHARED acquisition
patch posted by him at below link along with bgreclaimer patch:
/messages/by-id/20141008133533.GA5053@alap3.anarazel.de

After that I observed that contention for LW_SHARED has reduced
for this load, but it didn't help much in terms of performance, so I again
rechecked the profile and this time most of the contention is moved
to spinlock used in dynahash for buf mapping tables, please refer
the profile (for 128 client count; Read only load) below:

bgreclaimer patch + wait free lw_shared acquisition patches -
------------------------------------------------------------------------------------------

    9.24%          swapper  [unknown]           [H] 0x00000000011d9c10
+   7.19%         postgres  postgres            [.] s_lock
+   3.52%         postgres  postgres            [.] GetSnapshotData
+   3.02%         postgres  postgres            [.] calc_bucket
+   2.71%         postgres  postgres            [.]
hash_search_with_hash_value
    2.32%         postgres  [unknown]           [H] 0x00000000011e0d7c
+   2.17%         postgres  postgres            [.]
pg_atomic_fetch_add_u32_impl
+   1.84%         postgres  postgres            [.] AllocSetAlloc
+   1.57%         postgres  postgres            [.] _bt_compare
+   1.05%         postgres  postgres            [.] AllocSetFreeIndex
+   1.02%         postgres  [kernel.kallsyms]   [k]
.__copy_tofrom_user_power7
+   0.94%         postgres  postgres            [.] tas
+   0.85%          swapper  [kernel.kallsyms]   [k] .int_sqrt
+   0.80%         postgres  postgres            [.] pg_encoding_mbcliplen
+   0.78%          pgbench  [kernel.kallsyms]   [k] .find_busiest_group
    0.65%          pgbench  [unknown]           [H] 0x00000000011d96e0
+   0.59%         postgres  postgres            [.] hash_any
+   0.54%         postgres  postgres            [.] LWLockRelease

Detailed Profile Data
-----------------------------------------
- 7.19% postgres postgres [.] s_lock

- s_lock

- 3.79% s_lock

- 1.69% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

- 1.63% hash_search_with_hash_value

- 1.63% BufTableDelete

BufferAlloc

ReadBuffer_common

ReadBufferExtended
- 1.45% hash_search_with_hash_value

- 1.45% hash_search_with_hash_value

BufTableDelete

BufferAlloc

ReadBuffer_common
- 1.43% get_hash_entry

- 1.43% get_hash_entry

hash_search_with_hash_value

BufTableInsert

BufferAlloc

ReadBuffer_common
- 3.52% postgres postgres [.] GetSnapshotData

- GetSnapshotData

- 3.50% GetSnapshotData

- 3.49% GetTransactionSnapshot

- 1.75% exec_bind_message

PostgresMain

- 1.74% PortalStart

exec_bind_message

PostgresMain

To reduce above contention, I tried to write a patch to replace spin lock
used in dynahash to manage free list by atomic operations. Still there
is work pending for this patch with respect to ensuring whether the
approach used in patch is completely sane, however I am posting the
patch so that others can have a look at it and give me feedback about
the approach.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

change_dynahash_table_impl_using_atomics_v1.patchapplication/octet-stream; name=change_dynahash_table_impl_using_atomics_v1.patchDownload

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 2b99e4b..64f25bf 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -71,6 +71,7 @@
 #include <limits.h>
 
 #include "access/xact.h"
+#include "port/atomics.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/dynahash.h"
@@ -387,6 +388,7 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 
 	if (flags & HASH_PARTITION)
 	{
+		Size	elementSize;
 		/* Doesn't make sense to partition a local hash table */
 		Assert(flags & HASH_SHARED_MEM);
 
@@ -398,6 +400,13 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 		Assert(info->num_partitions == next_pow2_int(info->num_partitions));
 
 		hctl->num_partitions = info->num_partitions;
+
+		/*
+		 * allocate the first element of freelist, as the same is expected
+		 * to be always present before any operation on freelist.
+		 */
+		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+		hctl->freeList = (HASHELEMENT *) hashp->alloc(elementSize);
 	}
 
 	if (flags & HASH_SEGMENT)
@@ -912,22 +921,52 @@ hash_search_with_hash_value(HTAB *hashp,
 				/* use volatile pointer to prevent code rearrangement */
 				volatile HASHHDR *hctlv = hctl;
 
-				/* if partitioned, must lock to touch nentries and freeList */
+				/*
+				 * if partitioned, must use atomic ops to touch nentries
+				 * and freeList
+				 */
 				if (IS_PARTITIONED(hctlv))
-					SpinLockAcquire(&hctlv->mutex);
-
-				Assert(hctlv->nentries > 0);
-				hctlv->nentries--;
+				{
+					uint64		freelistval;
+
+					pg_atomic_sub_fetch_u64((volatile pg_atomic_uint64*)&hctlv->nentries,1);
+
+					pg_atomic_write_u64((volatile pg_atomic_uint64*)prevBucketPtr, *(uint64*)currBucket);
+
+					/*
+					 * to add an entry to freelist, we need to save the freelist
+					 * head, and then add an entry to head and then finally using
+					 * compare and exchange point the freelist to head.  Now it is quite
+					 * possible that when it tries to point head of freelist to newly
+					 * added entry, the freelist head is already changed, so this
+					 * operation needs to be tried in loop.  One point to note here
+					 * is that the freed elements are always added to the second slot of
+					 * freelist as there doesn't seem to be a way with which we can add
+					 * it as first element using currently supportted atomic operations.
+					 */
+					while (true)
+					{
+						freelistval = pg_atomic_read_u64((volatile pg_atomic_uint64*)hctlv->freeList);
+
+						pg_atomic_exchange_u64((volatile pg_atomic_uint64*)currBucket,*(uint64*)hctlv->freeList);
+						if (pg_atomic_compare_exchange_u64((volatile pg_atomic_uint64*)hctlv->freeList,
+														   &freelistval,
+														   (uint64)currBucket))
+							break;
+					}
+				}
+				else
+				{
+					Assert(hctlv->nentries > 0);
+					hctlv->nentries--;
 
-				/* remove record from hash bucket's chain. */
-				*prevBucketPtr = currBucket->link;
+					/* remove record from hash bucket's chain. */
+					*prevBucketPtr = currBucket->link;
 
-				/* add the record to the freelist for this table.  */
-				currBucket->link = hctlv->freeList;
-				hctlv->freeList = currBucket;
-
-				if (IS_PARTITIONED(hctlv))
-					SpinLockRelease(&hctlv->mutex);
+					/* add the record to the freelist for this table.  */
+					currBucket->link = hctlv->freeList;
+					hctlv->freeList = currBucket;
+				}
 
 				/*
 				 * better hope the caller is synchronizing access to this
@@ -1152,36 +1191,69 @@ get_hash_entry(HTAB *hashp)
 	volatile HASHHDR *hctlv = hashp->hctl;
 	HASHBUCKET	newElement;
 
-	for (;;)
+	if (IS_PARTITIONED(hctlv))
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
-		if (IS_PARTITIONED(hctlv))
-			SpinLockAcquire(&hctlv->mutex);
+		uint64 newElementVal;
 
-		/* try to get an entry from the freelist */
-		newElement = hctlv->freeList;
-		if (newElement != NULL)
-			break;
-
-		/* no free elements.  allocate another chunk of buckets */
-		if (IS_PARTITIONED(hctlv))
-			SpinLockRelease(&hctlv->mutex);
-
-		if (!element_alloc(hashp, hctlv->nelem_alloc))
+		/* if partitioned, must use atomic ops to touch nentries and freeList */
+		for (;;)
 		{
-			/* out of memory */
-			return NULL;
+		   /*
+			* to remove an entry from freelist, we need to save the freelist
+			* head, and then using compare and exchange remove it from the freelist
+			* head.  Now it is quite possible that when it tries to remove an
+			* entry from head of freelist, the freelist head is already changed,
+			* so this operation needs to be tried in loop.  One point to note here
+			* is that an entry is always removed from the second slot of
+			* freelist as there doesn't seem to be a way with which we can remove
+			* it from first slot using currently supportted atomic operations.
+			*/
+			newElementVal = pg_atomic_read_u64((volatile pg_atomic_uint64*)hctlv->freeList);
+			newElement = (HASHELEMENT*)newElementVal;
+			if (newElement != NULL)
+			{
+				/* remove entry from freelist */
+				if (pg_atomic_compare_exchange_u64((volatile pg_atomic_uint64*)hctlv->freeList,
+												   &newElementVal,
+												   *(uint64*)newElement))
+					break;
+			}
+			else
+			{
+				/* no free elements.  allocate another chunk of buckets */
+				if (!element_alloc(hashp, hctlv->nelem_alloc))
+				{
+					/* out of memory */
+					return NULL;
+				}
+			}
 		}
+
+		/* bump nentries */
+		pg_atomic_add_fetch_u64((volatile pg_atomic_uint64*)&hctlv->nentries,1);
 	}
+	else
+	{
+		for (;;)
+		{
+			/* try to get an entry from the freelist */
+			newElement = hctlv->freeList;
+			if (newElement != NULL)
+				break;
 
-	/* remove entry from freelist, bump nentries */
-	hctlv->freeList = newElement->link;
-	hctlv->nentries++;
+			if (!element_alloc(hashp, hctlv->nelem_alloc))
+			{
+				/* out of memory */
+				return NULL;
+			}
+		}
 
-	if (IS_PARTITIONED(hctlv))
-		SpinLockRelease(&hctlv->mutex);
+		/* remove entry from freelist, bump nentries */
+		hctlv->freeList = newElement->link;
+		hctlv->nentries++;
+	 }
 
-	return newElement;
+	 return newElement;
 }
 
 /*
@@ -1536,14 +1608,37 @@ element_alloc(HTAB *hashp, int nelem)
 
 	/* if partitioned, must lock to touch freeList */
 	if (IS_PARTITIONED(hctlv))
-		SpinLockAcquire(&hctlv->mutex);
-
-	/* freelist could be nonempty if two backends did this concurrently */
-	firstElement->link = hctlv->freeList;
-	hctlv->freeList = prevElement;
+	{
+		uint64	freelistval;
+
+	   /*
+		* to prepare the freelist, the freelist head needs to point to the
+		* prepared list, and to achieve the same we need to use compare and
+		* exchange operation.  Now it is quite possible that when it tries to
+		* point head of freelist to newly prepared list, the freelist head is
+		* already changed, so this operation needs to be tried in loop.  One
+		* point to note here is that the list is always added to the second
+		* slot of freelist as there doesn't seem to be a way with which we
+		* can add it as first element using currently supportted atomic
+		* operations.  The first slot is always alllocated during hash_create.
+		*/
+		while (true)
+		{
+			freelistval = pg_atomic_read_u64((volatile pg_atomic_uint64*)hctlv->freeList);
 
-	if (IS_PARTITIONED(hctlv))
-		SpinLockRelease(&hctlv->mutex);
+			pg_atomic_exchange_u64((volatile pg_atomic_uint64*)firstElement,*(uint64*)hctlv->freeList);
+			if (pg_atomic_compare_exchange_u64((volatile pg_atomic_uint64*)hctlv->freeList,
+												&freelistval,
+												(uint64)prevElement))
+				break;
+		}
+	}
+	else
+	{
+		/* freelist could be nonempty if two backends did this concurrently */
+		firstElement->link = hctlv->freeList;
+		hctlv->freeList = prevElement;
+	}
 
 	return true;
 }

#127

Andres Freund

andres@2ndquadrant.com

about 11 years ago

In reply to: Amit Kapila (#126)

Re: Scaling shared buffer eviction

On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:

bgreclaimer patch + wait free lw_shared acquisition patches -
------------------------------------------------------------------------------------------

9.24%          swapper  [unknown]           [H] 0x00000000011d9c10
+   7.19%         postgres  postgres            [.] s_lock
+   3.52%         postgres  postgres            [.] GetSnapshotData
+   3.02%         postgres  postgres            [.] calc_bucket
+   2.71%         postgres  postgres            [.]
hash_search_with_hash_value
2.32%         postgres  [unknown]           [H] 0x00000000011e0d7c
+   2.17%         postgres  postgres            [.]
pg_atomic_fetch_add_u32_impl
+   1.84%         postgres  postgres            [.] AllocSetAlloc
+   1.57%         postgres  postgres            [.] _bt_compare
+   1.05%         postgres  postgres            [.] AllocSetFreeIndex
+   1.02%         postgres  [kernel.kallsyms]   [k]
.__copy_tofrom_user_power7
+   0.94%         postgres  postgres            [.] tas
+   0.85%          swapper  [kernel.kallsyms]   [k] .int_sqrt
+   0.80%         postgres  postgres            [.] pg_encoding_mbcliplen
+   0.78%          pgbench  [kernel.kallsyms]   [k] .find_busiest_group
0.65%          pgbench  [unknown]           [H] 0x00000000011d96e0
+   0.59%         postgres  postgres            [.] hash_any
+   0.54%         postgres  postgres            [.] LWLockRelease

This profile is without -O2 again. I really don't think it makes much
sense to draw much inference from an unoptimized build. I realize that
you said that the builds you use for benchmarking don't have that
problem, but that doesn't make this profile meaningful...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#128

Amit Kapila

amit.kapila16@gmail.com

about 11 years ago

In reply to: Andres Freund (#127)

Re: Scaling shared buffer eviction

On Tue, Oct 14, 2014 at 3:32 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:

After that I observed that contention for LW_SHARED has reduced
for this load, but it didn't help much in terms of performance, so I

again

rechecked the profile and this time most of the contention is moved
to spinlock used in dynahash for buf mapping tables, please refer
the profile (for 128 client count; Read only load) below:

bgreclaimer patch + wait free lw_shared acquisition patches -

------------------------------------------------------------------------------------------

This profile is without -O2 again. I really don't think it makes much
sense to draw much inference from an unoptimized build.

Profile data with -O2 is below. This shows that top
contributors are calls for BufTableLookup and spin lock caused
by BufTableInsert and BufTableDelete. To resolve spin lock
contention, patch like above might prove to be useful (although
I have to still evaluate the same). I would like to once take
LWLOCK_STATS data as well before proceeding further.
Do you have any other ideas?

11.17% swapper [unknown] [H] 0x00000000011e0328

+   4.62%         postgres  postgres            [.]
hash_search_with_hash_value
+   4.35%          pgbench  [kernel.kallsyms]   [k] .find_busiest_group

+ 3.71% postgres postgres [.] s_lock

2.56% postgres [unknown] [H] 0x0000000001500120

+ 2.23% pgbench [kernel.kallsyms] [k] .idle_cpu

+ 1.97% postgres postgres [.] LWLockAttemptLock

+ 1.73% postgres postgres [.] LWLockRelease

+   1.47%         postgres  [kernel.kallsyms]   [k]
.__copy_tofrom_user_power7
+   1.44%         postgres  postgres            [.] GetSnapshotData

+ 1.28% postgres postgres [.] _bt_compare

+ 1.04% swapper [kernel.kallsyms] [k] .int_sqrt

+ 1.04% postgres postgres [.] AllocSetAlloc

+ 0.97% pgbench [kernel.kallsyms] [k] .default_wake_function

Detailed Data
----------------
- 4.62% postgres postgres [.]
hash_search_with_hash_value
- hash_search_with_hash_value

- 2.19% BufTableLookup

- 2.15% BufTableLookup

ReadBuffer_common

- ReadBufferExtended

- 1.32% _bt_relandgetbuf

- 0.73% BufTableDelete

- 0.71% BufTableDelete

ReadBuffer_common

ReadBufferExtended

- 0.69% BufTableInsert

- 0.68% BufTableInsert

ReadBuffer_common

ReadBufferExtended

0.66% hash_search_with_hash_value

- 4.35% pgbench [kernel.kallsyms] [k] .find_busiest_group

- .find_busiest_group

- 4.28% .find_busiest_group

- 4.26% .load_balance

- 4.26% .idle_balance

- .__schedule

- 4.26% .schedule_hrtimeout_range_clock

.do_select

.core_sys_select

- 3.71% postgres postgres [.] s_lock

- s_lock

- 3.19% hash_search_with_hash_value

- 3.18% hash_search_with_hash_value

- 1.60% BufTableInsert

ReadBuffer_common

- ReadBufferExtended

- 1.57% BufTableDelete

ReadBuffer_common

- ReadBufferExtended

- 0.93% index_fetch_heap

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#129

Amit Kapila

amit.kapila16@gmail.com

about 11 years ago

In reply to: Amit Kapila (#126)

Re: Scaling shared buffer eviction

On Tue, Oct 14, 2014 at 3:24 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

On another point, I think it would be a good idea to rebase the
bgreclaimer patch over what I committed, so that we have a
clean patch against master to test with.

Please find the rebased patch attached with this mail. I have taken
some performance data as well and done some analysis based on
the same.

To reduce above contention, I tried to write a patch to replace spin lock
used in dynahash to manage free list by atomic operations. Still there
is work pending for this patch with respect to ensuring whether the
approach used in patch is completely sane, however I am posting the
patch so that others can have a look at it and give me feedback about
the approach.

After further working on this patch (replacement of spinlock in dynahash),
I found that I need to solve A-B-A problem for lockless structure as
described in section 5.1 of paper [1]http://www.liblfds.org/mediawiki/images/1/1d/Valois_-_Lock-Free_Linked_Lists_Using_Compare-and-Swap.pdf. I could have further pursued to
solve
it by using some additional variables as described in section 5 of paper
[1]: http://www.liblfds.org/mediawiki/images/1/1d/Valois_-_Lock-Free_Linked_Lists_Using_Compare-and-Swap.pdf
but it seems to me that Robert has already solved it by using some other
technique as proposed by him in patch [2]https://commitfest.postgresql.org/action/patch_view?id=1613, so it would be waste of effort
for me to pursue on this problem. So I am not planning to continue on this
patch and marking it as Rejected in CF app.

[1]: http://www.liblfds.org/mediawiki/images/1/1d/Valois_-_Lock-Free_Linked_Lists_Using_Compare-and-Swap.pdf
http://www.liblfds.org/mediawiki/images/1/1d/Valois_-_Lock-Free_Linked_Lists_Using_Compare-and-Swap.pdf
[2]: https://commitfest.postgresql.org/action/patch_view?id=1613

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#130

Thom Brown

thom@linux.com

about 11 years ago

In reply to: Amit Kapila (#129)

Re: Scaling shared buffer eviction

On 26 September 2014 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

But this gets at another point: the way we're benchmarking this right
now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today

Were parts 2 and 3 committed in the end?

--
Thom

#131

Andres Freund

andres@2ndquadrant.com

about 11 years ago

In reply to: Thom Brown (#130)

Re: Scaling shared buffer eviction

On 2014-11-11 09:29:22 +0000, Thom Brown wrote:

On 26 September 2014 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

But this gets at another point: the way we're benchmarking this right
now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today

Were parts 2 and 3 committed in the end?

3 was committed. 2 wasn't because it's not yet clear whether how
beneficial it is generally.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#132

Amit Kapila

amit.kapila16@gmail.com

about 11 years ago

In reply to: Andres Freund (#131)

Re: Scaling shared buffer eviction

On Tue, Nov 11, 2014 at 3:00 PM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2014-11-11 09:29:22 +0000, Thom Brown wrote:

On 26 September 2014 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Sep 23, 2014 at 10:31 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

But this gets at another point: the way we're benchmarking this

right

now, we're really conflating the effects of three different things:

1. Changing the locking regimen around the freelist and clocksweep.
2. Adding a bgreclaimer process.
3. Raising the number of buffer locking partitions.

First of all thanks for committing part-1 of this changes and it
seems you are planing to commit part-3 based on results of tests
which Andres is planing to do and for remaining part (part-2), today

Were parts 2 and 3 committed in the end?

3 was committed. 2 wasn't because it's not yet clear whether how
beneficial it is generally.

As shown upthread that this patch (as it stands today) is dependent on
another patch (wait free LW_SHARED acquisition) which is still not
committed and still some more work is needed to clearly show the
gain by this patch, so I have marked it as "Returned with Feedback".

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com