Scaling shared buffer eviction
As mentioned previously, I am interested in improving shared buffer
eviction, especially by reducing contention around BufFreelistLock, and I
would like to share my progress on the same.
The test used for this work is mainly the case where all the data doesn't
fit in shared buffers but does fit in memory. It is based on a previous
comparison done by Robert for a similar workload:
http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html
To start with, I have taken an LWLOCK_STATS report to confirm the
contention around BufFreelistLock; the data for HEAD is as follows:
M/c details
IBM POWER-7 16 cores, 64 hardware threads
RAM - 64GB
Test
scale factor = 3000
shared_buffers = 8GB
number_of_threads = 64
duration = 5mins
./pgbench -c 64 -j 64 -T 300 -S postgres
LWLOCK_STATS data for BufFreeListLock
PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023
Here the high *blk* count for scale factor 3000 clearly shows that when the
data doesn't fit in shared buffers, a backend has to wait to find a usable
buffer.
To solve this issue, I have implemented a patch which makes sure that there
are always enough buffers on the freelist so that the need for a backend to
run the clock sweep is minimal. The implementation idea is more or less the
same as discussed previously in the thread below, so I will explain it at
the end of this mail.
/messages/by-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com
LWLOCK_STATS data after the patch (the test used is the same as for HEAD):
BufFreeListLock
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0
Here the low *exacq* and *blk* counts show that the need for backends to
run the clock sweep has reduced significantly.
Performance Data
-------------------------------
shared_buffers= 8GB
number of threads - 64
sc - scale factor
            sc      tps
Head      3000    45569
Patch     3000    46457
Head      1000    93037
Patch     1000    92711
The above data shows that there is no significant change in performance or
scalability even though the contention around BufFreelistLock is reduced
significantly.
I have analyzed the patch with both perf record and LWLOCK_STATS; both
indicate that there is high contention around BufMappingLocks.
Data With perf record -a -g
-----------------------------------------
+  10.14%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   7.77%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
+   6.88%  postgres  [kernel.kallsyms]  [k] .function_trace_call
+   4.15%  pgbench   [kernel.kallsyms]  [k] .try_to_wake_up
+   3.20%  swapper   [kernel.kallsyms]  [k] .function_trace_call
+   2.99%  pgbench   [kernel.kallsyms]  [k] .function_trace_call
+   2.41%  postgres  postgres           [.] AllocSetAlloc
+   2.38%  postgres  [kernel.kallsyms]  [k] .try_to_wake_up
+   2.27%  pgbench   [kernel.kallsyms]  [k] ._raw_spin_lock
+   1.49%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock_irq
+   1.36%  postgres  postgres           [.] AllocSetFreeIndex
+   1.09%  swapper   [kernel.kallsyms]  [k] ._raw_spin_lock
+   0.91%  postgres  postgres           [.] GetSnapshotData
+   0.90%  postgres  postgres           [.] MemoryContextAllocZeroAligned
Expanded graph
------------------------------
-  10.14%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
   - .pseries_dedicated_idle_sleep
      - 10.13% .pseries_dedicated_idle_sleep
         - 10.13% .cpu_idle
            - 10.00% .start_secondary
                     .start_secondary_prolog
-   7.77%  postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
   - ._raw_spin_lock
      - 6.63% ._raw_spin_lock
         - 5.95% .double_rq_lock
            - .load_balance
               - 5.95% .__schedule
                  - .schedule
                     - 3.27% .SyS_semtimedop
                             .SyS_ipc
                             syscall_exit
                             semop
                             PGSemaphoreLock
                             LWLockAcquireCommon
                        - LWLockAcquire
                           - 3.27% BufferAlloc
                                   ReadBuffer_common
                              - ReadBufferExtended
                                 - 3.27% ReadBuffer
                                    - 2.73% ReleaseAndReadBuffer
                                       - 1.70% _bt_relandgetbuf
                                               _bt_search
                                               _bt_first
                                               btgettuple
It shows BufferAlloc->LWLockAcquire as a top contributor, and we use
BufMappingLocks in BufferAlloc. I have checked the other expanded calls as
well; StrategyGetBuffer is not present among the top contributors.
Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks
PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22
This data shows that after the patch there is no contention for
BufFreelistLock; rather, there is huge contention around BufMappingLocks. I
have checked that HEAD also has contention around BufMappingLocks.
As per my analysis till now, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability; we need to work
on reducing contention around BufMappingLocks as well.
Details of patch
------------------------
1. Changed bgwriter to move buffers (having a usage_count of zero) to the
freelist based on a threshold (high_watermark), and to decrement the usage
count if usage_count is greater than zero.
2. StrategyGetBuffer() wakes up the bgwriter when the number of buffers on
the freelist drops below low_watermark. Currently I am using hard-coded
values for the watermarks; we can choose to make them configurable later if
required. (A rough stand-alone sketch of this interaction appears after
this list.)
3. The work done to get a buffer from the freelist is done under a
spinlock, while the clock sweep still runs under BufFreelistLock.
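To make the intended interaction a bit more concrete, below is a rough
stand-alone sketch of the watermark logic. The watermark values, the
variable names and the stubbed-out latch are illustrative only; the
attached patch does this on StrategyControl under a spinlock and uses the
bgwriter's procLatch.

#include <stdio.h>
#include <stdbool.h>

#define FREELIST_LOW_WATERMARK   200    /* backend wakes bgwriter below this */
#define FREELIST_HIGH_WATERMARK 2000    /* bgwriter refills up to this */

static int  num_free_list_buffers = FREELIST_HIGH_WATERMARK;
static bool bgwriter_latch_set = false; /* stands in for SetLatch/ResetLatch */

/* Backend side: called from StrategyGetBuffer() before using the freelist. */
static void
backend_alloc_one_buffer(void)
{
    if (num_free_list_buffers < FREELIST_LOW_WATERMARK)
        bgwriter_latch_set = true;      /* SetLatch(freelistLatch) */

    if (num_free_list_buffers > 0)
        num_free_list_buffers--;        /* pop a buffer from the freelist */
    /* else: fall back to running the clock sweep (not modelled here) */
}

/* Bgwriter side: runs whenever its latch is set; refills to the high mark. */
static void
bgwriter_fill_freelist(void)
{
    if (!bgwriter_latch_set)
        return;
    bgwriter_latch_set = false;         /* ResetLatch() */
    while (num_free_list_buffers < FREELIST_HIGH_WATERMARK)
        num_free_list_buffers++;        /* find a reusable buffer, push it */
}

int
main(void)
{
    int     i;

    for (i = 0; i < 5000; i++)
    {
        backend_alloc_one_buffer();
        bgwriter_fill_freelist();
    }
    printf("buffers on freelist: %d\n", num_free_list_buffers);
    return 0;
}

The point of this shape is that backends never wait for the bgwriter; they
only nudge it via the latch, and they fall back to the clock sweep only
when the freelist is genuinely empty.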
This is still a WIP patch and some of the changes are just a prototype to
check the idea; for example, I have hacked the bgwriter code so that it
continuously fills the freelist until it has put enough buffers on it to
reach high_watermark, and I have commented out some parts of the previous
code.
Thoughts?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
scalable_buffer_eviction_v1.patch (application/octet-stream)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..f2804f1 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -252,12 +252,19 @@ BackgroundWriterMain(void)
prev_hibernate = false;
/*
+ * Initialize the freelist latch. ToDo, this needs to be done under
+ * spinlock which will be used to protect freelist.
+ */
+
+ StrategyInitFreeListLatch(&MyProc->procLatch);
+
+ /*
* Loop forever
*/
for (;;)
{
- bool can_hibernate;
- int rc;
+ bool can_hibernate = 0;
+ int rc = 0;
/* Clear any already-pending wakeups */
ResetLatch(&MyProc->procLatch);
@@ -281,7 +288,7 @@ BackgroundWriterMain(void)
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync();
+ /*can_hibernate = BgBufferSync(); */
/*
* Send off activity statistics to the stats collector
@@ -339,6 +346,14 @@ BackgroundWriterMain(void)
}
/*
+ * Sleep untill signalled by backend.
+ */
+ WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
+
+ BgBufferSyncAndMoveBuffersToFreelist();
+
+
+ /*
* Sleep until we are signaled or BgWriterDelay has elapsed.
*
* Note: the feedback control loop in BgBufferSync() expects that we
@@ -348,9 +363,9 @@ BackgroundWriterMain(void)
* down with latch events that are likely to happen frequently during
* normal operation.
*/
- rc = WaitLatch(&MyProc->procLatch,
+ /*rc = WaitLatch(&MyProc->procLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- BgWriterDelay /* ms */ );
+ BgWriterDelay ms );*/
/*
* If no latch event and BgBufferSync says nothing's happening, extend
@@ -370,17 +385,17 @@ BackgroundWriterMain(void)
* for two consecutive cycles. Also, we mitigate any possible
* consequences of a missed wakeup by not hibernating forever.
*/
- if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
- {
+ /*if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
+ {*/
/* Ask for notification at next buffer allocation */
- StrategyNotifyBgWriter(&MyProc->procLatch);
+ /*StrategyNotifyBgWriter(&MyProc->procLatch);*/
/* Sleep ... */
- rc = WaitLatch(&MyProc->procLatch,
+ /*rc = WaitLatch(&MyProc->procLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- BgWriterDelay * HIBERNATE_FACTOR);
+ BgWriterDelay * HIBERNATE_FACTOR);*/
/* Reset the notification request in case we timed out */
- StrategyNotifyBgWriter(NULL);
- }
+ /*StrategyNotifyBgWriter(NULL);
+ }*/
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..7d4efed 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1635,6 +1635,41 @@ BgBufferSync(void)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+void
+BgBufferSyncAndMoveBuffersToFreelist(void)
+{
+ uint32 next_to_clean;
+ uint32 num_to_free;
+ int num_written;
+ volatile BufferDesc *bufHdr;
+
+ StrategySyncStartAndEnd(&next_to_clean, &num_to_free);
+
+ /* Make sure we can handle the pin inside SyncOneBuffer */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ num_written = 0;
+
+ /* Execute the LRU scan */
+ while (num_to_free > 0)
+ {
+ int buffer_state = SyncOneBuffer(next_to_clean, true);
+
+ bufHdr = &BufferDescriptors[next_to_clean];
+ if (++next_to_clean >= NBuffers)
+ next_to_clean = 0;
+ if (buffer_state & BUF_WRITTEN)
+ ++num_written;
+ if (buffer_state & BUF_REUSABLE)
+ {
+ if (StrategyMoveBufferToFreeListEnd (bufHdr))
+ num_to_free--;
+ }
+ }
+
+ BgWriterStats.m_buf_written_clean += num_written;
+}
+
/*
* SyncOneBuffer -- process a single buffer during syncing.
*
@@ -1673,6 +1708,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
else if (skip_recently_used)
{
/* Caller told us not to write recently-used buffers */
+ if (bufHdr->refcount == 0 && bufHdr->usage_count > 0)
+ bufHdr->usage_count--;
UnlockBufHdr(bufHdr);
return result;
}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..90e3f40 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
int firstFreeBuffer; /* Head of list of unused buffers */
int lastFreeBuffer; /* Tail of list of unused buffers */
+ int numFreeListBuffers; /* number of buffers on freelist */
/*
* NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -42,6 +43,10 @@ typedef struct
uint32 completePasses; /* Complete cycles of the clock sweep */
uint32 numBufferAllocs; /* Buffers allocated since last reset */
+ Latch *freelistLatch; /* Latch to wake bgwriter */
+ /* protects freelist variables (firstFreeBuffer, lastFreeBuffer, numFreeListBuffers, BufferDesc->freeNext)*/
+ slock_t freelist_lck;
+
/*
* Notification latch, or NULL if none. See StrategyNotifyBgWriter.
*/
@@ -112,7 +117,6 @@ volatile BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
{
volatile BufferDesc *buf;
- Latch *bgwriterLatch;
int trycounter;
/*
@@ -129,31 +133,16 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
}
}
- /* Nope, so lock the freelist */
- *lock_held = true;
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
- /*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
- */
- StrategyControl->numBufferAllocs++;
+ *lock_held = false;
/*
- * If bgwriterLatch is set, we need to waken the bgwriter, but we should
- * not do so while holding BufFreelistLock; so release and re-grab. This
- * is annoyingly tedious, but it happens at most once per bgwriter cycle,
- * so the performance hit is minimal.
+ * ideally numFreeListBuffers should get called under freelist
+ * spinlock, however here we need this number for estimating
+ * approximate number of free buffers required on freelist,
+ * so it would be okay, even if numFreeListBuffers is not exact.
*/
- bgwriterLatch = StrategyControl->bgwriterLatch;
- if (bgwriterLatch)
- {
- StrategyControl->bgwriterLatch = NULL;
- LWLockRelease(BufFreelistLock);
- SetLatch(bgwriterLatch);
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
- }
+ if (StrategyControl->numFreeListBuffers < 200)
+ SetLatch(StrategyControl->freelistLatch);
/*
* Try to get a buffer from the freelist. Note that the freeNext fields
@@ -161,34 +150,51 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
* individual buffer spinlocks, so it's OK to manipulate them without
* holding the spinlock.
*/
- while (StrategyControl->firstFreeBuffer >= 0)
+ for(;;)
{
- buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
+ if (StrategyControl->firstFreeBuffer >= 0)
+ {
+ buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+ Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot use
- * it; discard it and retry. (This can only happen if VACUUM put a
- * valid buffer in the freelist and then someone else used it before
- * we got to it. It's probably impossible altogether as of 8.3, but
- * we'd better check anyway.)
- */
- LockBufHdr(buf);
- if (buf->refcount == 0 && buf->usage_count == 0)
+ /* Unconditionally remove buffer from freelist */
+ StrategyControl->firstFreeBuffer = buf->freeNext;
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ --StrategyControl->numFreeListBuffers;
+
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ /*
+ * If the buffer is pinned or has a nonzero usage_count, we cannot use
+ * it; discard it and retry. (This can only happen if VACUUM put a
+ * valid buffer in the freelist and then someone else used it before
+ * we got to it. It's probably impossible altogether as of 8.3, but
+ * we'd better check anyway.)
+ */
+ LockBufHdr(buf);
+ if (buf->refcount == 0 && buf->usage_count == 0)
+ {
+ if (strategy != NULL)
+ AddBufferToRing(strategy, buf);
+ return buf;
+ }
+ UnlockBufHdr(buf);
+ }
+ else
{
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- return buf;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ break;
}
- UnlockBufHdr(buf);
}
/* Nothing on the freelist, so run the "clock sweep" algorithm */
trycounter = NBuffers;
+
+ *lock_held = true;
+ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
for (;;)
{
buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
@@ -196,7 +202,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
if (++StrategyControl->nextVictimBuffer >= NBuffers)
{
StrategyControl->nextVictimBuffer = 0;
- StrategyControl->completePasses++;
+ /*StrategyControl->completePasses++;*/
}
/*
@@ -241,7 +247,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
void
StrategyFreeBuffer(volatile BufferDesc *buf)
{
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
/*
* It is possible that we are told to put something in the freelist that
@@ -253,11 +259,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
if (buf->freeNext < 0)
StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
+ ++StrategyControl->numFreeListBuffers;
}
- LWLockRelease(BufFreelistLock);
+ SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+ bool freed = false;
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+
+ /*
+ * It is possible that we are told to put something in the freelist that
+ * is already in it; don't screw up the list if so.
+ */
+ if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+ {
+ ++StrategyControl->numFreeListBuffers;
+ freed = true;
+ /*
+ * put the buffer on end of list and if list is empty then
+ * assign first and last freebuffer with this buffer id.
+ */
+ buf->freeNext = FREENEXT_END_OF_LIST;
+ if (StrategyControl->firstFreeBuffer < 0)
+ {
+ StrategyControl->firstFreeBuffer = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ return freed;
+ }
+ BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ }
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ return freed;
}
+
/*
* StrategySyncStart -- tell BufferSync where to start syncing
*
@@ -287,6 +332,31 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
return result;
}
+void
+StrategySyncStartAndEnd(uint32 *start, uint32 *end)
+{
+ int curfreebuffers;
+ int reqfreebuffers;
+
+ /*
+ * ideally numFreeListBuffers should get called under
+ * freelist spin lock, however here we need this number for
+ * estimating approximate number of free buffers required
+ * on freelist, so it would be okay, even if numFreeListBuffers is not exact.
+ */
+
+ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ *start = StrategyControl->nextVictimBuffer;
+ curfreebuffers = StrategyControl->numFreeListBuffers;
+ reqfreebuffers = 2000;
+ if (reqfreebuffers > curfreebuffers)
+ *end = reqfreebuffers - curfreebuffers;
+ else
+ *end = 0;
+ LWLockRelease(BufFreelistLock);
+ return;
+}
+
/*
* StrategyNotifyBgWriter -- set or clear allocation notification latch
*
@@ -309,6 +379,19 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
}
+void
+StrategyInitFreeListLatch(Latch *bgwriterLatch)
+{
+ /*
+ * We acquire the BufFreelistLock just to ensure that the store appears
+ * atomic to StrategyGetBuffer. The bgwriter should call this rather
+ * infrequently, so there's no performance penalty from being safe.
+ */
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+ StrategyControl->freelistLatch= bgwriterLatch;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
/*
* StrategyShmemSize
*
@@ -376,6 +459,7 @@ StrategyInitialize(bool init)
*/
StrategyControl->firstFreeBuffer = 0;
StrategyControl->lastFreeBuffer = NBuffers - 1;
+ StrategyControl->numFreeListBuffers = NBuffers;
/* Initialize the clock sweep pointer */
StrategyControl->nextVictimBuffer = 0;
@@ -386,6 +470,8 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwriterLatch = NULL;
+ StrategyControl->freelistLatch = NULL;
+ SpinLockInit(&StrategyControl->freelist_lck);
}
else
Assert(!init);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..05ff723 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -188,11 +188,14 @@ extern BufferDesc *LocalBufferDescriptors;
extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
bool *lock_held);
extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
volatile BufferDesc *buf);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategySyncStartAndEnd(uint32 *start, uint32 *end);
extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitFreeListLatch(Latch *bgwriterLatch);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..b0e5598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
extern bool BgBufferSync(void);
+extern void BgBufferSyncAndMoveBuffersToFreelist(void);
extern void AtProcExit_LocalBuffers(void);
On Thu, May 15, 2014 at 11:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
This data shows that after the patch there is no contention for
BufFreelistLock; rather, there is huge contention around BufMappingLocks. I
have checked that HEAD also has contention around BufMappingLocks.
As per my analysis till now, I think reducing contention around
BufFreelistLock is not sufficient to improve scalability; we need to work
on reducing contention around BufMappingLocks as well.
To reduce the contention around BufMappingLocks, I have tried the patch
with just an increased number of buffer partitions, and it actually shows a
really significant increase in scalability, due to reduced contention
around both BufFreelistLock and BufMappingLocks. The real effect of
reducing contention around BufFreelistLock was hidden because the whole
contention had shifted to BufMappingLocks. I have taken performance data
for both HEAD+increase_buf_part and Patch+increase_buf_part to clearly see
the benefit of reducing contention around BufFreelistLock. This data has
been taken using a pgbench read-only load (SELECT).
Performance Data
-------------------------------
HEAD + 64 = HEAD + (NUM_BUFFER_PARTITIONS(64) + LOG2_NUM_LOCK_PARTITIONS(6))
V1 + 64 = PATCH + (NUM_BUFFER_PARTITIONS(64) + LOG2_NUM_LOCK_PARTITIONS(6))
Similarly, 128 means 128 buffer partitions.
shared_buffers= 8GB
scale factor = 3000
RAM - 64GB
             Thrds (64)    Thrds (128)
HEAD              45562          17128
HEAD + 64         57904          32810
V1 + 64          105557          81011
HEAD + 128        58383          32997
V1 + 128         110705         114544
shared_buffers= 8GB
scale factor = 1000
RAM - 64GB
             Thrds (64)    Thrds (128)
HEAD              92142          31050
HEAD + 64        108120          86367
V1 + 64          117454         123429
HEAD + 128       107762          86902
V1 + 128         123641         124822
Observations
-------------------------
1. There is an increase of up to 5 times in performance for data that can
fit in memory but not in shared buffers.
2. Though there is an increase in performance just from increasing the
number of buffer partitions, it doesn't scale well (especially see the case
where the partitions are increased from 64 to 128). (A back-of-the-envelope
illustration of how the partition count spreads out the lock traffic
follows this list.)
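As a back-of-the-envelope illustration (not part of the patch, and assuming
uniform hashing of buffer tags), the snippet below estimates how many of 64
concurrently-looking-up backends end up sharing their BufMappingLock
partition with at least one other backend, for different values of
NUM_BUFFER_PARTITIONS; the partition itself is just
hashcode % NUM_BUFFER_PARTITIONS, as in buf_internals.h.

#include <stdio.h>

int
main(void)
{
    const int   backends = 64;          /* concurrent buffer lookups */
    const int   partitions[] = {16, 64, 128};
    int         i, j;

    for (i = 0; i < 3; i++)
    {
        double  n = partitions[i];
        double  p_alone = 1.0;  /* P(no other backend picks my partition) */

        for (j = 0; j < backends - 1; j++)
            p_alone *= (n - 1.0) / n;
        printf("%3d partitions: ~%.0f of %d backends share a partition lock\n",
               partitions[i], (1.0 - p_alone) * backends, backends);
    }
    return 0;
}

With 16 partitions essentially every backend collides with somebody, while
with 128 the expected number drops to well under half; real access patterns
are skewed, of course, so this only gives the rough direction of the effect.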
I have verified that contention around BufMappingLocks has reduced by
running the patch with LWLOCK_STATS:
BufFreeListLock
PID 17894 lwlock main 0: shacq 0 exacq 171 blk 27 spindelay 1
BufMappingLocks
PID 17902 lwlock main 38: shacq 12770 exacq 10104 blk 282 spindelay 0
PID 17924 lwlock main 39: shacq 11409 exacq 10257 blk 243 spindelay 0
PID 17929 lwlock main 40: shacq 13120 exacq 10739 blk 239 spindelay 0
PID 17940 lwlock main 41: shacq 11865 exacq 10373 blk 262 spindelay 0
..
..
PID 17831 lwlock main 162: shacq 12706 exacq 10267 blk 199 spindelay 0
PID 17826 lwlock main 163: shacq 11081 exacq 10256 blk 168 spindelay 0
PID 17903 lwlock main 164: shacq 11494 exacq 10375 blk 176 spindelay 0
PID 17899 lwlock main 165: shacq 12043 exacq 10485 blk 216 spindelay 0
We can clearly notice that the *blk* numbers have reduced significantly,
which shows that contention has reduced.
The patch is still only in a shape to prove the merit of the idea; I have
just changed the number of partitions so that if someone wants to verify
the performance for a similar load, it can be done by simply applying the
patch.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
scalable_buffer_eviction_v2.patch (application/octet-stream)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..f2804f1 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -252,12 +252,19 @@ BackgroundWriterMain(void)
prev_hibernate = false;
/*
+ * Initialize the freelist latch. ToDo, this needs to be done under
+ * spinlock which will be used to protect freelist.
+ */
+
+ StrategyInitFreeListLatch(&MyProc->procLatch);
+
+ /*
* Loop forever
*/
for (;;)
{
- bool can_hibernate;
- int rc;
+ bool can_hibernate = 0;
+ int rc = 0;
/* Clear any already-pending wakeups */
ResetLatch(&MyProc->procLatch);
@@ -281,7 +288,7 @@ BackgroundWriterMain(void)
/*
* Do one cycle of dirty-buffer writing.
*/
- can_hibernate = BgBufferSync();
+ /*can_hibernate = BgBufferSync(); */
/*
* Send off activity statistics to the stats collector
@@ -339,6 +346,14 @@ BackgroundWriterMain(void)
}
/*
+ * Sleep untill signalled by backend.
+ */
+ WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1);
+
+ BgBufferSyncAndMoveBuffersToFreelist();
+
+
+ /*
* Sleep until we are signaled or BgWriterDelay has elapsed.
*
* Note: the feedback control loop in BgBufferSync() expects that we
@@ -348,9 +363,9 @@ BackgroundWriterMain(void)
* down with latch events that are likely to happen frequently during
* normal operation.
*/
- rc = WaitLatch(&MyProc->procLatch,
+ /*rc = WaitLatch(&MyProc->procLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- BgWriterDelay /* ms */ );
+ BgWriterDelay ms );*/
/*
* If no latch event and BgBufferSync says nothing's happening, extend
@@ -370,17 +385,17 @@ BackgroundWriterMain(void)
* for two consecutive cycles. Also, we mitigate any possible
* consequences of a missed wakeup by not hibernating forever.
*/
- if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
- {
+ /*if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
+ {*/
/* Ask for notification at next buffer allocation */
- StrategyNotifyBgWriter(&MyProc->procLatch);
+ /*StrategyNotifyBgWriter(&MyProc->procLatch);*/
/* Sleep ... */
- rc = WaitLatch(&MyProc->procLatch,
+ /*rc = WaitLatch(&MyProc->procLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- BgWriterDelay * HIBERNATE_FACTOR);
+ BgWriterDelay * HIBERNATE_FACTOR);*/
/* Reset the notification request in case we timed out */
- StrategyNotifyBgWriter(NULL);
- }
+ /*StrategyNotifyBgWriter(NULL);
+ }*/
/*
* Emergency bailout if postmaster has died. This is to avoid the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..7d4efed 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1635,6 +1635,41 @@ BgBufferSync(void)
return (bufs_to_lap == 0 && recent_alloc == 0);
}
+void
+BgBufferSyncAndMoveBuffersToFreelist(void)
+{
+ uint32 next_to_clean;
+ uint32 num_to_free;
+ int num_written;
+ volatile BufferDesc *bufHdr;
+
+ StrategySyncStartAndEnd(&next_to_clean, &num_to_free);
+
+ /* Make sure we can handle the pin inside SyncOneBuffer */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ num_written = 0;
+
+ /* Execute the LRU scan */
+ while (num_to_free > 0)
+ {
+ int buffer_state = SyncOneBuffer(next_to_clean, true);
+
+ bufHdr = &BufferDescriptors[next_to_clean];
+ if (++next_to_clean >= NBuffers)
+ next_to_clean = 0;
+ if (buffer_state & BUF_WRITTEN)
+ ++num_written;
+ if (buffer_state & BUF_REUSABLE)
+ {
+ if (StrategyMoveBufferToFreeListEnd (bufHdr))
+ num_to_free--;
+ }
+ }
+
+ BgWriterStats.m_buf_written_clean += num_written;
+}
+
/*
* SyncOneBuffer -- process a single buffer during syncing.
*
@@ -1673,6 +1708,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
else if (skip_recently_used)
{
/* Caller told us not to write recently-used buffers */
+ if (bufHdr->refcount == 0 && bufHdr->usage_count > 0)
+ bufHdr->usage_count--;
UnlockBufHdr(bufHdr);
return result;
}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..90e3f40 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
int firstFreeBuffer; /* Head of list of unused buffers */
int lastFreeBuffer; /* Tail of list of unused buffers */
+ int numFreeListBuffers; /* number of buffers on freelist */
/*
* NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -42,6 +43,10 @@ typedef struct
uint32 completePasses; /* Complete cycles of the clock sweep */
uint32 numBufferAllocs; /* Buffers allocated since last reset */
+ Latch *freelistLatch; /* Latch to wake bgwriter */
+ /* protects freelist variables (firstFreeBuffer, lastFreeBuffer, numFreeListBuffers, BufferDesc->freeNext)*/
+ slock_t freelist_lck;
+
/*
* Notification latch, or NULL if none. See StrategyNotifyBgWriter.
*/
@@ -112,7 +117,6 @@ volatile BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
{
volatile BufferDesc *buf;
- Latch *bgwriterLatch;
int trycounter;
/*
@@ -129,31 +133,16 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
}
}
- /* Nope, so lock the freelist */
- *lock_held = true;
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
- /*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
- */
- StrategyControl->numBufferAllocs++;
+ *lock_held = false;
/*
- * If bgwriterLatch is set, we need to waken the bgwriter, but we should
- * not do so while holding BufFreelistLock; so release and re-grab. This
- * is annoyingly tedious, but it happens at most once per bgwriter cycle,
- * so the performance hit is minimal.
+ * ideally numFreeListBuffers should get called under freelist
+ * spinlock, however here we need this number for estimating
+ * approximate number of free buffers required on freelist,
+ * so it would be okay, even if numFreeListBuffers is not exact.
*/
- bgwriterLatch = StrategyControl->bgwriterLatch;
- if (bgwriterLatch)
- {
- StrategyControl->bgwriterLatch = NULL;
- LWLockRelease(BufFreelistLock);
- SetLatch(bgwriterLatch);
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
- }
+ if (StrategyControl->numFreeListBuffers < 200)
+ SetLatch(StrategyControl->freelistLatch);
/*
* Try to get a buffer from the freelist. Note that the freeNext fields
@@ -161,34 +150,51 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
* individual buffer spinlocks, so it's OK to manipulate them without
* holding the spinlock.
*/
- while (StrategyControl->firstFreeBuffer >= 0)
+ for(;;)
{
- buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
+ if (StrategyControl->firstFreeBuffer >= 0)
+ {
+ buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+ Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot use
- * it; discard it and retry. (This can only happen if VACUUM put a
- * valid buffer in the freelist and then someone else used it before
- * we got to it. It's probably impossible altogether as of 8.3, but
- * we'd better check anyway.)
- */
- LockBufHdr(buf);
- if (buf->refcount == 0 && buf->usage_count == 0)
+ /* Unconditionally remove buffer from freelist */
+ StrategyControl->firstFreeBuffer = buf->freeNext;
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ --StrategyControl->numFreeListBuffers;
+
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ /*
+ * If the buffer is pinned or has a nonzero usage_count, we cannot use
+ * it; discard it and retry. (This can only happen if VACUUM put a
+ * valid buffer in the freelist and then someone else used it before
+ * we got to it. It's probably impossible altogether as of 8.3, but
+ * we'd better check anyway.)
+ */
+ LockBufHdr(buf);
+ if (buf->refcount == 0 && buf->usage_count == 0)
+ {
+ if (strategy != NULL)
+ AddBufferToRing(strategy, buf);
+ return buf;
+ }
+ UnlockBufHdr(buf);
+ }
+ else
{
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- return buf;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ break;
}
- UnlockBufHdr(buf);
}
/* Nothing on the freelist, so run the "clock sweep" algorithm */
trycounter = NBuffers;
+
+ *lock_held = true;
+ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
for (;;)
{
buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
@@ -196,7 +202,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
if (++StrategyControl->nextVictimBuffer >= NBuffers)
{
StrategyControl->nextVictimBuffer = 0;
- StrategyControl->completePasses++;
+ /*StrategyControl->completePasses++;*/
}
/*
@@ -241,7 +247,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
void
StrategyFreeBuffer(volatile BufferDesc *buf)
{
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
/*
* It is possible that we are told to put something in the freelist that
@@ -253,11 +259,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
if (buf->freeNext < 0)
StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
+ ++StrategyControl->numFreeListBuffers;
}
- LWLockRelease(BufFreelistLock);
+ SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+ bool freed = false;
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+
+ /*
+ * It is possible that we are told to put something in the freelist that
+ * is already in it; don't screw up the list if so.
+ */
+ if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+ {
+ ++StrategyControl->numFreeListBuffers;
+ freed = true;
+ /*
+ * put the buffer on end of list and if list is empty then
+ * assign first and last freebuffer with this buffer id.
+ */
+ buf->freeNext = FREENEXT_END_OF_LIST;
+ if (StrategyControl->firstFreeBuffer < 0)
+ {
+ StrategyControl->firstFreeBuffer = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ return freed;
+ }
+ BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ }
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ return freed;
}
+
/*
* StrategySyncStart -- tell BufferSync where to start syncing
*
@@ -287,6 +332,31 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
return result;
}
+void
+StrategySyncStartAndEnd(uint32 *start, uint32 *end)
+{
+ int curfreebuffers;
+ int reqfreebuffers;
+
+ /*
+ * ideally numFreeListBuffers should get called under
+ * freelist spin lock, however here we need this number for
+ * estimating approximate number of free buffers required
+ * on freelist, so it would be okay, even if numFreeListBuffers is not exact.
+ */
+
+ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ *start = StrategyControl->nextVictimBuffer;
+ curfreebuffers = StrategyControl->numFreeListBuffers;
+ reqfreebuffers = 2000;
+ if (reqfreebuffers > curfreebuffers)
+ *end = reqfreebuffers - curfreebuffers;
+ else
+ *end = 0;
+ LWLockRelease(BufFreelistLock);
+ return;
+}
+
/*
* StrategyNotifyBgWriter -- set or clear allocation notification latch
*
@@ -309,6 +379,19 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
}
+void
+StrategyInitFreeListLatch(Latch *bgwriterLatch)
+{
+ /*
+ * We acquire the BufFreelistLock just to ensure that the store appears
+ * atomic to StrategyGetBuffer. The bgwriter should call this rather
+ * infrequently, so there's no performance penalty from being safe.
+ */
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+ StrategyControl->freelistLatch= bgwriterLatch;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
/*
* StrategyShmemSize
*
@@ -376,6 +459,7 @@ StrategyInitialize(bool init)
*/
StrategyControl->firstFreeBuffer = 0;
StrategyControl->lastFreeBuffer = NBuffers - 1;
+ StrategyControl->numFreeListBuffers = NBuffers;
/* Initialize the clock sweep pointer */
StrategyControl->nextVictimBuffer = 0;
@@ -386,6 +470,8 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwriterLatch = NULL;
+ StrategyControl->freelistLatch = NULL;
+ SpinLockInit(&StrategyControl->freelist_lck);
}
else
Assert(!init);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..05ff723 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -188,11 +188,14 @@ extern BufferDesc *LocalBufferDescriptors;
extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
bool *lock_held);
extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
volatile BufferDesc *buf);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategySyncStartAndEnd(uint32 *start, uint32 *end);
extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitFreeListLatch(Latch *bgwriterLatch);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..b0e5598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
extern bool BgBufferSync(void);
+extern void BgBufferSyncAndMoveBuffersToFreelist(void);
extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 175fae3..fe86e07 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -136,10 +136,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
*/
/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 16
+#define NUM_BUFFER_PARTITIONS 128
/* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS 4
+#define LOG2_NUM_LOCK_PARTITIONS 7
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
/* Number of partitions the shared predicate lock tables are divided into */
On Fri, May 16, 2014 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
             Thrds (64)    Thrds (128)
HEAD              45562          17128
HEAD + 64         57904          32810
V1 + 64          105557          81011
HEAD + 128        58383          32997
V1 + 128         110705         114544
I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on its own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.
In terms of ameliorating contention on the buffer mapping locks, I think it
would be better to replace the whole buffer mapping table with something
different. I started working on that almost 2 years ago, building a
hash-table that can be read without requiring any locks and written with,
well, less locking than what we have right now:
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash
I never got quite as far as trying to hook that up to the buffer mapping
machinery, but maybe that would be worth doing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

             Thrds (64)    Thrds (128)
HEAD              45562          17128
HEAD + 64         57904          32810
V1 + 64          105557          81011
HEAD + 128        58383          32997
V1 + 128         110705         114544

shared_buffers= 8GB
scale factor = 1000
RAM - 64GB

             Thrds (64)    Thrds (128)
HEAD              92142          31050
HEAD + 64        108120          86367
V1 + 64          117454         123429
HEAD + 128       107762          86902
V1 + 128         123641         124822
I'm having a little trouble following this. These figures are transactions
per second for a 300 second pgbench tpc-b run? What does "Thrds" denote?
--
Peter Geoghegan
On Sat, May 17, 2014 at 6:29 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, May 16, 2014 at 7:51 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
shared_buffers= 8GB
scale factor = 3000
RAM - 64GB

I'm having a little trouble following this. These figures are transactions
per second for a 300 second pgbench tpc-b run?
Yes, the figures are tps for a 300 second run.
It is for select-only transactions.
What does "Thrds" denote?
It denotes the number of threads (-j in the pgbench run).
I have used below statements to take data
./pgbench -c 64 -j 64 -T 300 -S postgres
./pgbench -c 128 -j 128 -T 300 -S postgres
The reason for posting the numbers for 64/128 threads is that the
concurrency bottleneck mainly shows up when the number of connections is
higher than the number of CPU cores, and I am using a 16-core, 64 hardware
thread machine.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on its own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.
In terms of ameliorating contention on the buffer mapping locks, I think
it would be better to replace the whole buffer mapping table with something
different.
Is there anything bad, except perhaps the increase in the number of
LWLocks, about scaling the hash partitions with respect to shared buffers,
either by auto-tuning or by having a configuration knob? I understand that
it would be a bit difficult for users to estimate the correct value of such
a parameter; we could provide information about its usage in the docs, such
that if the user increases shared buffers to 'X' times (say 20 times) the
default value (128MB), then they should consider increasing such partitions
(always to a power of 2), or we could do something similar to the above
internally in the code.
I agree that even with a reasonably good estimate of the number of
partitions w.r.t. shared buffers, we might not be able to eliminate the
contention around BufMappingLocks, but I think the scalability we get by
doing that is not bad either. (A rough sketch of the kind of auto-tuning I
have in mind follows.)
I started working on that almost 2 years ago, building a hash-table that
can be read without requiring any locks and written with, well, less
locking than what we have right now:
I have still not read the complete code, but just by going through the
initial file header, it seems to me that it will be much better than the
current implementation in terms of concurrency. By the way, can such an
implementation be extended to improve the scalability of hash indexes as
well?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 17, 2014 at 6:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I haven't actually reviewed the code, but this sort of thing seems like
good evidence that we need your patch, or something like it. The fact that
the patch produces little performance improvement on its own (though it
does produce some) shouldn't be held against it - the fact that the
contention shifts elsewhere when the first bottleneck is removed is not
your patch's fault.
I have improved the patch by making the following changes:
a. Improved the bgwriter logic to log the xl_running_xacts info and removed
the hibernate logic, as bgwriter will now work only when there is a
scarcity of buffers in the free list. The basic idea is that when the
number of buffers on the freelist drops below the low threshold, the
allocating backend sets the latch; bgwriter wakes up and begins adding
buffers to the freelist until it reaches the high threshold, and then goes
back to sleep.
b. A new stat for the number of buffers on the freelist has been added.
Some old ones like maxwritten_clean can be removed, as the new logic for
syncing buffers and moving them to the free list doesn't use them; however,
I think it's better to remove them once the new logic is accepted. Also
added some new logs for info related to the free list under BGW_DEBUG.
c. Used the already existing bgwriterLatch in BufferStrategyControl to wake
bgwriter when the number of buffers in the freelist drops below the
threshold.
d. Autotuned the low and high thresholds of the freelist for various
configurations. Generally, if we keep a small number (200~2000) of buffers
always available on the freelist, that appears to be sufficient even for
high shared_buffers settings like 15GB; however, when the value of
shared_buffers is smaller, we need a much smaller number. I think we can
provide these as config knobs for the user as well, but for now, based on
LWLOCK_STATS results, I have chosen some hard-coded values for the low and
high thresholds of the freelist. The values have been decided based on the
total number of shared buffers: basically I have divided them into 5
categories (16~100, 100~1000, 1000~10000, 10000~100000, 100000 and above)
and then ran tests (read-only pgbench) for various configurations falling
under these categories. The reason for keeping fewer categories for larger
shared buffers is that a small number (200~2000) of buffers available on
the free list seems to be sufficient for quite high loads, whereas as the
total number of shared buffers decreases we need to be more careful: if we
keep the number too low it will lead to more clock sweeps by backends
(which means freelist lock contention), and if we keep the number higher
bgwriter will evict many useful buffers. Results based on LWLOCK_STATS are
at the end of this mail. (A rough sketch of this category-based selection
appears after this list.)
e. One reason why I think the number of buf-partitions is hard-coded to 16
is that the minimum number of shared buffers allowed is 16 (128kB).
However, there is handling in the code (in function init_htab()) which
ensures that even if the number of partitions is more than the number of
shared buffers, it is handled safely.
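A rough sketch of the category-based selection described in point (d) is
below. The category boundaries follow the mail (16~100, 100~1000,
1000~10000, 10000~100000, 100000 and above); the low/high values for the
three largest categories are loosely based on numbers mentioned elsewhere
in this mail, while the values for the two smallest categories are
placeholders, not the patch's exact table.

#include <stdio.h>

typedef struct
{
    int     min_nbuffers;   /* category applies from this NBuffers upward */
    int     low_watermark;  /* backend wakes bgwriter below this */
    int     high_watermark; /* bgwriter refills the freelist up to this */
} FreelistThresholds;

static const FreelistThresholds categories[] = {
    {100000, 200, 2000},
    {10000, 100, 1000},
    {1000, 50, 200},
    {100, 25, 100},
    {16, 5, 10},
};

static FreelistThresholds
choose_thresholds(int nbuffers)
{
    int     i;

    for (i = 0; i < 5; i++)
        if (nbuffers >= categories[i].min_nbuffers)
            return categories[i];
    return categories[4];
}

int
main(void)
{
    int     settings[] = {16, 9600, 1048576};   /* 128kB, 75MB, 8GB */
    int     i;

    for (i = 0; i < 3; i++)
    {
        FreelistThresholds t = choose_thresholds(settings[i]);

        printf("NBuffers = %7d -> low = %d, high = %d\n",
               settings[i], t.low_watermark, t.high_watermark);
    }
    return 0;
}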
I have checked the bgwriter CPU usage with and without the patch for
various configurations, and the observation is that for most of the loads
bgwriter's CPU usage after the patch is between 8~20%, while in HEAD it is
0~2%. It shows that with the patch, when shared buffers are in heavy use by
backends, bgwriter is constantly doing work to ease the work of the
backends. Detailed data is provided later in the mail.
Performance Data:
-------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
Client Count (tps)       8       16       32       64       128
Head                 26220    48686    70779    45232     17310
Patch                26402    50726    75574   111468    114521
Data has been taken using the script (pert_buff_mgmt.sh) attached with this
mail. This is read-only pgbench data with different numbers of client
connections; all the numbers are in tps. Each value is the median of three
5-minute pgbench read-only runs. Please find the detailed data for the
three runs in the attached OpenOffice document
(perf_read_scalability_data_v3.ods).
This data clearly shows that the patch has improved performance by up to
5~6 times.
Results of BGwriter CPU usage:
--------------------------------------------------
Here sc is scale factor and sb is shared_buffers; the data is for read-only
pgbench runs.

sc - 3000, sb - 8GB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU usage - 0~2.3%
Patch v3
CPU usage - 8.6%

sc - 100, sb - 128MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU usage - 1~2%
tps = 36199.047132
Patch v3
CPU usage - 12~13%
tps = 109182.681827

sc - 50, sb - 75MB
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
CPU usage - 0.7~2%
tps = 37760.575128
Patch v3
CPU usage - 20~22%
tps = 106310.744198

sc - 100, sb - 128kb
./pgbench -c 16 -j 16 -S -T 300 postgres
(need to change pgbench for this)
HEAD
CPU usage - 0~0.3%
tps = 40979.529254
Patch v3
CPU usage - 35~40%
tps = 42956.785618
Results of LWLOCK_STATS based on low-high threshold values of freelist:
--------------------------------------------------------------------------------------------------------------
In the results, the values of exacq and blk show the contention on the
freelist lock. sc is scale factor and sb is the number of shared_buffers.
The results below show that for all but one configuration (1MB) the
contention around the BufFreelistLock is reduced significantly. For the 1MB
case as well, the exacq count is reduced, which shows that it has performed
the clock sweep fewer times.
sc - 3000, sb - 15GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 4406 lwlock main 0: shacq 0 exacq 84482 blk 5139 spindelay 62
Patch v_3
PID 4864 lwlock main 0: shacq 0 exacq 34 blk 1 spindelay 0
sc - 3000, sb - 8GB --(sb > 100000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 24124 lwlock main 0: shacq 0 exacq 285155 blk 33910 spindelay 548
Patch v_3
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0
sc - 100, sb - 768MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 9144 lwlock main 0: shacq 0 exacq 284636 blk 34091 spindelay 555
Patch v-3 (lw=100,hg=1000)
PID 9428 lwlock main 0: shacq 0 exacq 306 blk 59 spindelay 0
sc - 100, sb - 128MB --(sb > 10000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 5405 lwlock main 0: shacq 0 exacq 285449 blk 32345 spindelay 714
Patch v-3
PID 8625 lwlock main 0: shacq 0 exacq 740 blk 178 spindelay 0
sc - 50, sb - 75MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 12681 lwlock main 0: shacq 0 exacq 289347 blk 34064 spindelay 773
Patch v3
PID 12800 lwlock main 0: shacq 0 exacq 76287 blk 15183 spindelay 28
sc - 50, sb - 10MB --(sb > 1000)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 10283 lwlock main 0: shacq 0 exacq 287500 blk 32177 spindelay 864
Patch v3 (for > 1000, lw = 50 hg =200)
PID 11629 lwlock main 0: shacq 0 exacq 60139 blk 12978 spindelay 40
sc - 1, sb - 7MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 47127 lwlock main 0: shacq 0 exacq 289462 blk 37057 spindelay 119
Patch v3
PID 47283 lwlock main 0: shacq 0 exacq 9507 blk 1656 spindelay 0
sc - 1, sb - 1MB --(sb > 100)
./pgbench -c 64 -j 64 -S -T 300 postgres
HEAD
PID 43215 lwlock main 0: shacq 0 exacq 301384 blk 36740 spindelay 902
Patch v3
PID 46542 lwlock main 0: shacq 0 exacq 197231 blk 37532 spindelay 294
sc - 100, sb - 128kb --(sb > 16)
./pgbench -c 16 -j 16 -S -T 300 postgres
(For this, I needed to reduce the value of naccounts to 2500; otherwise it
was always giving "no unpinned buffers available".)
HEAD
PID 49751 lwlock main 0: shacq 0 exacq 1821276 blk 130119 spindelay 7
Patch v3
PID 50768 lwlock main 0: shacq 0 exacq 382610 blk 46543 spindelay 1
More data points and work:
a. I have yet to take data after merging this with the scalable lwlock
patch from Andres
(https://commitfest.postgresql.org/action/patch_view?id=1313); there are
many conflicts with that patch, so I am waiting for an updated version.
b. Read-only data for more configurations.
c. Data for write workloads (tpc-b of pgbench, bulk insert (COPY)).
d. Update docs and remove unused code.
Suggestions?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
scalable_buffer_eviction_v3.patch (application/octet-stream)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..ae4237d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -67,12 +67,6 @@
int BgWriterDelay = 200;
/*
- * Multiplier to apply to BgWriterDelay when we decide to hibernate.
- * (Perhaps this needs to be configurable?)
- */
-#define HIBERNATE_FACTOR 50
-
-/*
* Interval in which standby snapshots are logged into the WAL stream, in
* milliseconds.
*/
@@ -111,7 +105,6 @@ BackgroundWriterMain(void)
{
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
- bool prev_hibernate;
/*
* If possible, make this process a group leader, so that the postmaster
@@ -246,19 +239,15 @@ BackgroundWriterMain(void)
*/
PG_SETMASK(&UnBlockSig);
- /*
- * Reset hibernation state after any error.
- */
- prev_hibernate = false;
+ /* Initialize the freelist latch. */
+ StrategyInitBgWriterLatch(&MyProc->procLatch);
/*
* Loop forever
*/
for (;;)
{
- bool can_hibernate;
int rc;
-
/* Clear any already-pending wakeups */
ResetLatch(&MyProc->procLatch);
@@ -279,9 +268,25 @@ BackgroundWriterMain(void)
}
/*
- * Do one cycle of dirty-buffer writing.
+ * Sleep untill signalled by backend or LOG_SNAPSHOT_INTERVAL_MS has
+ * elapsed.
+ *
+ * Backend will signal bgwriter when the number of buffers in
+ * freelist fall below than low threshhold of freelist. We need
+ * to wake bgwriter after LOG_SNAPSHOT_INTERVAL_MS to ensure that
+ * it can log information about xl_running_xacts.
*/
- can_hibernate = BgBufferSync();
+ if (XLogStandbyInfoActive() && !RecoveryInProgress())
+ rc = WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ LOG_SNAPSHOT_INTERVAL_MS);
+ else
+ rc = WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH,
+ -1);
+
+ if (rc & WL_LATCH_SET)
+ BgBufferSyncAndMoveBuffersToFreelist();
/*
* Send off activity statistics to the stats collector
@@ -318,7 +323,9 @@ BackgroundWriterMain(void)
* Checkpointer, when active, is barely ever in its mainloop and thus
* makes it hard to log regularly.
*/
- if (XLogStandbyInfoActive() && !RecoveryInProgress())
+ if ((rc & WL_TIMEOUT || rc & WL_LATCH_SET) &&
+ XLogStandbyInfoActive() &&
+ !RecoveryInProgress())
{
TimestampTz timeout = 0;
TimestampTz now = GetCurrentTimestamp();
@@ -339,57 +346,11 @@ BackgroundWriterMain(void)
}
/*
- * Sleep until we are signaled or BgWriterDelay has elapsed.
- *
- * Note: the feedback control loop in BgBufferSync() expects that we
- * will call it every BgWriterDelay msec. While it's not critical for
- * correctness that that be exact, the feedback loop might misbehave
- * if we stray too far from that. Hence, avoid loading this process
- * down with latch events that are likely to happen frequently during
- * normal operation.
- */
- rc = WaitLatch(&MyProc->procLatch,
- WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- BgWriterDelay /* ms */ );
-
- /*
- * If no latch event and BgBufferSync says nothing's happening, extend
- * the sleep in "hibernation" mode, where we sleep for much longer
- * than bgwriter_delay says. Fewer wakeups save electricity. When a
- * backend starts using buffers again, it will wake us up by setting
- * our latch. Because the extra sleep will persist only as long as no
- * buffer allocations happen, this should not distort the behavior of
- * BgBufferSync's control loop too badly; essentially, it will think
- * that the system-wide idle interval didn't exist.
- *
- * There is a race condition here, in that a backend might allocate a
- * buffer between the time BgBufferSync saw the alloc count as zero
- * and the time we call StrategyNotifyBgWriter. While it's not
- * critical that we not hibernate anyway, we try to reduce the odds of
- * that by only hibernating when BgBufferSync says nothing's happening
- * for two consecutive cycles. Also, we mitigate any possible
- * consequences of a missed wakeup by not hibernating forever.
- */
- if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
- {
- /* Ask for notification at next buffer allocation */
- StrategyNotifyBgWriter(&MyProc->procLatch);
- /* Sleep ... */
- rc = WaitLatch(&MyProc->procLatch,
- WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- BgWriterDelay * HIBERNATE_FACTOR);
- /* Reset the notification request in case we timed out */
- StrategyNotifyBgWriter(NULL);
- }
-
- /*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
*/
if (rc & WL_POSTMASTER_DEATH)
exit(1);
-
- prev_hibernate = can_hibernate;
}
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f864816..396ac4d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5028,6 +5028,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
globalStats.buf_written_backend += msg->m_buf_written_backend;
globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
globalStats.buf_alloc += msg->m_buf_alloc;
+ globalStats.buf_freelist += msg->m_buf_freelist;
}
/* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c070278..c052914 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1636,6 +1636,65 @@ BgBufferSync(void)
}
/*
+ * Write out some dirty buffers in the pool and maintain enough
+ * buffers on the freelist (equal to the high threshold for the
+ * freelist), so that backends don't need to perform the clock sweep
+ * often.
+ *
+ * This is called by the background writer process when the number
+ * of buffers on the freelist falls below the low threshold of the freelist.
+ */
+void
+BgBufferSyncAndMoveBuffersToFreelist(void)
+{
+ uint32 next_to_clean;
+ uint32 num_to_free;
+ uint32 tmp_num_to_free;
+ uint32 recent_alloc;
+ int num_written;
+ int num_freelist;
+ volatile BufferDesc *bufHdr;
+
+ num_freelist = StrategySyncStartAndEnd(&next_to_clean,
+ &num_to_free,
+ &recent_alloc);
+
+ /* Report buffer alloc and buffer freelist counts to pgstat */
+ BgWriterStats.m_buf_alloc += recent_alloc;
+ BgWriterStats.m_buf_freelist += num_freelist;
+
+ /* Make sure we can handle the pin inside SyncOneBuffer */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ num_written = 0;
+ tmp_num_to_free = num_to_free;
+
+ /* Execute the LRU scan */
+ while (tmp_num_to_free > 0)
+ {
+ int buffer_state = SyncOneBuffer(next_to_clean, true);
+
+ bufHdr = &BufferDescriptors[next_to_clean];
+ if (++next_to_clean >= NBuffers)
+ next_to_clean = 0;
+ if (buffer_state & BUF_WRITTEN)
+ ++num_written;
+ if (buffer_state & BUF_REUSABLE)
+ {
+ if (StrategyMoveBufferToFreeListEnd (bufHdr))
+ tmp_num_to_free--;
+ }
+ }
+
+ BgWriterStats.m_buf_written_clean += num_written;
+
+#ifdef BGW_DEBUG
+ elog(LOG, "bgwriter: recent_alloc=%u num_freelist=%u wrote=%d num_freed=%u",
+ recent_alloc, num_freelist, num_written, num_to_free);
+#endif
+}
+
+/*
* SyncOneBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
@@ -1672,7 +1731,13 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
result |= BUF_REUSABLE;
else if (skip_recently_used)
{
- /* Caller told us not to write recently-used buffers */
+ /*
+ * Caller told us not to write recently-used buffers; instead,
+ * reduce the usage count so that reusable buffers can be
+ * found in subsequent cycles.
+ */
+ if (bufHdr->refcount == 0 && bufHdr->usage_count > 0)
+ bufHdr->usage_count--;
UnlockBufHdr(bufHdr);
return result;
}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..3edc0e9 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
int firstFreeBuffer; /* Head of list of unused buffers */
int lastFreeBuffer; /* Tail of list of unused buffers */
+ int numFreeListBuffers; /* number of buffers on freelist */
/*
* NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -43,7 +44,13 @@ typedef struct
uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
- * Notification latch, or NULL if none. See StrategyNotifyBgWriter.
+ * protects freelist variables (firstFreeBuffer, lastFreeBuffer,
+ * numFreeListBuffers, BufferDesc->freeNext).
+ */
+ slock_t freelist_lck;
+
+ /*
+ * Latch to wake bgwriter.
*/
Latch *bgwriterLatch;
} BufferStrategyControl;
@@ -112,7 +119,6 @@ volatile BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
{
volatile BufferDesc *buf;
- Latch *bgwriterLatch;
int trycounter;
/*
@@ -129,66 +135,78 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
}
}
- /* Nope, so lock the freelist */
- *lock_held = true;
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We count buffer allocation requests so that the bgwriter can know
+ * the rate of buffer consumption and report it as stats. Note that
+ * buffers recycled by a strategy object are intentionally not counted
+ * here.
*/
StrategyControl->numBufferAllocs++;
+ *lock_held = false;
/*
- * If bgwriterLatch is set, we need to waken the bgwriter, but we should
- * not do so while holding BufFreelistLock; so release and re-grab. This
- * is annoyingly tedious, but it happens at most once per bgwriter cycle,
- * so the performance hit is minimal.
+ * Ideally numFreeListBuffers should be read under the freelist spinlock;
+ * however, here we need this number only to estimate the approximate number
+ * of free buffers required on the freelist, so it should not be a problem
+ * even if numFreeListBuffers is not exact. bgwriterLatch is initialized in
+ * an early phase of BgWriter startup, however we still check it before use
+ * to avoid any problem in case we reach here before its initialization.
*/
- bgwriterLatch = StrategyControl->bgwriterLatch;
- if (bgwriterLatch)
- {
- StrategyControl->bgwriterLatch = NULL;
- LWLockRelease(BufFreelistLock);
- SetLatch(bgwriterLatch);
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
- }
+ if (StrategyControl->numFreeListBuffers < freelistLowThreshold &&
+ StrategyControl->bgwriterLatch)
+ SetLatch(StrategyControl->bgwriterLatch);
/*
* Try to get a buffer from the freelist. Note that the freeNext fields
- * are considered to be protected by the BufFreelistLock not the
+ * are considered to be protected by the freelist_lck not the
* individual buffer spinlocks, so it's OK to manipulate them without
- * holding the spinlock.
+ * holding the buffer spinlock.
*/
- while (StrategyControl->firstFreeBuffer >= 0)
+ for(;;)
{
- buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
+ if (StrategyControl->firstFreeBuffer >= 0)
+ {
+ buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+ Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot use
- * it; discard it and retry. (This can only happen if VACUUM put a
- * valid buffer in the freelist and then someone else used it before
- * we got to it. It's probably impossible altogether as of 8.3, but
- * we'd better check anyway.)
- */
- LockBufHdr(buf);
- if (buf->refcount == 0 && buf->usage_count == 0)
+ /* Unconditionally remove buffer from freelist */
+ StrategyControl->firstFreeBuffer = buf->freeNext;
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ --StrategyControl->numFreeListBuffers;
+
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ /*
+ * If the buffer is pinned or has a nonzero usage_count, we cannot use
+ * it; discard it and retry. (This can only happen if VACUUM put a
+ * valid buffer in the freelist and then someone else used it before
+ * we got to it. It's probably impossible altogether as of 8.3, but
+ * we'd better check anyway.)
+ */
+ LockBufHdr(buf);
+ if (buf->refcount == 0 && buf->usage_count == 0)
+ {
+ if (strategy != NULL)
+ AddBufferToRing(strategy, buf);
+ return buf;
+ }
+ UnlockBufHdr(buf);
+ }
+ else
{
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- return buf;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ break;
}
- UnlockBufHdr(buf);
}
/* Nothing on the freelist, so run the "clock sweep" algorithm */
trycounter = NBuffers;
+
+ *lock_held = true;
+ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+
for (;;)
{
buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
@@ -196,7 +214,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
if (++StrategyControl->nextVictimBuffer >= NBuffers)
{
StrategyControl->nextVictimBuffer = 0;
- StrategyControl->completePasses++;
+ /*StrategyControl->completePasses++;*/
}
/*
@@ -241,7 +259,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
void
StrategyFreeBuffer(volatile BufferDesc *buf)
{
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
/*
* It is possible that we are told to put something in the freelist that
@@ -253,12 +271,51 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
if (buf->freeNext < 0)
StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
+ ++StrategyControl->numFreeListBuffers;
}
- LWLockRelease(BufFreelistLock);
+ SpinLockRelease(&StrategyControl->freelist_lck);
}
/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+ bool freed = false;
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+
+ /*
+ * It is possible that we are told to put something in the freelist that
+ * is already in it; don't screw up the list if so.
+ */
+ if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+ {
+ ++StrategyControl->numFreeListBuffers;
+ freed = true;
+ /*
+ * Put the buffer at the end of the list; if the list is empty,
+ * then point both the first and last free buffer at this buffer id.
+ */
+ buf->freeNext = FREENEXT_END_OF_LIST;
+ if (StrategyControl->firstFreeBuffer < 0)
+ {
+ StrategyControl->firstFreeBuffer = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ return freed;
+ }
+ BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ }
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ return freed;
+}
+
+
+/*
* StrategySyncStart -- tell BufferSync where to start syncing
*
* The result is the buffer index of the best buffer to sync first.
@@ -288,6 +345,46 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
}
/*
+ * StrategySyncStartAndEnd -- tell BgWriter where to start looking
+ * for unused buffers.
+ *
+ * The result is the buffer index of the best buffer to start looking for
+ * unused buffers, number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs.
+ *
+ * In addition, we return the number of buffers on the freelist.
+ */
+int
+StrategySyncStartAndEnd(uint32 *start, uint32 *end, uint32 *num_buf_alloc)
+{
+ int curfreebuffers;
+
+ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ *start = StrategyControl->nextVictimBuffer;
+ LWLockRelease(BufFreelistLock);
+
+ /*
+ * Ideally numFreeListBuffers should be read under the freelist spinlock;
+ * however here we need this number for estimating approximate number of
+ * free buffers required on freelist, so it should not be a problem, even
+ * if numFreeListBuffers is not exact.
+ */
+
+ curfreebuffers = StrategyControl->numFreeListBuffers;
+ if (curfreebuffers < freelistHighThreshold)
+ *end = freelistHighThreshold - curfreebuffers;
+ else
+ *end = 0;
+ if (num_buf_alloc)
+ {
+ *num_buf_alloc = StrategyControl->numBufferAllocs;
+ StrategyControl->numBufferAllocs = 0;
+ }
+
+ return curfreebuffers;
+}
+
+/*
* StrategyNotifyBgWriter -- set or clear allocation notification latch
*
* If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -309,6 +406,12 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
}
+void
+StrategyInitBgWriterLatch(Latch *bgwriterLatch)
+{
+ StrategyControl->bgwriterLatch = bgwriterLatch;
+}
+
/*
* StrategyShmemSize
*
@@ -376,6 +479,7 @@ StrategyInitialize(bool init)
*/
StrategyControl->firstFreeBuffer = 0;
StrategyControl->lastFreeBuffer = NBuffers - 1;
+ StrategyControl->numFreeListBuffers = NBuffers;
/* Initialize the clock sweep pointer */
StrategyControl->nextVictimBuffer = 0;
@@ -386,9 +490,42 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwriterLatch = NULL;
+ SpinLockInit(&StrategyControl->freelist_lck);
}
else
Assert(!init);
+
+ /*
+ * Initialize the low and high threshold number of buffers
+ * for the freelist. This is used to maintain buffers on the freelist
+ * so that backends don't often need to perform the clock sweep to
+ * find a buffer.
+ */
+ if (NBuffers > 100000)
+ {
+ freelistLowThreshold = 200;
+ freelistHighThreshold = 2000;
+ }
+ else if (NBuffers > 10000)
+ {
+ freelistLowThreshold = 100;
+ freelistHighThreshold = 1000;
+ }
+ else if (NBuffers > 1000)
+ {
+ freelistLowThreshold = 50;
+ freelistHighThreshold = 200;
+ }
+ else if (NBuffers > 100)
+ {
+ freelistLowThreshold = 30;
+ freelistHighThreshold = 75;
+ }
+ else
+ {
+ freelistLowThreshold = 5;
+ freelistHighThreshold = 15;
+ }
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d9de09f..a87954a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -389,6 +389,7 @@ typedef struct PgStat_MsgBgWriter
PgStat_Counter m_buf_written_backend;
PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_buf_alloc;
+ PgStat_Counter m_buf_freelist;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgBgWriter;
@@ -537,7 +538,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -662,6 +663,7 @@ typedef struct PgStat_GlobalStats
PgStat_Counter buf_written_backend;
PgStat_Counter buf_fsync_backend;
PgStat_Counter buf_alloc;
+ PgStat_Counter buf_freelist;
TimestampTz stat_reset_timestamp;
} PgStat_GlobalStats;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..9eb7be6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -161,6 +161,16 @@ typedef struct sbufdesc
#define FREENEXT_NOT_IN_LIST (-2)
/*
+ * Threshold indicators for maintaining buffers on freelist. When the
+ * number of buffers on freelist drops below the low threshold, the
+ * allocating backend sets the latch and bgwriter wakes up and begins
+ * adding buffers to the freelist until it reaches the high threshold and then
+ * again goes back to sleep.
+ */
+int freelistLowThreshold;
+int freelistHighThreshold;
+
+/*
* Macros for acquiring/releasing a shared buffer header's spinlock.
* Do not apply these to local buffers!
*
@@ -188,11 +198,15 @@ extern BufferDesc *LocalBufferDescriptors;
extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
bool *lock_held);
extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
volatile BufferDesc *buf);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern int StrategySyncStartAndEnd(uint32 *start, uint32 *end,
+ uint32 *num_buf_alloc);
extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgWriterLatch(Latch *bgwriterLatch);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..b0e5598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
extern bool BgBufferSync(void);
+extern void BgBufferSyncAndMoveBuffersToFreelist(void);
extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 175fae3..fe86e07 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -136,10 +136,10 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
*/
/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 16
+#define NUM_BUFFER_PARTITIONS 128
/* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS 4
+#define LOG2_NUM_LOCK_PARTITIONS 7
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
/* Number of partitions the shared predicate lock tables are divided into */
Amit Kapila <amit.kapila16@gmail.com> wrote:
I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffers in the free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakes up and begins
adding buffers to the freelist until it reaches the high threshold and then
again goes back to sleep.
The numbers from your benchmarks are very exciting, but the above
concerns me. My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls. By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.
Essentially I was able to tune the bgwriter so that a dirty page
was always pushed out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Amit Kapila <amit.kapila16@gmail.com> wrote:
I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffers in the free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakes up and begins
adding buffers to the freelist until it reaches the high threshold and then
again goes back to sleep.
The numbers from your benchmarks are very exciting, but the above
concerns me. My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist, but toward
preventing shared_buffers from accumulating a lot of dirty pages,
which were leading to cascades of writes between caches and thus to
write stalls. By pushing dirty pages into the (*much* larger) OS
cache, and letting write combining happen there, where the OS could
pace based on the total number of dirty pages instead of having
some hidden and appearing rather suddenly, latency spikes were
avoided while not causing any noticeable increase in the number of
OS writes to the RAID controller's cache.
Essentially I was able to tune the bgwriter so that a dirty page
was always pushed out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter.
I think it would have been better if bgwriter did its writes based on
the amount of buffers that get dirtied, to achieve the balance of
writes.
Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.
I agree that for some cases, as explained by you, the current bgwriter
logic does satisfy the need; however, there are other cases where it
doesn't help much. One such case I am trying to improve (easing backend
buffer allocations), and another may be when there is constant write
activity, for which I am not sure how much it really helps.
Part of the reason for trying to make bgwriter respond mainly to ease
backend allocations is the previous discussion on the same; refer to
the link below:
/messages/by-id/CA+TgmoZ7dvhC4h-ffJmZCff6VWyNfOEAPZ021VxW61uH46R3QA@mail.gmail.com
However, if we want to retain the current property of bgwriter, we can
do so in one of the ways below:
a. Have separate processes for writing dirty buffers and moving buffers
to freelist.
b. In the current bgwriter, separate the two kinds of work based on the
need. The need can be decided based on whether bgwriter has been woken
due to a shortage of buffers on the free list or due to BgWriterDelay.
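To illustrate option (b), here is a rough, uncompiled sketch (not part
of any attached patch) of how the bgwriter main loop could decide which
work to do from the WaitLatch() result; MoveBuffersToFreelist() is only
a hypothetical placeholder for the freelist-refill work:

    rc = WaitLatch(&MyProc->procLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   BgWriterDelay /* ms */ );

    if (rc & WL_LATCH_SET)
    {
        /* Woken by a backend: freelist is below the low threshold,
         * so only refill the freelist; skip the LRU writing. */
        MoveBuffersToFreelist();        /* hypothetical helper */
    }
    else if (rc & WL_TIMEOUT)
    {
        /* Woken by BgWriterDelay: do the usual dirty-buffer writing. */
        (void) BgBufferSync();
    }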
Now, as populating the freelist and balancing writes by writing dirty
buffers are two separate responsibilities, I am not sure if doing both
in one process is a good idea.
I am planning to take some more performance data, part of which will
be write load as well, but I am not sure if that will show the need
you mention.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Jun 8, 2014 at 9:51 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
Amit Kapila <amit.kapila16@gmail.com> wrote:
I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffers in the free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakes up and begins
adding buffers to the freelist until it reaches the high threshold and then
again goes back to sleep.
The numbers from your benchmarks are very exciting, but the above
concerns me. My tuning of the bgwriter in production has generally
*not* been aimed at keeping pages on the freelist,
Just to be clear, prior to this patch, the bgwriter has never been in
the business of putting pages on the freelist in the first place, so
it wouldn't have been possible for you to tune for that.
Essentially I was able to tune the bgwriter so that a dirty page
was always pushed out to the OS cache within three seconds, which led
to a healthy balance of writes between the checkpoint process and
the bgwriter. Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.
I think, as Amit says downthread, that the crucial design question
here is whether we need two processes, one to populate the freelist so
that regular backends don't need to run the clock sweep, and a second
to flush dirty buffers, or whether a single process can serve both
needs. In favor of a single process, many people have commented that
the background writer doesn't seem to do much right now. If the
process is mostly sitting around idle, then giving it more
responsibilities might be OK. In favor of having a second process,
I'm a little concerned that if the background writer gets busy writing
a page, it might then be unavailable to populate the freelist until it
finishes, which might be a very long time relative to the buffer
allocation needs of other backends. I'm not sure what the right
answer is.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 9, 2014 at 9:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Jun 8, 2014 at 7:21 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Backend processes related to user connections still
performed about 30% of the writes, and this work shows promise
toward bringing that down, which would be great; but please don't
eliminate the ability to prevent write stalls in the process.
I am planning to take some more performance data, part of which will
be write load as well, but I am not sure if that will show the need
you mention.
After taking the performance data for a write load using tpc-b with the
patch, I found that there is a regression. So I went ahead and tried to
figure out the reason and found that, after the patch, Bgwriter started
flushing buffers which were required by backends; the reason was that
*nextVictimBuffer* was not getting updated properly while running the
clock-sweep-like logic (decrement the usage count when the number of
buffers on the freelist falls below the low threshold value) in Bgwriter.
In HEAD, I noticed that at default settings BGwriter was not flushing
any buffers at all, which is at least better than what my patch was
doing (flushing buffers required by backends).
So I tried to fix the issue by updating *nextVictimBuffer* in the new
BGWriter logic, and the results are positive.
sbe - scalable buffer eviction
Select only Data

Client count/TPS     64        128
Un-patched           45232     17310
sbe_v3               111468    114521
sbe_v4               153137    160752

TPC-B

Client count/TPS     64        128
Un-patched           825       784
sbe_v4               814       845
For the select-only data, I am quite confident that it will improve if
we introduce nextVictimBuffer increments in BGwriter, and it scales much
better with that change; however, for TPC-B I am getting fluctuation in
the data, so I am not sure it has eliminated the problem. The main
difference is that in HEAD, BGwriter never increments nextVictimBuffer
while syncing the buffers; it just notes down the current setting before
starting and then proceeds sequentially.
I think it will be good if we can have a new process for moving buffers
to the free list, for the reasons below:
a. While trying to move buffers to the freelist, it should not block
due to intervening write activity.
b. The writer should not increment nextVictimBuffer, and should maintain
the current logic.
One significant change in this version of the patch is to use a separate
spinlock to protect nextVictimBuffer rather than using BufFreelistLock.
Suggestions?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
scalable_buffer_eviction_v4.patch
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 780ee3b..ae4237d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -67,12 +67,6 @@
int BgWriterDelay = 200;
/*
- * Multiplier to apply to BgWriterDelay when we decide to hibernate.
- * (Perhaps this needs to be configurable?)
- */
-#define HIBERNATE_FACTOR 50
-
-/*
* Interval in which standby snapshots are logged into the WAL stream, in
* milliseconds.
*/
@@ -111,7 +105,6 @@ BackgroundWriterMain(void)
{
sigjmp_buf local_sigjmp_buf;
MemoryContext bgwriter_context;
- bool prev_hibernate;
/*
* If possible, make this process a group leader, so that the postmaster
@@ -246,19 +239,15 @@ BackgroundWriterMain(void)
*/
PG_SETMASK(&UnBlockSig);
- /*
- * Reset hibernation state after any error.
- */
- prev_hibernate = false;
+ /* Initialize the freelist latch. */
+ StrategyInitBgWriterLatch(&MyProc->procLatch);
/*
* Loop forever
*/
for (;;)
{
- bool can_hibernate;
int rc;
-
/* Clear any already-pending wakeups */
ResetLatch(&MyProc->procLatch);
@@ -279,9 +268,25 @@ BackgroundWriterMain(void)
}
/*
- * Do one cycle of dirty-buffer writing.
+ * Sleep until signalled by a backend or LOG_SNAPSHOT_INTERVAL_MS has
+ * elapsed.
+ *
+ * A backend will signal bgwriter when the number of buffers on the
+ * freelist falls below the low threshold of the freelist. We need
+ * to wake bgwriter after LOG_SNAPSHOT_INTERVAL_MS to ensure that
+ * it can log information about xl_running_xacts.
*/
- can_hibernate = BgBufferSync();
+ if (XLogStandbyInfoActive() && !RecoveryInProgress())
+ rc = WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ LOG_SNAPSHOT_INTERVAL_MS);
+ else
+ rc = WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH,
+ -1);
+
+ if (rc & WL_LATCH_SET)
+ BgBufferSyncAndMoveBuffersToFreelist();
/*
* Send off activity statistics to the stats collector
@@ -318,7 +323,9 @@ BackgroundWriterMain(void)
* Checkpointer, when active, is barely ever in its mainloop and thus
* makes it hard to log regularly.
*/
- if (XLogStandbyInfoActive() && !RecoveryInProgress())
+ if ((rc & WL_TIMEOUT || rc & WL_LATCH_SET) &&
+ XLogStandbyInfoActive() &&
+ !RecoveryInProgress())
{
TimestampTz timeout = 0;
TimestampTz now = GetCurrentTimestamp();
@@ -339,57 +346,11 @@ BackgroundWriterMain(void)
}
/*
- * Sleep until we are signaled or BgWriterDelay has elapsed.
- *
- * Note: the feedback control loop in BgBufferSync() expects that we
- * will call it every BgWriterDelay msec. While it's not critical for
- * correctness that that be exact, the feedback loop might misbehave
- * if we stray too far from that. Hence, avoid loading this process
- * down with latch events that are likely to happen frequently during
- * normal operation.
- */
- rc = WaitLatch(&MyProc->procLatch,
- WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- BgWriterDelay /* ms */ );
-
- /*
- * If no latch event and BgBufferSync says nothing's happening, extend
- * the sleep in "hibernation" mode, where we sleep for much longer
- * than bgwriter_delay says. Fewer wakeups save electricity. When a
- * backend starts using buffers again, it will wake us up by setting
- * our latch. Because the extra sleep will persist only as long as no
- * buffer allocations happen, this should not distort the behavior of
- * BgBufferSync's control loop too badly; essentially, it will think
- * that the system-wide idle interval didn't exist.
- *
- * There is a race condition here, in that a backend might allocate a
- * buffer between the time BgBufferSync saw the alloc count as zero
- * and the time we call StrategyNotifyBgWriter. While it's not
- * critical that we not hibernate anyway, we try to reduce the odds of
- * that by only hibernating when BgBufferSync says nothing's happening
- * for two consecutive cycles. Also, we mitigate any possible
- * consequences of a missed wakeup by not hibernating forever.
- */
- if (rc == WL_TIMEOUT && can_hibernate && prev_hibernate)
- {
- /* Ask for notification at next buffer allocation */
- StrategyNotifyBgWriter(&MyProc->procLatch);
- /* Sleep ... */
- rc = WaitLatch(&MyProc->procLatch,
- WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- BgWriterDelay * HIBERNATE_FACTOR);
- /* Reset the notification request in case we timed out */
- StrategyNotifyBgWriter(NULL);
- }
-
- /*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
*/
if (rc & WL_POSTMASTER_DEATH)
exit(1);
-
- prev_hibernate = can_hibernate;
}
}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3ab1428..d82667b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5020,6 +5020,7 @@ pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
globalStats.buf_written_backend += msg->m_buf_written_backend;
globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
globalStats.buf_alloc += msg->m_buf_alloc;
+ globalStats.buf_freelist += msg->m_buf_freelist;
}
/* ----------
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07ea665..5b8975b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1637,10 +1637,75 @@ BgBufferSync(void)
}
/*
+ * Write out some dirty buffers in the pool and maintain enough
+ * buffers on the freelist (equal to the high threshold for the
+ * freelist), so that backends don't need to perform the clock sweep
+ * often.
+ *
+ * This is called by the background writer process when the number
+ * of buffers on the freelist falls below the low threshold of the freelist.
+ */
+void
+BgBufferSyncAndMoveBuffersToFreelist(void)
+{
+ volatile uint32 next_to_clean;
+ uint32 num_to_free;
+ uint32 tmp_num_to_free;
+ uint32 save_next_to_clean;
+ uint32 recent_alloc;
+ int num_written;
+ int num_freelist;
+ volatile BufferDesc *bufHdr;
+
+ num_freelist = StrategySyncStartAndEnd(&save_next_to_clean,
+ &num_to_free,
+ &recent_alloc);
+
+ /* Report buffer alloc and buffer freelist counts to pgstat */
+ BgWriterStats.m_buf_alloc += recent_alloc;
+ BgWriterStats.m_buf_freelist += num_freelist;
+
+ /* Make sure we can handle the pin inside SyncOneBuffer */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ num_written = 0;
+ tmp_num_to_free = num_to_free;
+ next_to_clean = save_next_to_clean;
+
+ /* Execute the LRU scan */
+ while (tmp_num_to_free > 0)
+ {
+ int buffer_state = SyncOneBuffer(next_to_clean, true);
+
+ bufHdr = &BufferDescriptors[next_to_clean];
+
+ /* choose next victim buffer to clean. */
+ StrategySyncNextVictimBuffer(&next_to_clean);
+ if (buffer_state & BUF_WRITTEN)
+ ++num_written;
+ if (buffer_state & BUF_REUSABLE)
+ {
+ if (StrategyMoveBufferToFreeListEnd (bufHdr))
+ tmp_num_to_free--;
+ }
+ }
+
+ BgWriterStats.m_buf_written_clean += num_written;
+
+#ifdef BGW_DEBUG
+ elog(DEBUG1, "bgwriter: recent_alloc=%u num_freelist=%u next_to_clean=%d wrote=%d num_freed=%u",
+ recent_alloc, num_freelist, save_next_to_clean, num_written,
+ num_to_free);
+#endif
+}
+
+/*
* SyncOneBuffer -- process a single buffer during syncing.
*
- * If skip_recently_used is true, we don't write currently-pinned buffers, nor
- * buffers marked recently used, as these are not replacement candidates.
+ * If skip_recently_used is true, we decrement the usage count so that
+ * we can find reusable buffers in subsequent cycles; also, we don't write
+ * currently-pinned buffers, nor buffers marked recently used, as these are
+ * not replacement candidates.
*
* Returns a bitmask containing the following flag bits:
* BUF_WRITTEN: we wrote the buffer.
@@ -1673,7 +1738,13 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
result |= BUF_REUSABLE;
else if (skip_recently_used)
{
- /* Caller told us not to write recently-used buffers */
+ /*
+ * Caller told us not to write recently-used buffers; instead,
+ * reduce the usage count so that reusable buffers can be
+ * found in subsequent cycles.
+ */
+ if (bufHdr->refcount == 0 && bufHdr->usage_count > 0)
+ bufHdr->usage_count--;
UnlockBufHdr(bufHdr);
return result;
}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..358f35c 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
int firstFreeBuffer; /* Head of list of unused buffers */
int lastFreeBuffer; /* Tail of list of unused buffers */
+ int numFreeListBuffers; /* number of buffers on freelist */
/*
* NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -43,7 +44,21 @@ typedef struct
uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
- * Notification latch, or NULL if none. See StrategyNotifyBgWriter.
+ * protects freelist variables (firstFreeBuffer, lastFreeBuffer,
+ * numFreeListBuffers, BufferDesc->freeNext).
+ */
+ slock_t freelist_lck;
+
+ /*
+ * Protects nextVictimBuffer. We need a separate lock to protect
+ * the victim buffer so that the clock sweep of one backend doesn't
+ * contend with another backend which is evicting a buffer from
+ * the freelist.
+ */
+ slock_t victimbuf_lck;
+
+ /*
+ * Latch to wake bgwriter.
*/
Latch *bgwriterLatch;
} BufferStrategyControl;
@@ -112,7 +127,6 @@ volatile BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
{
volatile BufferDesc *buf;
- Latch *bgwriterLatch;
int trycounter;
/*
@@ -129,76 +143,92 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
}
}
- /* Nope, so lock the freelist */
- *lock_held = true;
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
-
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We count buffer allocation requests so that the bgwriter can know
+ * the rate of buffer consumption and report it as stats. Note that
+ * buffers recycled by a strategy object are intentionally not counted
+ * here.
*/
StrategyControl->numBufferAllocs++;
+ *lock_held = false;
/*
- * If bgwriterLatch is set, we need to waken the bgwriter, but we should
- * not do so while holding BufFreelistLock; so release and re-grab. This
- * is annoyingly tedious, but it happens at most once per bgwriter cycle,
- * so the performance hit is minimal.
+ * Ideally numFreeListBuffers should be read under the freelist spinlock;
+ * however, here we need this number only to estimate the approximate number
+ * of free buffers required on the freelist, so it should not be a problem
+ * even if numFreeListBuffers is not exact. bgwriterLatch is initialized in
+ * an early phase of BgWriter startup, however we still check it before use
+ * to avoid any problem in case we reach here before its initialization.
*/
- bgwriterLatch = StrategyControl->bgwriterLatch;
- if (bgwriterLatch)
- {
- StrategyControl->bgwriterLatch = NULL;
- LWLockRelease(BufFreelistLock);
- SetLatch(bgwriterLatch);
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
- }
+ if (StrategyControl->numFreeListBuffers < freelistLowThreshold &&
+ StrategyControl->bgwriterLatch)
+ SetLatch(StrategyControl->bgwriterLatch);
/*
* Try to get a buffer from the freelist. Note that the freeNext fields
- * are considered to be protected by the BufFreelistLock not the
+ * are considered to be protected by the freelist_lck not the
* individual buffer spinlocks, so it's OK to manipulate them without
- * holding the spinlock.
+ * holding the buffer spinlock.
*/
- while (StrategyControl->firstFreeBuffer >= 0)
+ for(;;)
{
- buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
- Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
- /* Unconditionally remove buffer from freelist */
- StrategyControl->firstFreeBuffer = buf->freeNext;
- buf->freeNext = FREENEXT_NOT_IN_LIST;
+ if (StrategyControl->firstFreeBuffer >= 0)
+ {
+ buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+ Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
- /*
- * If the buffer is pinned or has a nonzero usage_count, we cannot use
- * it; discard it and retry. (This can only happen if VACUUM put a
- * valid buffer in the freelist and then someone else used it before
- * we got to it. It's probably impossible altogether as of 8.3, but
- * we'd better check anyway.)
- */
- LockBufHdr(buf);
- if (buf->refcount == 0 && buf->usage_count == 0)
+ /* Unconditionally remove buffer from freelist */
+ StrategyControl->firstFreeBuffer = buf->freeNext;
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ --StrategyControl->numFreeListBuffers;
+
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ /*
+ * If the buffer is pinned or has a nonzero usage_count, we cannot use
+ * it; discard it and retry. (This can only happen if VACUUM put a
+ * valid buffer in the freelist and then someone else used it before
+ * we got to it. It's probably impossible altogether as of 8.3, but
+ * we'd better check anyway.)
+ */
+ LockBufHdr(buf);
+ if (buf->refcount == 0 && buf->usage_count == 0)
+ {
+ if (strategy != NULL)
+ AddBufferToRing(strategy, buf);
+ return buf;
+ }
+ UnlockBufHdr(buf);
+ }
+ else
{
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- return buf;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ break;
}
- UnlockBufHdr(buf);
}
/* Nothing on the freelist, so run the "clock sweep" algorithm */
trycounter = NBuffers;
+
+ /**lock_held = true;
+ LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);*/
+
for (;;)
{
+ SpinLockAcquire(&StrategyControl->victimbuf_lck);
+
buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
if (++StrategyControl->nextVictimBuffer >= NBuffers)
{
StrategyControl->nextVictimBuffer = 0;
- StrategyControl->completePasses++;
+ /*StrategyControl->completePasses++;*/
}
+ SpinLockRelease(&StrategyControl->victimbuf_lck);
+
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot use
* it; decrement the usage_count (unless pinned) and keep scanning.
@@ -241,7 +271,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
void
StrategyFreeBuffer(volatile BufferDesc *buf)
{
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
/*
* It is possible that we are told to put something in the freelist that
@@ -253,11 +283,50 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
if (buf->freeNext < 0)
StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
+ ++StrategyControl->numFreeListBuffers;
}
- LWLockRelease(BufFreelistLock);
+ SpinLockRelease(&StrategyControl->freelist_lck);
+}
+
+/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+ bool freed = false;
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+
+ /*
+ * It is possible that we are told to put something in the freelist that
+ * is already in it; don't screw up the list if so.
+ */
+ if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+ {
+ ++StrategyControl->numFreeListBuffers;
+ freed = true;
+ /*
+ * Put the buffer at the end of the list; if the list is empty,
+ * then point both the first and last free buffer at this buffer id.
+ */
+ buf->freeNext = FREENEXT_END_OF_LIST;
+ if (StrategyControl->firstFreeBuffer < 0)
+ {
+ StrategyControl->firstFreeBuffer = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ return freed;
+ }
+ BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ }
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ return freed;
}
+
/*
* StrategySyncStart -- tell BufferSync where to start syncing
*
@@ -274,8 +343,10 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
{
int result;
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->victimbuf_lck);
result = StrategyControl->nextVictimBuffer;
+ SpinLockRelease(&StrategyControl->victimbuf_lck);
+
if (complete_passes)
*complete_passes = StrategyControl->completePasses;
if (num_buf_alloc)
@@ -283,11 +354,69 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
*num_buf_alloc = StrategyControl->numBufferAllocs;
StrategyControl->numBufferAllocs = 0;
}
- LWLockRelease(BufFreelistLock);
return result;
}
/*
+ * StrategySyncStartAndEnd -- tell BgWriter where to start looking
+ * for unused buffers.
+ *
+ * The result is the buffer index of the best buffer to start looking for
+ * unused buffers, number of buffers that are required to be moved to
+ * freelist and count of recent buffer allocs.
+ *
+ * In addition, we return the number of buffers on the freelist.
+ */
+int
+StrategySyncStartAndEnd(uint32 *start, uint32 *end, uint32 *num_buf_alloc)
+{
+ int curfreebuffers;
+
+ SpinLockAcquire(&StrategyControl->victimbuf_lck);
+ *start = StrategyControl->nextVictimBuffer;
+ SpinLockRelease(&StrategyControl->victimbuf_lck);
+
+ /*
+ * Ideally numFreeListBuffers should be read under the freelist spinlock;
+ * however here we need this number for estimating approximate number of
+ * free buffers required on freelist, so it should not be a problem, even
+ * if numFreeListBuffers is not exact.
+ */
+
+ curfreebuffers = StrategyControl->numFreeListBuffers;
+ if (curfreebuffers < freelistHighThreshold)
+ *end = freelistHighThreshold - curfreebuffers;
+ else
+ *end = 0;
+
+ /*
+ * We need numBufferAllocs just for statistics purpose, so getting
+ * the number with lock.
+ */
+ if (num_buf_alloc)
+ {
+ *num_buf_alloc = StrategyControl->numBufferAllocs;
+ StrategyControl->numBufferAllocs = 0;
+ }
+
+ return curfreebuffers;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- tell BgWriter which next unused
+ * buffer to look for syncing.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+ SpinLockAcquire(&StrategyControl->victimbuf_lck);
+ if (++StrategyControl->nextVictimBuffer >= NBuffers)
+ StrategyControl->nextVictimBuffer = 0;
+ *next_victim_buffer = StrategyControl->nextVictimBuffer;
+ SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
* StrategyNotifyBgWriter -- set or clear allocation notification latch
*
* If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -309,6 +438,12 @@ StrategyNotifyBgWriter(Latch *bgwriterLatch)
}
+void
+StrategyInitBgWriterLatch(Latch *bgwriterLatch)
+{
+ StrategyControl->bgwriterLatch = bgwriterLatch;
+}
+
/*
* StrategyShmemSize
*
@@ -376,6 +511,7 @@ StrategyInitialize(bool init)
*/
StrategyControl->firstFreeBuffer = 0;
StrategyControl->lastFreeBuffer = NBuffers - 1;
+ StrategyControl->numFreeListBuffers = NBuffers;
/* Initialize the clock sweep pointer */
StrategyControl->nextVictimBuffer = 0;
@@ -386,9 +522,43 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwriterLatch = NULL;
+ SpinLockInit(&StrategyControl->freelist_lck);
+ SpinLockInit(&StrategyControl->victimbuf_lck);
}
else
Assert(!init);
+
+ /*
+ * Initialize the low and high threshold number of buffers
+ * for the freelist. This is used to maintain buffers on the freelist
+ * so that backends don't often need to perform the clock sweep to
+ * find a buffer.
+ */
+ if (NBuffers > 100000)
+ {
+ freelistLowThreshold = 200;
+ freelistHighThreshold = 2000;
+ }
+ else if (NBuffers > 10000)
+ {
+ freelistLowThreshold = 100;
+ freelistHighThreshold = 1000;
+ }
+ else if (NBuffers > 1000)
+ {
+ freelistLowThreshold = 50;
+ freelistHighThreshold = 200;
+ }
+ else if (NBuffers > 100)
+ {
+ freelistLowThreshold = 30;
+ freelistHighThreshold = 75;
+ }
+ else
+ {
+ freelistLowThreshold = 5;
+ freelistHighThreshold = 15;
+ }
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0892533..2b55bca 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -397,6 +397,7 @@ typedef struct PgStat_MsgBgWriter
PgStat_Counter m_buf_written_backend;
PgStat_Counter m_buf_fsync_backend;
PgStat_Counter m_buf_alloc;
+ PgStat_Counter m_buf_freelist;
PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
PgStat_Counter m_checkpoint_sync_time;
} PgStat_MsgBgWriter;
@@ -545,7 +546,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9C
+#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9D
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -670,6 +671,7 @@ typedef struct PgStat_GlobalStats
PgStat_Counter buf_written_backend;
PgStat_Counter buf_fsync_backend;
PgStat_Counter buf_alloc;
+ PgStat_Counter buf_freelist;
TimestampTz stat_reset_timestamp;
} PgStat_GlobalStats;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..54a8b8f 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -161,6 +161,16 @@ typedef struct sbufdesc
#define FREENEXT_NOT_IN_LIST (-2)
/*
+ * Threshold indicators for maintaining buffers on freelist. When the
+ * number of buffers on freelist drops below the low threshold, the
+ * allocating backend sets the latch and bgwriter wakes up and begins
+ * adding buffers to the freelist until it reaches the high threshold and then
+ * again goes back to sleep.
+ */
+int freelistLowThreshold;
+int freelistHighThreshold;
+
+/*
* Macros for acquiring/releasing a shared buffer header's spinlock.
* Do not apply these to local buffers!
*
@@ -188,11 +198,16 @@ extern BufferDesc *LocalBufferDescriptors;
extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
bool *lock_held);
extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
volatile BufferDesc *buf);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern int StrategySyncStartAndEnd(uint32 *start, uint32 *end,
+ uint32 *num_buf_alloc);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgWriterLatch(Latch *bgwriterLatch);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..b0e5598 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
extern bool BgBufferSync(void);
+extern void BgBufferSyncAndMoveBuffersToFreelist(void);
extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d588b14..cd26ff0 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -136,7 +136,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
*/
/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 16
+#define NUM_BUFFER_PARTITIONS 128
/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS 4
On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have improved the patch by making following changes:
a. Improved the bgwriter logic to log for xl_running_xacts info and
removed the hibernate logic as bgwriter will now work only when
there is scarcity of buffers in the free list. Basic idea is when the
number of buffers on freelist drops below the low threshold, the
allocating backend sets the latch and bgwriter wakes up and begins
adding buffers to the freelist until it reaches the high threshold and then
again goes back to sleep.
This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.
A comparison of BgBufferSync() with BgBufferSyncAndMoveBuffersToFreelist()
reveals that you've removed at least one behavior that some people (at
least, me) will care about, which is the guarantee that the background
writer will scan the entire buffer pool at least every couple of minutes.
This is important because it guarantees that dirty data doesn't sit in
memory forever. When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk. This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.
b. New stats for the number of buffers on the freelist have been added;
some old ones like maxwritten_clean can be removed, as the new logic for
syncing buffers and moving them to the free list doesn't use them.
However, I think it's better to remove them once the new logic is
accepted. Added some new logs for info related to the free list under
BGW_DEBUG.
If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense. I think what you should be counting is
the number of allocations that are being satisfied from the free-list.
Then, by comparing the rate at which that value is incrementing to the rate
at which buffers_alloc is incrementing, somebody can figure out what
percentage of allocations are requiring a clock-sweep run. Actually, I
think it's better to flip it around: count the number of allocations that
require an individual backend to run the clock sweep (vs. being satisfied
from the free-list); call it, say, buffers_backend_clocksweep. We can then
try to tune the patch to make that number as small as possible under
varying workloads.
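Just to illustrate what counting it that way around could look like
(this is only a sketch, not code from the patch, and numBackendClocksweep
is an invented field), the increment would go at the point where
StrategyGetBuffer() gives up on the freelist and falls through to the
clock sweep:

    /* Freelist was empty: this allocation will need a clock sweep. */
    SpinLockAcquire(&StrategyControl->freelist_lck);
    StrategyControl->numBackendClocksweep++;    /* invented counter */
    SpinLockRelease(&StrategyControl->freelist_lck);

    /* ... existing "clock sweep" loop follows ... */

The bgwriter could then pick it up and report it the same way
numBufferAllocs is reported today, so that buffers_backend_clocksweep
would show up next to buffers_alloc in the stats.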
c. Used the already existing bgwriterLatch in BufferStrategyControl to
wake bgwriter when number of buffer's in freelist drops below
threshold.
Seems like a good idea.
d. Autotune the low and high threshold for freelist for various
configurations. Generally, if we keep a small number (200~2000) of
buffers always available on the freelist, then even for high shared
buffers like 15GB it appears to be sufficient. However, when the value
of shared_buffers is smaller, we need a much smaller number. I think we
can provide these as config knobs for the user as well, but for now,
based on LWLOCK_STATS results, I have chosen some hard-coded values for
the low and high thresholds for the freelist. Values for the low and
high threshold have been decided based on the total number of shared
buffers; basically I have divided them into 5 categories (16~100,
100~1000, 1000~10000, 10000~100000, 100000 and above) and then ran
tests (read-only pgbench) for various configurations falling under
these categories. The reason for keeping fewer categories for larger
shared buffers is that a small number (200~2000) of buffers available
on the free list seems to be sufficient even for quite high loads;
however, as the total number of shared buffers decreases we need to be
more careful: if we keep the number too low it will lead to more clock
sweeps by backends (which means freelist lock contention), and if we
keep it higher, bgwriter will evict many useful buffers. Results based
on LWLOCK_STATS are at the end of the mail.
I think we need to come up with some kind of formula here rather than just
a list of hard-coded constants. And it definitely needs some comments
explaining the logic behind the choices.
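For instance (purely illustrative, with made-up constants, and not
something taken from the patch), the thresholds could be derived from
NBuffers and clamped, instead of using the if/else ladder:

    /* Illustrative only: scale freelist thresholds with NBuffers. */
    freelistHighThreshold = Min(2000, Max(15, NBuffers / 64));
    freelistLowThreshold  = Min(200,  Max(5,  freelistHighThreshold / 10));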
Aside from those specific remarks, I think the elephant in the room is the
question of whether it really makes sense to have one process which is
responsible both for populating the free list and for writing buffers to
disk. One problem, which I alluded to above under point (1), is that we
might sometimes want to ensure that dirty buffers are written out to disk
without decrementing usage counts or adding anything to the free list.
This is a potentially solvable problem, though, because we can figure out
the number of buffers that we need to scan for freelist population and the
number that we need to scan for minimum buffer pool cleaning (one cycle
every 2 minutes). Once we've met the first goal, any further buffers we
run into under the second goal get cleaned if appropriate but their usage
counts don't get pushed down nor do they get added to the freelist. Once
we meet the second goal, we can go back to sleep.
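A rough sketch of that two-goal scan (not code from either patch; the
two goal variables and CleanOneBuffer() are invented here for
illustration) could look something like this:

    int     to_free = freelist_population_goal;   /* invented */
    int     to_scan = min_cleaning_scan_goal;     /* invented */

    while (to_free > 0 || to_scan > 0)
    {
        bool    chasing_freelist = (to_free > 0);

        /* CleanOneBuffer() is a made-up helper: it writes the buffer if
         * dirty and reusable, and only pushes the usage count down when
         * asked to (i.e. while we are still chasing the freelist goal). */
        int     buffer_state = CleanOneBuffer(next_to_clean, chasing_freelist);

        if (chasing_freelist && (buffer_state & BUF_REUSABLE))
        {
            if (StrategyMoveBufferToFreeListEnd(&BufferDescriptors[next_to_clean]))
                to_free--;
        }
        if (to_scan > 0)
            to_scan--;

        if (++next_to_clean >= NBuffers)
            next_to_clean = 0;
    }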
But the other problem, which I think is likely unsolvable, is that writing
a dirty page can take a long time on a busy system (multiple seconds) and
the freelist can be emptied much, much quicker than that (milliseconds).
Although your benchmark results show great speed-ups on read-only
workloads, we're not really going to get the benefit consistently on
read-write workloads -- unless of course the background writer fails to
actually write anything, which should be viewed as a bug, not a feature --
because the freelist will often be empty while the background writer is
blocked on I/O.
I'm wondering if it would be a whole lot simpler and better to introduce a
new background process, maybe with a name like bgreclaim. That process
wouldn't write dirty buffers. Instead, it would just run the clock sweep
(i.e. the last loop inside StrategyGetBuffer) and put the buffers onto the
free list. Then, we could leave the bgwriter logic more or less intact.
It certainly needs improvement, but that could be another patch.
Incidentally, while I generally think your changes to the locking regimen
in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK. I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.
Then, in StrategyGetBuffer, acquire the freelist_lck at the point where the
LWLock is acquired today. Increment StrategyControl->numBufferAllocs; save
the value of StrategyControl->bgwriterLatch; pop a buffer off the freelist
if there is one, saving its identity. Release the spinlock. Then, set the
bgwriterLatch if needed. In the first loop, first check whether the buffer
we previously popped from the freelist is pinned or has a non-zero usage
count and return it if not, holding the buffer header lock. Otherwise,
reacquire the spinlock just long enough to pop a new potential victim and
then loop around.
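In rough pseudo-C, the flow I'm describing would look something like this
(just a sketch of the idea, not the patch; usage-count decrementing during
the sweep and error handling are omitted):

    volatile BufferDesc *buf = NULL;
    Latch      *bgwriterLatch;

    SpinLockAcquire(&StrategyControl->freelist_lck);
    StrategyControl->numBufferAllocs++;
    bgwriterLatch = StrategyControl->bgwriterLatch;
    StrategyControl->bgwriterLatch = NULL;
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
    }
    SpinLockRelease(&StrategyControl->freelist_lck);

    if (bgwriterLatch)
        SetLatch(bgwriterLatch);

    for (;;)
    {
        if (buf != NULL)
        {
            LockBufHdr(buf);
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;         /* return with buffer header lock held */
            UnlockBufHdr(buf);
        }

        /* take victimbuf_lck just long enough to pick the next victim */
        SpinLockAcquire(&StrategyControl->victimbuf_lck);
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;
        SpinLockRelease(&StrategyControl->victimbuf_lck);
    }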
Under this locking strategy, StrategyNotifyBgWriter would use
freelist_lck. Right now, the patch removes the only caller, and should
therefore remove the function as well, but if we go with the new-process
idea listed above that part would get reverted, and then you'd need to make
it use the correct spinlock. You should also go through this patch and
remove all the commented-out bits and pieces that you haven't cleaned up;
those are distracting and unhelpful.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.
I have kept it just for the reason that if the basic approach
sounds reasonable/is accepted, then I will clean it up. Sorry for
the inconvenience; I didn't realize that it could be annoying for
the reviewer. I will remove all such code from the patch in the next version.
A comparison of BgBufferSync() with
BgBufferSyncAndMoveBuffersToFreelist() reveals that you've removed at least
one behavior that some people (at least, me) will care about, which is the
guarantee that the background writer will scan the entire buffer pool at
least every couple of minutes.
Okay, I will take care of this based on the conclusion of
the other points in this mail.
This is important because it guarantees that dirty data doesn't sit in
memory forever. When the system becomes busy again after a long idle
period, users will expect the system to have used the idle time to flush
dirty buffers to disk. This also improves data recovery prospects if, for
example, somebody loses their pg_xlog directory - there may be dirty
buffers whose contents are lost, of course, but they won't be months old.
b. New stats for the number of buffers on the freelist have been added;
some old ones like maxwritten_clean can be removed as the new logic for
syncing buffers and moving them to the free list doesn't use them.
However I think it's better to remove them once the new logic is
accepted. Added some new logs for info related to the free list under
BGW_DEBUG.
If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense.
I think using 'number of buffers currently on the freelist' and
'number of recently allocated buffers' for consecutive cycles,
we can figure out approximately how many buffer allocations
need a clock sweep, assuming the low and high threshold water
marks are fixed. However there can be cases where it is not
easy to estimate that number.
I think what you should be counting is the number of allocations that are
being satisfied from the free-list. Then, by comparing the rate at which
that value is incrementing to the rate at which buffers_alloc is
incrementing, somebody can figure out what percentage of allocations are
requiring a clock-sweep run. Actually, I think it's better to flip it
around: count the number of allocations that require an individual backend
to run the clock sweep (vs. being satisfied from the free-list); call it,
say, buffers_backend_clocksweep. We can then try to tune the patch to make
that number as small as possible under varying workloads.
This can give us a clear idea of how to tune the patch; however we need
to maintain 3 counters for it in the code (recent_alloc (needed for
the current bgwriter logic) and the other 2 suggested by you). Do you
want to retain such counters in the code, or are they meant as debug
info for the patch?
d. Autotune the low and high threshold for freelist for various
configurations.
I think we need to come up with some kind of formula here rather than
just a list of hard-coded constants.
That was my initial intention as well, and I have tried approaches based
on the number of shared buffers, like keeping the threshold values as a
percentage of shared buffers, but nothing could satisfy different
kinds of workloads. The current values I have chosen are based
on experiments for various workloads at different thresholds. I have
shown the lwlock_stats data for various loads based on the current
thresholds upthread. Another way could be to make them config
knobs and use the values given by the user in case they are provided,
else go with fixed values.
There are other instances in the code as well (one of them I remember
offhand is in pglz_compress) where we use fixed values based on
different sizes.
And it definitely needs some comments explaining the logic behind the
choices.
Agreed, I shall improve them in next version of patch.
Aside from those specific remarks, I think the elephant in the room is
the question of whether it really makes sense to have one process which is
responsible both for populating the free list and for writing buffers to
disk. One problem, which I alluded to above under point (1), is that we
might sometimes want to ensure that dirty buffers are written out to disk
without decrementing usage counts or adding anything to the free list.
This is a potentially solvable problem, though, because we can figure out
the number of buffers that we need to scan for freelist population and the
number that we need to scan for minimum buffer pool cleaning (one cycle
every 2 minutes). Once we've met the first goal, any further buffers we
run into under the second goal get cleaned if appropriate but their usage
counts don't get pushed down nor do they get added to the freelist. Once
we meet the second goal, we can go back to sleep.
But the other problem, which I think is likely unsolvable, is that
writing a dirty page can take a long time on a busy system (multiple
seconds) and the freelist can be emptied much, much quicker than that
(milliseconds). Although your benchmark results show great speed-ups on
read-only workloads, we're not really going to get the benefit consistently
on read-write workloads -- unless of course the background writer fails to
actually write anything, which should be viewed as a bug, not a feature --
because the freelist will often be empty while the background writer is
blocked on I/O.
I'm wondering if it would be a whole lot simpler and better to introduce
a new background process, maybe with a name like bgreclaim.
That will certainly help in retaining the current behaviour of
bgwriter and make the idea cleaner. I will modify the patch
to have a new background process unless somebody thinks
otherwise.
That process wouldn't write dirty buffers.
If we go with this approach, one thing which we need to decide
is what to do in case a buffer which has usage_count zero is *dirty*,
as I don't think it is a good idea to put it on the freelist. A few
options to handle such a case are:
a. Skip such a buffer; the downside is that if we have to skip a lot
of buffers for this reason, then having a separate process
such as bgreclaim will be less advantageous.
b. Skip the buffer and notify bgwriter to flush buffers; this
notification can be sent either as soon as we encounter one
such buffer or after a few such buffers (in which case we need to decide
on some useful number). In this option, there is a chance that bgwriter
decides not to flush the buffer(s), which ideally should not happen because
I think bgwriter considers the number of recent allocations when
performing its scan to flush dirty buffers.
c. Have some mechanism where bgreclaim can notify bgwriter
to flush some specific buffers. If we have such a mechanism,
it could later even be used by backends if required.
d. Keep the logic as per the current patch and improve it such that it can
retain the behaviour of one cycle per two minutes as suggested above
by you, on the basis that in any case it is better than the current code.
I don't think option (d) is the best way to handle this scenario; however I
kept it in case nothing else sounds reasonable. Option (c) might involve a
lot of work which I am not sure is justifiable to handle the current
scenario, though it can be useful for some other things. Option (a) should
be okay for most cases, but I think option (b) would be better.
Instead, it would just run the clock sweep (i.e. the last loop inside
StrategyGetBuffer) and put the buffers onto the free list.
Don't we need to do more than just the last loop inside StrategyGetBuffer(),
as the clock sweep in StrategyGetBuffer() is responsible for getting one
buffer with usage_count = 0, whereas we need to run the loop till it
finds and moves enough such buffers to populate the freelist
with a number of buffers equal to its high water mark?
Then, we could leave the bgwriter logic more or less intact. It certainly
needs improvement, but that could be another patch.
Incidentally, while I generally think your changes to the locking regimen
in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK.
I have kept them outside the spinlock because, as per the patch, the only
call site for setting StrategyControl->bgwriterLatch is StrategyGetBuffer(),
and StrategyControl->numBufferAllocs is used just for statistics purposes
(which I thought might be okay even if it is not accurate), whereas without
the patch it is used by bgwriter for purposes other than stats as well.
However it certainly needs to be protected for the separate bgreclaim
process idea or for retaining the current bgwriter behaviour.
I think you should get rid of BufFreelistLock completely and just decide
that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.
Then, in StrategyGetBuffer, acquire the freelist_lck at the point where
the LWLock is acquired today. Increment StrategyControl->numBufferAllocs;
save the value of StrategyControl->bgwriterLatch; pop a buffer off the
freelist if there is one, saving its identity. Release the spinlock.
Then, set the bgwriterLatch if needed. In the first loop, first check
whether the buffer we previously popped from the freelist is pinned or has
a non-zero usage count and return it if not, holding the buffer header
lock. Otherwise, reacquire the spinlock just long enough to pop a new
potential victim and then loop around.
I shall take care of doing this way in next version of patch.
Under this locking strategy, StrategyNotifyBgWriter would use
freelist_lck. Right now, the patch removes the only caller, and should
therefore remove the function as well, but if we go with the new-process
idea listed above that part would get reverted, and then you'd need to make
it use the correct spinlock. You should also go through this patch and
remove all the commented-out bits and pieces that you haven't cleaned up;
those are distracting and unhelpful.
Sure.
Thank you for review.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 6, 2014 at 6:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
If I'm reading this right, the new statistic is an incrementing counter
where, every time you update it, you add the number of buffers currently on
the freelist. That makes no sense.
I think using 'number of buffers currently on the freelist' and
'number of recently allocated buffers' for consecutive cycles,
we can figure out approximately how many buffer allocations
need a clock sweep, assuming the low and high threshold water
marks are fixed. However there can be cases where it is not
easy to estimate that number.
Counters should be designed in such a way that you can read one, and then
read it again later, and make sense of it - you should not need to
read the counter on *consecutive* cycles to interpret it.
I think what you should be counting is the number of allocations that are
being satisfied from the free-list. Then, by comparing the rate at which
that value is incrementing to the rate at which buffers_alloc is
incrementing, somebody can figure out what percentage of allocations are
requiring a clock-sweep run. Actually, I think it's better to flip it
around: count the number of allocations that require an individual backend
to run the clock sweep (vs. being satisfied from the free-list); call it,
say, buffers_backend_clocksweep. We can then try to tune the patch to make
that number as small as possible under varying workloads.
This can give us a clear idea of how to tune the patch; however we need
to maintain 3 counters for it in the code (recent_alloc (needed for
the current bgwriter logic) and the other 2 suggested by you). Do you
want to retain such counters in the code, or are they meant as debug
info for the patch?
I only mean to propose one new counter, and I'd imagine including that
in the final patch. We already have a counter of total buffer
allocations; that's buffers_alloc. I'm proposing to add an additional
counter for the number of those allocations not satisfied from the
free list, with a name like buffers_alloc_clocksweep (I said
buffers_backend_clocksweep above, but that's probably not best, as the
existing buffers_backend counts buffer *writes*, not allocations). I
think we would definitely want to retain this counter in the final
patch, as an additional column in pg_stat_bgwriter.
d. Autotune the low and high threshold for freelist for various
configurations.
I think we need to come up with some kind of formula here rather than just
a list of hard-coded constants.
That was my initial intention as well, and I have tried approaches based
on the number of shared buffers, like keeping the threshold values as a
percentage of shared buffers, but nothing could satisfy different
kinds of workloads. The current values I have chosen are based
on experiments for various workloads at different thresholds. I have
shown the lwlock_stats data for various loads based on the current
thresholds upthread. Another way could be to make them config
knobs and use the values given by the user in case they are provided,
else go with fixed values.
How did you go about determining the optimal value for a particular workload?
When the list is kept short, it's less likely that a value on the list
will be referenced or dirtied again before the page is actually
recycled. That's clearly good. But when the list is long, it's less
likely to become completely empty and thereby force individual
backends to run the clock-sweep. My suspicion is that, when the
number of buffers is small, the impact of the list being too short
isn't likely to be very significant, because running the clock-sweep
isn't all that expensive anyway - even if you have to scan through the
entire buffer pool multiple times, there aren't that many buffers.
But when the number of buffers is large, those repeated scans can
cause a major performance hit, so having an adequate pool of free
buffers becomes much more important.
I think your list of high-watermarks is far too generous for low
buffer counts. With more than 100k shared buffers, you've got a
high-watermark of 2k buffers, which means that 2% or less of the
buffers will be on the freelist, which seems a little on the high side
to me, but probably in the ballpark of what we should be aiming for.
But at 10001 shared buffers, you can have 1000 of them on the
freelist, which is 10% of the buffer pool; that seems high. At 101
shared buffers, 75% of the buffers in the system can be on the
freelist; that seems ridiculous. The chances of a buffer still being
unused by the time it reaches the head of the freelist seem very
small.
Based on your existing list of thresholds, and taking the above into
account, I'd suggest something like this: let the high-watermark for
the freelist be 0.5% of the total number of buffers, with a maximum of
2000 and a minimum of 5. Let the low-watermark be 20% of the
high-watermark. That might not be best, but I think some kind of
formula like that can likely be made to work. I would suggest
focusing your testing on configurations with *large* settings for
shared_buffers, say 1-64GB, rather than small configurations. Anyone
who cares greatly about performance isn't going to be running with
only 8MB of shared_buffers anyway. Arguably we shouldn't even run the
reclaim process on very small configurations; I think there should
probably be a GUC (PGC_SIGHUP) to control whether it gets launched.
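As a sketch of that formula (purely illustrative; the function and variable
names are not from the patch):

    static void
    ComputeFreelistWatermarks(int nbuffers, int *low, int *high)
    {
        int     hi = nbuffers / 200;    /* 0.5% of shared buffers */

        if (hi > 2000)
            hi = 2000;
        if (hi < 5)
            hi = 5;

        *high = hi;
        *low = hi / 5;                  /* low watermark = 20% of high */
    }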
I think it would be a good idea to analyze how frequently the reclaim
process gets woken up. In the worst case, this happens once per (high
watermark - low watermark) allocations; that is, the system reaches
the low watermark and then does no further allocations until the
reclaim process brings the freelist back up to the high watermark.
But if more allocations occur between the time the reclaim process is
woken and the time it reaches the high watermark, then it should run
for longer, until the high watermark is reached. At least for
debugging purposes, I think it would be useful to have a counter of
reclaim wakeups. I'm not sure whether that's worth including in the
final patch, but it might be.
That will certainly help in retaining the current behaviour of
bgwriter and make the idea cleaner. I will modify the patch
to have a new background process unless somebody thinks
otherwise.
If we go with this approach, one thing which we need to decide
is what to do in case a buffer which has usage_count zero is *dirty*,
as I don't think it is a good idea to put it on the freelist.
I thought a bit about this yesterday. I think the problem is that we
might be in a situation where buffers are being dirtied faster than
they can be cleaned. In that case, if we only put clean buffers on the
freelist, then every backend in the system will be fighting over the
ever-dwindling supply of clean buffers until, in the worst case,
there's maybe only 1 clean buffer which is getting evicted repeatedly
at top speed - or maybe even no clean buffers, and the reclaim process
just spins in an infinite loop looking for clean buffers that aren't
there.
To put that another way, the rate at which buffers are being dirtied
can't exceed the rate at which they are being cleaned forever.
Eventually, somebody is going to have to wait. Having the backends
wait by being forced to write some dirty buffers does not seem like a
bad way to accomplish that. So I favor just putting the buffers on
freelist without regard to whether they are clean or dirty. If this
turns out not to work well we can look at other options (probably some
variant of (b) from your list).
Instead, it would just run the clock sweep (i.e. the last loop inside
StrategyGetBuffer) and put the buffers onto the free list.
Don't we need to do more than just the last loop inside StrategyGetBuffer(),
as the clock sweep in StrategyGetBuffer() is responsible for getting one
buffer with usage_count = 0, whereas we need to run the loop till it
finds and moves enough such buffers to populate the freelist
with a number of buffers equal to its high water mark?
Yeah, that's what I meant. Of course, it should add each buffer to
the freelist individually, not batch them up and add them all at once.
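For illustration, the reclaim loop might look roughly like this, assuming the
reclaimer's entry point is a function like BgMoveBuffersToFreelist (a sketch
only; ClockSweepOneBuffer, numFreeListBuffers and freelistHighWatermark are
invented names, and shutdown/error handling is omitted):

    static void
    BgMoveBuffersToFreelist(void)
    {
        for (;;)
        {
            volatile BufferDesc *buf;
            int     nfree;

            SpinLockAcquire(&StrategyControl->freelist_lck);
            nfree = StrategyControl->numFreeListBuffers;    /* made-up field */
            SpinLockRelease(&StrategyControl->freelist_lck);

            if (nfree >= freelistHighWatermark)             /* made-up variable */
                break;

            /* advance nextVictimBuffer under victimbuf_lck, decrementing usage
             * counts, until a buffer with zero usage count is found */
            buf = ClockSweepOneBuffer();                    /* made-up helper */

            /* add this single buffer to the freelist; no batching */
            StrategyFreeBuffer(buf);
        }
    }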
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 5, 2014 at 4:43 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what you've
changed. I realize you probably left it that way for testing purposes, but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out, so
that the scope of the changes you've made is clear to reviewers.
FWIW, I found this email almost unreadable because it misses quoting
signs after line breaks in quoted content.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 13, 2014 at 2:32 AM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-08-06 15:42:08 +0530, Amit Kapila wrote:
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com>
wrote:
This essentially removes BgWriterDelay, but it's still mentioned in
BgBufferSync(). Looking further, I see that with the patch applied,
BgBufferSync() is still present in the source code but is no longer
called
from anywhere. Please don't submit patches that render things unused
without actually removing them; it makes it much harder to see what
you've
changed. I realize you probably left it that way for testing purposes,
but
you need to clean such things up before submitting. Likewise, if you've
rendered GUCs or statistics counters unused, you need to rip them out,
so
that the scope of the changes you've made is clear to reviewers.
FWIW, I found this email almost unreadable because it misses quoting
signs after line breaks in quoted content.
I think I have done something wrong while replying to Robert's
mail. The main point in that mail was trying to see if there is any
major problem in case we have a separate process (bgreclaim) to
populate the freelist. One thing which I thought could be problematic
is putting a buffer on the freelist which has usage_count zero and is *dirty*.
Please do let me know if you want clarification on something in
particular.
Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.
There are other things also which I need to take care of as per
the feedback, like some changes in the locking strategy and code.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.
I'm not convinced that 3) is the right way to go, to be honest. Seems
like a huge band-aid to me.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 13, 2014 at 4:25 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
Overall, the main changes required in patch as per above feedback
are:
1. add an additional counter for the number of those
allocations not satisfied from the free list, with a
name like buffers_alloc_clocksweep.
2. Autotune the low and high threshold values for buffers
in freelist. In the patch, I have kept them as hard-coded
values.
3. For populating freelist, have a separate process (bgreclaim)
instead of doing it by bgwriter.
I'm not convinced that 3) is the right way to go, to be honest. Seems
like a huge band-aid to me.
Doing both (populating the freelist and flushing dirty buffers) via bgwriter
isn't the best way either, because it might not be able to perform
both jobs as needed.
One example is that it could take a much longer time to flush a dirty buffer
than to move it onto the free list, so if there are a few buffers which we need
to flush, then I think the task of maintaining buffers on the freelist will
suffer, even though there are buffers (non-dirty ones) which could be moved to
the free list.
Another is maintaining the current behaviour of bgwriter, which is to scan
the entire buffer pool every few minutes (assuming the default configuration).
We can attempt to solve this problem as suggested by Robert upthread,
but I am not completely sure if that can guarantee that the current
behaviour will be retained as is.
I am not saying that having a separate process won't have any issues,
but I think we can tackle them without changing or complicating the current
bgwriter behaviour.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Incidentally, while I generally think your changes to the locking regimen
in StrategyGetBuffer() are going in the right direction, they need
significant cleanup. Your patch adds two new spinlocks, freelist_lck and
victimbuf_lck, that mostly but not-quite replace BufFreelistLock, and
you've now got StrategyGetBuffer() running with no lock at all when
accessing some things that used to be protected by BufFreelistLock;
specifically, you're doing StrategyControl->numBufferAllocs++ and
SetLatch(StrategyControl->bgwriterLatch) without any locking. That's not
OK. I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.
Then, in StrategyGetBuffer, acquire the freelist_lck at the point where
the LWLock is acquired today. Increment StrategyControl->numBufferAllocs;
save the value of StrategyControl->bgwriterLatch; pop a buffer off the
freelist if there is one, saving its identity. Release the spinlock.
Then, set the bgwriterLatch if needed. In the first loop, first check
whether the buffer we previously popped from the freelist is pinned or has
a non-zero usage count and return it if not, holding the buffer header
lock. Otherwise, reacquire the spinlock just long enough to pop a new
potential victim and then loop around.
Today, while working on updating the patch to improve locking,
I found that, as we are now going to have a new process, we need
a separate latch in StrategyControl to wake up that process.
Another point is that I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck, and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.
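Something along these lines, as a sketch of what I have in mind (the
surrounding code is simplified):

    /* Advance the clock hand; bump completePasses on wraparound while we
     * already hold victimbuf_lck, instead of taking freelist_lck for it. */
    SpinLockAcquire(&StrategyControl->victimbuf_lck);
    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
        StrategyControl->nextVictimBuffer = 0;
        StrategyControl->completePasses++;
    }
    SpinLockRelease(&StrategyControl->victimbuf_lck);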
I thought it better to mention the above points so that if you have
any different thoughts about them, we can discuss them now
rather than after I take performance data with this locking protocol.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes:
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.
I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.
A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).
But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently. For a patch whose sole
excuse for existence is to improve performance, that should be a very
scary concern.
(And yes, I realize these issues already affect the freelist. Perhaps
that's part of the reason we have performance issues with it.)
regards, tom lane
On Tue, Aug 26, 2014 at 8:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.
I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.
In some cases, it could be beneficial, especially when a,b,c are
going to be accessed more frequently than x,y,z.
A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).
But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently.
I think the patch will reduce the contention on some such variables
(the ones accessed during the clock sweep), as it will minimize the need
for backends to perform the clock sweep.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Aug 26, 2014 at 11:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
On Tue, Aug 5, 2014 at 9:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think you should get rid of BufFreelistLock completely and just
decide that freelist_lck will protect the freeNext links, plus
everything in StrategyControl except for nextVictimBuffer. victimbuf_lck
will protect nextVictimBuffer and nothing else.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.
I'm rather concerned by this cavalier assumption that we can protect
fields a,b,c with one lock and fields x,y,z in the same struct with some
other lock.
A minimum requirement for that to work safely at all is that the fields
are of atomically fetchable/storable widths; which might be okay here
but it's a restriction that bears thinking about (and documenting).
But quite aside from safety, the fields are almost certainly going to
be in the same cache line which means contention between processes that
are trying to fetch or store them concurrently. For a patch whose sole
excuse for existence is to improve performance, that should be a very
scary concern.
(And yes, I realize these issues already affect the freelist. Perhaps
that's part of the reason we have performance issues with it.)
False sharing is certainly a concern that has crossed my mind while
looking at Amit's work, but the performance numbers he's posted
upthread are stellar. Maybe we can squeeze some additional
performance out of this by padding out the cache lines, but it's
probably minor compared to the gains he's already seeing. I think we
should focus on trying to lock in those gains, and then we can
consider what further things we may want to do after that. If it
turns out that structure-padding is among those things, that's easy
enough to do as a separate patch.
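If padding does turn out to matter, it could be as simple as something like
this (illustrative only; the field grouping and the 64-byte figure are
assumptions, not anything from the patch):

    typedef struct
    {
        /* freelist-related fields, protected by freelist_lck */
        slock_t     freelist_lck;
        int         firstFreeBuffer;
        int         lastFreeBuffer;

        char        padding[64];    /* keep the clock sweep state off the same
                                     * cache line (64 = typical line size) */

        /* clock sweep state, protected by victimbuf_lck */
        slock_t     victimbuf_lck;
        int         nextVictimBuffer;
    } PaddedBufferStrategyControl;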
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Today, while working on updating the patch to improve locking
I found that as now we are going to have a new process, we need
a separate latch in StrategyControl to wakeup that process.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.
Sounds reasonable. I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer. I don't see much point
in fine-tuning the locking regimen until that's done.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 27, 2014 at 8:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 26, 2014 at 10:53 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Today, while working on updating the patch to improve locking
I found that as now we are going to have a new process, we need
a separate latch in StrategyControl to wakeup that process.
Another point is I think it will be better to protect
StrategyControl->completePasses with victimbuf_lck rather than
freelist_lck, as when we are going to update it we will already be
holding the victimbuf_lck and it doesn't make much sense to release
the victimbuf_lck and reacquire freelist_lck to update it.
Sounds reasonable. I think the key thing at this point is to get a
new version of the patch with the background reclaim running in a
different process than the background writer. I don't see much point
in fine-tuning the locking regimen until that's done.
I have updated the patch to address the feedback. Main changes are:
1. For populating freelist, have a separate process (bgreclaimer)
instead of doing it by bgwriter.
2. Autotune the low and high threshold values for buffers
in freelist. I have used the formula as suggested by you upthread.
3. Cleanup of locking regimen as discussed upthread (completely
eliminated BufFreelistLock).
4. Improved comments and general code cleanup.
I have not yet added the statistics (buffers_backend_clocksweep), as
for that we need to add one more variable to the BufferStrategyControl
structure, where I have already added a few variables for this patch.
I think it is important to have such a stat available via
pg_stat_bgwriter, but I am not sure if it is worth making the structure
a bit more bulky.
Another minor point is about the changes in lwlock.h:
lwlock.h
* if you remove a lock, consider leaving a gap in the numbering
* sequence for the benefit of DTrace and other external debugging
* scripts.
As I have removed BufFreelistLock, I have adjusted the numbering
in lwlock.h as well. There is a message on top of the lock definitions
which suggests leaving a gap if we remove any lock; however I was not
sure whether this case (removing the first element) can affect anything,
so for now, I have adjusted the numbering.
I have yet to collect data under varying loads; however I have
collected performance data for 8GB shared buffers which shows
reasonably good performance and scalability.
I think the main part left for this patch is more data for various loads,
which I will share in the next few days; however I think the patch is ready
for the next round of review, so I will mark it as Needs Review.
Performance Data:
-------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale =C
checkpoint_segments=256
checkpoint_timeout =15min
shared_buffers=8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
All the data is in tps and taken using pgbench read-only load
Client Count/Patch_ver      8       16      32      64      128
HEAD                        58614   107370  140717  104357  65010
Patch                       60849   118701  165631  209226  213029
Note -
a. The numbers are slightly different from the previously reported
numbers, as earlier I was using debug builds of the binaries to take
data and it seems some kind of trace was enabled on the m/c.
However, the improvement in performance and scalability is quite
similar to before.
b. The above data is the median of 3 runs; for detailed data refer to the
attached document (perf_read_scalability_data_v5.ods).
CPU Usage
------------------
I have observed that the CPU usage for the new process (reclaimer) is
between 5~9%.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
scalable_buffer_eviction_v5.patch (application/octet-stream)
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..38698b0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -27,6 +27,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "pg_getopt.h"
+#include "postmaster/bgreclaimer.h"
#include "postmaster/bgwriter.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
@@ -179,7 +180,8 @@ static IndexList *ILHead = NULL;
* AuxiliaryProcessMain
*
* The main entry point for auxiliary processes, such as the bgwriter,
- * walwriter, walreceiver, bootstrapper and the shared memory checker code.
+ * walwriter, walreceiver, bgreclaimer, bootstrapper and the shared
+ * memory checker code.
*
* This code is here just because of historical reasons.
*/
@@ -323,6 +325,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
case WalReceiverProcess:
statmsg = "wal receiver process";
break;
+ case BgReclaimerProcess:
+ statmsg = "reclaimer process";
+ break;
default:
statmsg = "??? process";
break;
@@ -437,6 +442,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
WalReceiverMain();
proc_exit(1); /* should never return */
+ case BgReclaimerProcess:
+ /* don't set signals, bgreclaimer has its own agenda */
+ BackgroundReclaimerMain();
+ proc_exit(1); /* should never return */
+
default:
elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
proc_exit(1);
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c2321..168d0d8 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/postmaster
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
- pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+OBJS = autovacuum.o bgreclaimer.o bgworker.o bgwriter.o checkpointer.o \
+ fork_process.o pgarch.o pgstat.o postmaster.o startup.o syslogger.o \
+ walwriter.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgreclaimer.c b/src/backend/postmaster/bgreclaimer.c
new file mode 100644
index 0000000..55dc157
--- /dev/null
+++ b/src/backend/postmaster/bgreclaimer.c
@@ -0,0 +1,302 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.c
+ *
+ * The background reclaimer (bgreclaimer) is new as of Postgres 9.5. It
+ * attempts to keep regular backends from having to run clock sweep (which
+ * they would only do when they don't find a usable shared buffer from
+ * freelist to read in another page). In the best scenario all requests
+ * for shared buffers will be fulfilled from freelist as the background
+ * reclaimer process always tries to maintain buffers on freelist. However,
+ * regular backends are still empowered to run clock sweep to find a usable
+ * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
+ *
+ * The bgreclaimer is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGTERM, which instructs the bgreclaimer to exit(0).
+ * Emergency termination is by SIGQUIT; like any backend, the bgreclaimer will
+ * simply abort and exit on SIGQUIT.
+ *
+ * If the bgreclaimer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/postmaster/bgreclaimer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgreclaimer.h"
+#include "storage/bufmgr.h"
+#include "storage/buf_internals.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/* Signal handlers */
+
+static void bgreclaim_quickdie(SIGNAL_ARGS);
+static void BgreclaimSigHupHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for bgreclaim process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+BackgroundReclaimerMain(void)
+{
+ sigjmp_buf local_sigjmp_buf;
+ MemoryContext bgreclaim_context;
+
+ /*
+ * If possible, make this process a group leader, so that the postmaster
+ * can signal any child processes too. (bgreclaim probably never has any
+ * child processes, but for consistency we make all postmaster child
+ * processes do this.)
+ */
+#ifdef HAVE_SETSID
+ if (setsid() < 0)
+ elog(FATAL, "setsid() failed: %m");
+#endif
+
+ /*
+ * Properly accept or ignore signals the postmaster might send us.
+ *
+ * bgreclaim doesn't participate in ProcSignal signalling, but a SIGUSR1
+ * handler is still needed for latch wakeups.
+ */
+ pqsignal(SIGHUP, BgreclaimSigHupHandler); /* set flag to read config file */
+ pqsignal(SIGINT, SIG_IGN);
+ pqsignal(SIGTERM, ReqShutdownHandler); /* shutdown */
+ pqsignal(SIGQUIT, bgreclaim_quickdie); /* hard crash time */
+ pqsignal(SIGALRM, SIG_IGN);
+ pqsignal(SIGPIPE, SIG_IGN);
+ pqsignal(SIGUSR1, bgreclaim_sigusr1_handler);
+ pqsignal(SIGUSR2, SIG_IGN);
+
+ /*
+ * Reset some signals that are accepted by postmaster but not here
+ */
+ pqsignal(SIGCHLD, SIG_DFL);
+ pqsignal(SIGTTIN, SIG_DFL);
+ pqsignal(SIGTTOU, SIG_DFL);
+ pqsignal(SIGCONT, SIG_DFL);
+ pqsignal(SIGWINCH, SIG_DFL);
+
+ /* We allow SIGQUIT (quickdie) at all times */
+ sigdelset(&BlockSig, SIGQUIT);
+
+
+ /*
+ * Create a memory context that we will do all our work in. We do this so
+ * that we can reset the context during error recovery and thereby avoid
+ * possible memory leaks. As of now, the memory allocation can be done
+ * only during processing of SIGHUP signal.
+ */
+ bgreclaim_context = AllocSetContextCreate(TopMemoryContext,
+ "Background Reclaim",
+ ALLOCSET_DEFAULT_MINSIZE,
+ ALLOCSET_DEFAULT_INITSIZE,
+ ALLOCSET_DEFAULT_MAXSIZE);
+ MemoryContextSwitchTo(bgreclaim_context);
+
+ /*
+ * If an exception is encountered, processing resumes here.
+ *
+ * See notes in postgres.c about the design of this coding.
+ */
+ if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+ {
+ /* Since not using PG_TRY, must reset error stack by hand */
+ error_context_stack = NULL;
+
+ /* Prevent interrupts while cleaning up */
+ HOLD_INTERRUPTS();
+
+ /* Report the error to the server log */
+ EmitErrorReport();
+
+ /*
+ * These operations are really just a minimal subset of
+ * AbortTransaction(). We don't have very many resources to worry
+ * about in bgreclaim, but we do have buffers and file descriptors.
+ */
+ UnlockBuffers();
+ AtEOXact_Buffers(false);
+ AtEOXact_Files();
+
+ /*
+ * Now return to normal top-level context and clear ErrorContext for
+ * next time.
+ */
+ MemoryContextSwitchTo(bgreclaim_context);
+ FlushErrorState();
+
+ /* Flush any leaked data in the top-level context */
+ MemoryContextResetAndDeleteChildren(bgreclaim_context);
+
+ /* Now we can allow interrupts again */
+ RESUME_INTERRUPTS();
+ }
+
+ /* We can now handle ereport(ERROR) */
+ PG_exception_stack = &local_sigjmp_buf;
+
+ /*
+ * Unblock signals (they were blocked when the postmaster forked us)
+ */
+ PG_SETMASK(&UnBlockSig);
+
+ StrategyInitBgReclaimerLatch(&MyProc->procLatch);
+
+ /*
+ * Loop forever
+ */
+ for (;;)
+ {
+ int rc;
+
+ /* Clear any already-pending wakeups */
+ ResetLatch(&MyProc->procLatch);
+
+ if (got_SIGHUP)
+ {
+ got_SIGHUP = false;
+ ProcessConfigFile(PGC_SIGHUP);
+ }
+ if (shutdown_requested)
+ {
+ /*
+ * From here on, elog(ERROR) should end with exit(1), not send
+ * control back to the sigsetjmp block above
+ */
+ ExitOnAnyError = true;
+ /* Normal exit from the bgreclaimer is here */
+ proc_exit(0); /* done */
+ }
+
+ /*
+ * Backends will signal bgreclaimer when the number of buffers on the
+ * freelist falls below the low threshold of the freelist.
+ */
+ rc = WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH,
+ -1);
+
+ if (rc & WL_LATCH_SET)
+ BgMoveBuffersToFreelist();
+
+ /*
+ * Send off activity statistics to the stats collector
+ */
+ pgstat_send_bgwriter();
+
+ /*
+ * Emergency bailout if postmaster has died. This is to avoid the
+ * necessity for manual cleanup of all postmaster children.
+ */
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+ }
+}
+
+
+/* --------------------------------
+ * signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * bgreclaim_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+bgreclaim_quickdie(SIGNAL_ARGS)
+{
+ PG_SETMASK(&BlockSig);
+
+ /*
+ * We DO NOT want to run proc_exit() callbacks -- we're here because
+ * shared memory may be corrupted, so we don't want to try to clean up our
+ * transaction. Just nail the windows shut and get out of town. Now that
+ * there's an atexit callback to prevent third-party code from breaking
+ * things by calling exit() directly, we have to reset the callbacks
+ * explicitly to make this work as intended.
+ */
+ on_exit_reset();
+
+ /*
+ * Note we do exit(2) not exit(0). This is to force the postmaster into a
+ * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+ * backend. This is necessary precisely because we don't clean up our
+ * shared memory state. (The "dead man switch" mechanism in pmsignal.c
+ * should ensure the postmaster sees this as a crash, too, but no harm in
+ * being doubly sure.)
+ */
+ exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+BgreclaimSigHupHandler(SIGNAL_ARGS)
+{
+ int save_errno = errno;
+
+ got_SIGHUP = true;
+ if (MyProc)
+ SetLatch(&MyProc->procLatch);
+
+ errno = save_errno;
+}
+
+/* SIGTERM: set flag to shutdown and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+ int save_errno = errno;
+
+ shutdown_requested = true;
+ if (MyProc)
+ SetLatch(&MyProc->procLatch);
+
+ errno = save_errno;
+}
+
+/* SIGUSR1: used for latch wakeups */
+static void
+bgreclaim_sigusr1_handler(SIGNAL_ARGS)
+{
+ int save_errno = errno;
+
+ latch_sigusr1_handler();
+
+ errno = save_errno;
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b190cf5..1a34282 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -143,13 +143,13 @@
* authorization phase). This is used mainly to keep track of how many
* children we have and send them appropriate signals when necessary.
*
- * "Special" children such as the startup, bgwriter and autovacuum launcher
- * tasks are not in this list. Autovacuum worker and walsender are in it.
- * Also, "dead_end" children are in it: these are children launched just for
- * the purpose of sending a friendly rejection message to a would-be client.
- * We must track them because they are attached to shared memory, but we know
- * they will never become live backends. dead_end children are not assigned a
- * PMChildSlot.
+ * "Special" children such as the startup, bgwriter, bgreclaimer and
+ * autovacuum launcher tasks are not in this list. Autovacuum worker and
+ * walsender are in it. Also, "dead_end" children are in it: these are
+ * children launched just for the purpose of sending a friendly rejection
+ * message to a would-be client. We must track them because they are attached
+ * to shared memory, but we know they will never become live backends.
+ * dead_end children are not assigned a PMChildSlot.
*
* Background workers that request shared memory access during registration are
* in this list, too.
@@ -243,7 +243,8 @@ static pid_t StartupPID = 0,
AutoVacPID = 0,
PgArchPID = 0,
PgStatPID = 0,
- SysLoggerPID = 0;
+ SysLoggerPID = 0,
+ BgReclaimerPID = 0;
/* Startup/shutdown state */
#define NoShutdown 0
@@ -269,13 +270,13 @@ static bool RecoveryError = false; /* T if WAL recovery failed */
* hot standby during archive recovery.
*
* When the startup process is ready to start archive recovery, it signals the
- * postmaster, and we switch to PM_RECOVERY state. The background writer and
- * checkpointer are launched, while the startup process continues applying WAL.
- * If Hot Standby is enabled, then, after reaching a consistent point in WAL
- * redo, startup process signals us again, and we switch to PM_HOT_STANDBY
- * state and begin accepting connections to perform read-only queries. When
- * archive recovery is finished, the startup process exits with exit code 0
- * and we switch to PM_RUN state.
+ * postmaster, and we switch to PM_RECOVERY state. The background writer,
+ * background reclaimer and checkpointer are launched, while the startup
+ * process continues applying WAL. If Hot Standby is enabled, then, after
+ * reaching a consistent point in WAL redo, startup process signals us again,
+ * and we switch to PM_HOT_STANDBY state and begin accepting connections to
+ * perform read-only queries. When archive recovery is finished, the startup
+ * process exits with exit code 0 and we switch to PM_RUN state.
*
* Normal child backends can only be launched when we are in PM_RUN or
* PM_HOT_STANDBY state. (We also allow launch of normal
@@ -505,6 +506,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
#define StartCheckpointer() StartChildProcess(CheckpointerProcess)
#define StartWalWriter() StartChildProcess(WalWriterProcess)
#define StartWalReceiver() StartChildProcess(WalReceiverProcess)
+#define StartBackgroundReclaimer() StartChildProcess(BgReclaimerProcess)
/* Macros to check exit status of a child process */
#define EXIT_STATUS_0(st) ((st) == 0)
@@ -568,8 +570,8 @@ PostmasterMain(int argc, char *argv[])
* handling setup of child processes. See tcop/postgres.c,
* bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
* postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
- * postmaster/syslogger.c, postmaster/bgworker.c and
- * postmaster/checkpointer.c.
+ * postmaster/syslogger.c, postmaster/bgworker.c, postmaster/bgreclaimer.c
+ * and postmaster/checkpointer.c.
*/
pqinitmask();
PG_SETMASK(&BlockSig);
@@ -1583,7 +1585,8 @@ ServerLoop(void)
/*
* If no background writer process is running, and we are not in a
* state that prevents it, start one. It doesn't matter if this
- * fails, we'll just try again later. Likewise for the checkpointer.
+ * fails, we'll just try again later. Likewise for the checkpointer
+ * and bgreclaimer.
*/
if (pmState == PM_RUN || pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY)
@@ -1592,6 +1595,8 @@ ServerLoop(void)
CheckpointerPID = StartCheckpointer();
if (BgWriterPID == 0)
BgWriterPID = StartBackgroundWriter();
+ if (BgReclaimerPID == 0)
+ BgReclaimerPID = StartBackgroundReclaimer();
}
/*
@@ -2330,6 +2335,8 @@ SIGHUP_handler(SIGNAL_ARGS)
signal_child(SysLoggerPID, SIGHUP);
if (PgStatPID != 0)
signal_child(PgStatPID, SIGHUP);
+ if (BgReclaimerPID != 0)
+ signal_child(BgReclaimerPID, SIGHUP);
/* Reload authentication config files too */
if (!load_hba())
@@ -2398,6 +2405,9 @@ pmdie(SIGNAL_ARGS)
/* and the walwriter too */
if (WalWriterPID != 0)
signal_child(WalWriterPID, SIGTERM);
+ /* and the bgreclaimer too */
+ if (BgReclaimerPID != 0)
+ signal_child(BgReclaimerPID, SIGTERM);
/*
* If we're in recovery, we can't kill the startup process
@@ -2440,14 +2450,16 @@ pmdie(SIGNAL_ARGS)
signal_child(BgWriterPID, SIGTERM);
if (WalReceiverPID != 0)
signal_child(WalReceiverPID, SIGTERM);
+ if (BgReclaimerPID != 0)
+ signal_child(BgReclaimerPID, SIGTERM);
SignalUnconnectedWorkers(SIGTERM);
if (pmState == PM_RECOVERY)
{
/*
- * Only startup, bgwriter, walreceiver, unconnected bgworkers,
- * and/or checkpointer should be active in this state; we just
- * signaled the first four, and we don't want to kill
- * checkpointer yet.
+ * Only startup, bgwriter, walreceiver, bgreclaimer,
+ * unconnected bgworkers, and/or checkpointer should be
+ * active in this state; we just signaled the first five,
+ * and we don't want to kill checkpointer yet.
*/
pmState = PM_WAIT_BACKENDS;
}
@@ -2600,6 +2612,8 @@ reaper(SIGNAL_ARGS)
BgWriterPID = StartBackgroundWriter();
if (WalWriterPID == 0)
WalWriterPID = StartWalWriter();
+ if (BgReclaimerPID == 0)
+ BgReclaimerPID = StartBackgroundReclaimer();
/*
* Likewise, start other special children as needed. In a restart
@@ -2625,7 +2639,8 @@ reaper(SIGNAL_ARGS)
/*
* Was it the bgwriter? Normal exit can be ignored; we'll start a new
* one at the next iteration of the postmaster's main loop, if
- * necessary. Any other exit condition is treated as a crash.
+ * necessary. Any other exit condition is treated as a crash. Likewise
+ * for bgreclaimer.
*/
if (pid == BgWriterPID)
{
@@ -2636,6 +2651,17 @@ reaper(SIGNAL_ARGS)
continue;
}
+ if (pid == BgReclaimerPID)
+ {
+ BgReclaimerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("background reclaimer process"));
+ continue;
+ }
+
/*
* Was it the checkpointer?
*/
@@ -2997,7 +3023,7 @@ CleanupBackend(int pid,
/*
* HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, bgreclaimer, or background worker.
*
* The objectives here are to clean up our local state about the child
* process, and to signal all other remaining children to quickdie.
@@ -3201,6 +3227,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the bgreclaimer too */
+ if (pid == BgReclaimerPID)
+ BgReclaimerPID = 0;
+ else if (BgReclaimerPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) BgReclaimerPID)));
+ signal_child(BgReclaimerPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
/*
* Force a power-cycle of the pgarch process too. (This isn't absolutely
* necessary, but it seems like a good idea for robustness, and it
@@ -3371,14 +3409,14 @@ PostmasterStateMachine(void)
/*
* PM_WAIT_BACKENDS state ends when we have no regular backends
* (including autovac workers), no bgworkers (including unconnected
- * ones), and no walwriter, autovac launcher or bgwriter. If we are
- * doing crash recovery or an immediate shutdown then we expect the
- * checkpointer to exit as well, otherwise not. The archiver, stats,
- * and syslogger processes are disregarded since they are not
- * connected to shared memory; we also disregard dead_end children
- * here. Walsenders are also disregarded, they will be terminated
- * later after writing the checkpoint record, like the archiver
- * process.
+ * ones), and no walwriter, autovac launcher, bgwriter or bgreclaimer.
+ * If we are doing crash recovery or an immediate shutdown then we
+ * expect the checkpointer to exit as well, otherwise not. The
+ * archiver, stats, and syslogger processes are disregarded since they
+ * are not connected to shared memory; we also disregard dead_end
+ * children here. Walsenders are also disregarded, they will be
+ * terminated later after writing the checkpoint record, like the
+ * archiver process.
*/
if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_WORKER) == 0 &&
CountUnconnectedWorkers() == 0 &&
@@ -3388,7 +3426,8 @@ PostmasterStateMachine(void)
(CheckpointerPID == 0 ||
(!FatalError && Shutdown < ImmediateShutdown)) &&
WalWriterPID == 0 &&
- AutoVacPID == 0)
+ AutoVacPID == 0 &&
+ BgReclaimerPID == 0)
{
if (Shutdown >= ImmediateShutdown || FatalError)
{
@@ -3486,6 +3525,7 @@ PostmasterStateMachine(void)
Assert(CheckpointerPID == 0);
Assert(WalWriterPID == 0);
Assert(AutoVacPID == 0);
+ Assert(BgReclaimerPID == 0);
/* syslogger is not considered here */
pmState = PM_NO_CHILDREN;
}
@@ -3698,6 +3738,8 @@ TerminateChildren(int signal)
signal_child(WalReceiverPID, signal);
if (AutoVacPID != 0)
signal_child(AutoVacPID, signal);
+ if (BgReclaimerPID != 0)
+ signal_child(BgReclaimerPID, signal);
if (PgArchPID != 0)
signal_child(PgArchPID, signal);
if (PgStatPID != 0)
@@ -4779,6 +4821,8 @@ sigusr1_handler(SIGNAL_ARGS)
CheckpointerPID = StartCheckpointer();
Assert(BgWriterPID == 0);
BgWriterPID = StartBackgroundWriter();
+ Assert(BgReclaimerPID == 0);
+ BgReclaimerPID = StartBackgroundReclaimer();
pmState = PM_RECOVERY;
}
@@ -5123,6 +5167,10 @@ StartChildProcess(AuxProcType type)
ereport(LOG,
(errmsg("could not fork WAL receiver process: %m")));
break;
+ case BgReclaimerProcess:
+ ereport(LOG,
+ (errmsg("could not fork background writer process: %m")));
+ break;
default:
ereport(LOG,
(errmsg("could not fork process: %m")));
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 1fd38d0..9b47eb2 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -125,14 +125,12 @@ bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
+* BufferStrategyControl contains a spinlock freelist_lck that provides mutual
exclusion for operations that access the buffer free list or select
-buffers for replacement. This is always taken in exclusive mode since
-there are no read-only operations on those data structures. The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.) It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+buffers for replacement.  Previously, a single LWLock (BufFreelistLock)
+protected the freelist, since selecting a victim buffer via the clock sweep
+is a comparatively long operation; now two spinlocks, freelist_lck and
+victimbuf_lck, protect freelist operations and the clock sweep respectively.
* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
@@ -160,16 +158,18 @@ Normal Buffer Replacement Strategy
There is a "free list" of buffers that are prime candidates for replacement.
In particular, buffers that are completely free (contain no valid page) are
-always in this list. We could also throw buffers into this list if we
-consider their pages unlikely to be needed soon; however, the current
-algorithm never does that. The list is singly-linked using fields in the
+always in this list.  We also throw buffers into this list if we consider
+their pages unlikely to be needed soon; this is done by the background
+reclaimer process.  The list is singly-linked using fields in the
buffer headers; we maintain head and tail pointers in global variables.
(Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the freelist_lck, not the buffer-header
spinlocks.) To choose a victim buffer to recycle when there are no free
buffers available, we use a simple clock-sweep algorithm, which avoids the
-need to take system-wide locks during common operations. It works like
-this:
+need to take system-wide locks during common operations.  The background
+reclaimer tries to keep regular backends from having to run the clock sweep
+by maintaining buffers on the freelist; however, backends can still run
+the clock sweep themselves when needed.  Clock sweep works like this:
Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is pinned. (This requires only the
@@ -178,25 +178,28 @@ buffer reference count, so it's nearly free.)
The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
through all the available buffers. nextVictimBuffer is protected by the
-BufFreelistLock.
+victimbuf_lck spinlock.
The algorithm for a process that needs to obtain a victim buffer is:
-1. Obtain BufFreelistLock.
+1. Obtain the freelist_lck spinlock.
-2. If buffer free list is nonempty, remove its head buffer. If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2. Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If the buffer free list is nonempty, remove its head buffer and release
+freelist_lck; then set the bgwriter or bgreclaimer latch if required.
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
+3. If we obtained a buffer, check whether it is pinned or has a nonzero
+usage count; if not, pin the buffer and return it.  Otherwise, try again
+to get a buffer from the freelist and return to the start of
+step 3.
+
+4. Otherwise, select the buffer pointed to by nextVictimBuffer, and
circularly advance nextVictimBuffer for next time.
-4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used. Decrement its usage count (if nonzero) and return to step 3 to
+5. If the selected buffer is pinned or has a nonzero usage count, it cannot
+be used. Decrement its usage count (if nonzero) and return to step 4 to
examine the next buffer.
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+6. Pin the selected buffer, and return the buffer.
(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +262,7 @@ dirty and not pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.
If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take victimbuf_lck in order to look
for buffers to write; it needs only to spinlock each buffer header for long
enough to check the dirtybit. Even without that assumption, the writer
only needs to take the lock long enough to read the variable value, not
@@ -281,3 +284,19 @@ As of 8.4, background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that checkpoints it
writes are technically restartpoints.
+
+
+Background Reclaimer's Processing
+---------------------------------
+
+The background reclaimer is designed to move buffers that are likely to be
+recycled soon onto the freelist, thereby offloading clock sweep work from
+active backends.  To do this, it runs the clock sweep and moves unpinned,
+zero-usage-count buffers to the freelist.  It keeps doing this until the
+number of buffers on the freelist reaches the freelist's high threshold.
+
+Two threshold indicators are used to maintain a sufficient number of
+buffers on the freelist.  The low threshold is used by backends to wake
+bgreclaimer when the number of buffers on the freelist falls below it.
+The high threshold is the target up to which bgreclaimer fills the
+freelist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 938c554..8df0eee 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -605,15 +605,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Loop here in case we have to try another victim buffer */
for (;;)
{
- bool lock_held;
-
/*
* Select a victim buffer. The buffer is returned with its header
- * spinlock still held! Also (in most cases) the BufFreelistLock is
- * still held, since it would be bad to hold the spinlock while
- * possibly waking up other processes.
+ * spinlock still held!
*/
- buf = StrategyGetBuffer(strategy, &lock_held);
+ buf = StrategyGetBuffer(strategy);
Assert(buf->refcount == 0);
@@ -623,10 +619,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* Pin the buffer and then release the buffer spinlock */
PinBuffer_Locked(buf);
- /* Now it's safe to release the freelist lock */
- if (lock_held)
- LWLockRelease(BufFreelistLock);
-
/*
* If the buffer was dirty, try to write it out. There is a race
* condition here, in that someone might dirty it after we released it
@@ -1637,6 +1629,74 @@ BgBufferSync(void)
}
/*
+ * Move buffers whose reference count and usage_count are zero to the
+ * freelist.  By maintaining enough buffers on the freelist (up to its
+ * high threshold), we drastically reduce the odds that backends have
+ * to run the clock sweep themselves.
+ *
+ * This is called by the background reclaimer process when the number
+ * of buffers on the freelist falls below the freelist's low threshold.
+ */
+void
+BgMoveBuffersToFreelist(void)
+{
+ volatile uint32 next_to_clean;
+ uint32 num_to_free;
+ uint32 tmp_num_to_free;
+ uint32 save_next_to_clean;
+ uint32 recent_alloc;
+ volatile BufferDesc *bufHdr;
+
+ StrategySyncStartAndEnd(&save_next_to_clean,
+ &num_to_free,
+ &recent_alloc);
+
+ /* Report buffer alloc counts to pgstat */
+ BgWriterStats.m_buf_alloc += recent_alloc;
+
+ tmp_num_to_free = num_to_free;
+ next_to_clean = save_next_to_clean;
+
+ /* Execute the LRU scan */
+ while (tmp_num_to_free > 0)
+ {
+ bufHdr = &BufferDescriptors[next_to_clean];
+
+ LockBufHdr(bufHdr);
+
+ if (bufHdr->refcount == 0)
+ {
+ if (bufHdr->usage_count > 0)
+ {
+ /*
+ * Reduce usage count so that we can find the reusable
+ * buffers in consecutive cycles.
+ */
+ bufHdr->usage_count--;
+ UnlockBufHdr(bufHdr);
+ }
+ else
+ {
+ UnlockBufHdr(bufHdr);
+ if (StrategyMoveBufferToFreeListEnd (bufHdr))
+ tmp_num_to_free--;
+ }
+ }
+ else
+ UnlockBufHdr(bufHdr);
+
+ /* choose next victim buffer to clean. */
+ StrategySyncNextVictimBuffer(&next_to_clean);
+ }
+
+
+#ifdef BGW_DEBUG
+ elog(DEBUG1, "bgreclaimer: recent_alloc=%u next_to_clean=%d num_freed=%u",
+ recent_alloc, save_next_to_clean, num_to_free);
+#endif
+}
+
+/*
* SyncOneBuffer -- process a single buffer during syncing.
*
* If skip_recently_used is true, we don't write currently-pinned buffers, nor
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4befab0..9594f92 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -29,6 +29,7 @@ typedef struct
int firstFreeBuffer; /* Head of list of unused buffers */
int lastFreeBuffer; /* Tail of list of unused buffers */
+ int numFreeListBuffers; /* number of buffers on freelist */
/*
* NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
@@ -43,9 +44,27 @@ typedef struct
uint32 numBufferAllocs; /* Buffers allocated since last reset */
/*
- * Notification latch, or NULL if none. See StrategyNotifyBgWriter.
+ * Protects the freelist variables (firstFreeBuffer, lastFreeBuffer,
+ * numFreeListBuffers, BufferDesc->freeNext).
+ */
+ slock_t freelist_lck;
+
+ /*
+ * Protects nextVictimBuffer and completePasses.  We use a separate
+ * lock for these so that one backend's clock sweep does not contend
+ * with another backend that is removing a buffer from the
+ * freelist.
+ */
+ slock_t victimbuf_lck;
+
+ /*
+ * Latch to wake bgwriter.
*/
Latch *bgwriterLatch;
+ /*
+ * Latch to wake bgreclaimer.
+ */
+ Latch *bgreclaimerLatch;
} BufferStrategyControl;
/* Pointers to shared state */
@@ -84,6 +103,19 @@ typedef struct BufferAccessStrategyData
Buffer buffers[1]; /* VARIABLE SIZE ARRAY */
} BufferAccessStrategyData;
+/*
+ * Threshold indicators for maintaining buffers on the freelist.  When the
+ * number of buffers on the freelist drops below the low threshold, the
+ * allocating backend sets the latch; bgreclaimer wakes up and begins
+ * adding buffers to the freelist until it reaches the high threshold,
+ * and then goes back to sleep.
+ */
+int freelistLowThreshold;
+int freelistHighThreshold;
+
+/* Multipliers used to compute the freelist thresholds (see StrategyInitialize) */
+#define HIGH_THRESHOLD_FREELIST_BUFFERS_PERCENT 0.005
+#define LOW_THRESHOLD_FREELIST_BUFFERS_PERCENT 0.2
/* Prototypes for internal functions */
static volatile BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy);
@@ -101,67 +133,51 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
* strategy is a BufferAccessStrategy object, or NULL for default strategy.
*
* To ensure that no one else can pin the buffer before we do, we must
- * return the buffer with the buffer header spinlock still held. If
- * *lock_held is set on exit, we have returned with the BufFreelistLock
- * still held, as well; the caller must release that lock once the spinlock
- * is dropped. We do it that way because releasing the BufFreelistLock
- * might awaken other processes, and it would be bad to do the associated
- * kernel calls while holding the buffer header spinlock.
+ * return the buffer with the buffer header spinlock still held.
*/
volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
{
- volatile BufferDesc *buf;
+ volatile BufferDesc *buf = NULL;
Latch *bgwriterLatch;
+ Latch *bgreclaimerLatch;
+ int numFreeListBuffers;
int trycounter;
/*
* If given a strategy object, see whether it can select a buffer. We
- * assume strategy objects don't need the BufFreelistLock.
+ * assume strategy objects don't need the freelist_lck.
*/
if (strategy != NULL)
{
buf = GetBufferFromRing(strategy);
if (buf != NULL)
- {
- *lock_held = false;
return buf;
- }
}
/* Nope, so lock the freelist */
- *lock_held = true;
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
/*
- * We count buffer allocation requests so that the bgwriter can estimate
- * the rate of buffer consumption. Note that buffers recycled by a
- * strategy object are intentionally not counted here.
+ * We count buffer allocation requests so that the bgwriter or bgreclaimer
+ * can estimate the rate of buffer consumption and report it in the
+ * statistics.  Note that buffers recycled by a strategy object are
+ * intentionally not counted here.
*/
StrategyControl->numBufferAllocs++;
/*
- * If bgwriterLatch is set, we need to waken the bgwriter, but we should
- * not do so while holding BufFreelistLock; so release and re-grab. This
- * is annoyingly tedious, but it happens at most once per bgwriter cycle,
- * so the performance hit is minimal.
+ * Remember the bgwriter and bgreclaimer latches so that they can be set
+ * after the spinlock is released; then try to get a buffer from the freelist.
*/
+ bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
bgwriterLatch = StrategyControl->bgwriterLatch;
if (bgwriterLatch)
- {
StrategyControl->bgwriterLatch = NULL;
- LWLockRelease(BufFreelistLock);
- SetLatch(bgwriterLatch);
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
- }
- /*
- * Try to get a buffer from the freelist. Note that the freeNext fields
- * are considered to be protected by the BufFreelistLock not the
- * individual buffer spinlocks, so it's OK to manipulate them without
- * holding the spinlock.
- */
- while (StrategyControl->firstFreeBuffer >= 0)
+ numFreeListBuffers = StrategyControl->numFreeListBuffers;
+
+ if (StrategyControl->firstFreeBuffer >= 0)
{
buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
@@ -169,28 +185,86 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
/* Unconditionally remove buffer from freelist */
StrategyControl->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
+ --StrategyControl->numFreeListBuffers;
+ }
+
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ /*
+ * If bgwriterLatch is set, we need to waken the bgwriter, but we should
+ * not do so while holding freelist_lck; so set it after releasing the
+ * freelist_lck. This is annoyingly tedious, but it happens at most once
+ * per bgwriter cycle, so the performance hit is minimal.
+ */
+ if (bgwriterLatch)
+ SetLatch(bgwriterLatch);
+ /*
+ * Ideally numFreeListBuffers would be read under the freelist spinlock;
+ * however, we only need it to estimate the approximate number of free
+ * buffers required on the freelist, so a slightly stale value is not a
+ * problem.  bgreclaimerLatch is initialized early during bgreclaimer
+ * startup, but we still check it before use to avoid any problem in
+ * case we reach here before its initialization.
+ */
+ if (numFreeListBuffers < freelistLowThreshold && bgreclaimerLatch)
+ SetLatch(StrategyControl->bgreclaimerLatch);
+
+ if (buf != NULL)
+ {
/*
- * If the buffer is pinned or has a nonzero usage_count, we cannot use
- * it; discard it and retry. (This can only happen if VACUUM put a
- * valid buffer in the freelist and then someone else used it before
- * we got to it. It's probably impossible altogether as of 8.3, but
- * we'd better check anyway.)
+ * Try to get a buffer from the freelist. Note that the freeNext fields
+ * are considered to be protected by the freelist_lck not the
+ * individual buffer spinlocks, so it's OK to manipulate them without
+ * holding the buffer spinlock.
*/
- LockBufHdr(buf);
- if (buf->refcount == 0 && buf->usage_count == 0)
+ for(;;)
{
- if (strategy != NULL)
- AddBufferToRing(strategy, buf);
- return buf;
+ /*
+ * If the buffer is pinned or has a nonzero usage_count, we cannot use
+ * it; discard it and retry. (This can only happen if VACUUM put a
+ * valid buffer in the freelist and then someone else used it before
+ * we got to it. It's probably impossible altogether as of 8.3, but
+ * we'd better check anyway.)
+ */
+ LockBufHdr(buf);
+ if (buf->refcount == 0 && buf->usage_count == 0)
+ {
+ if (strategy != NULL)
+ AddBufferToRing(strategy, buf);
+ return buf;
+ }
+ UnlockBufHdr(buf);
+
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+
+ if (StrategyControl->firstFreeBuffer >= 0)
+ {
+ buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
+ Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
+
+ /* Unconditionally remove buffer from freelist */
+ StrategyControl->firstFreeBuffer = buf->freeNext;
+ buf->freeNext = FREENEXT_NOT_IN_LIST;
+ --StrategyControl->numFreeListBuffers;
+
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ }
+ else
+ {
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ break;
+ }
}
- UnlockBufHdr(buf);
}
/* Nothing on the freelist, so run the "clock sweep" algorithm */
trycounter = NBuffers;
+
for (;;)
{
+ SpinLockAcquire(&StrategyControl->victimbuf_lck);
+
buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
if (++StrategyControl->nextVictimBuffer >= NBuffers)
@@ -199,6 +273,8 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
StrategyControl->completePasses++;
}
+ SpinLockRelease(&StrategyControl->victimbuf_lck);
+
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot use
* it; decrement the usage_count (unless pinned) and keep scanning.
@@ -241,7 +317,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
void
StrategyFreeBuffer(volatile BufferDesc *buf)
{
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
/*
* It is possible that we are told to put something in the freelist that
@@ -253,12 +329,51 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
if (buf->freeNext < 0)
StrategyControl->lastFreeBuffer = buf->buf_id;
StrategyControl->firstFreeBuffer = buf->buf_id;
+ ++StrategyControl->numFreeListBuffers;
}
- LWLockRelease(BufFreelistLock);
+ SpinLockRelease(&StrategyControl->freelist_lck);
}
/*
+ * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
+ */
+bool
+StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
+{
+ bool freed = false;
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+
+ /*
+ * It is possible that we are told to put something in the freelist that
+ * is already in it; don't screw up the list if so.
+ */
+ if (buf->freeNext == FREENEXT_NOT_IN_LIST)
+ {
+ ++StrategyControl->numFreeListBuffers;
+ freed = true;
+ /*
+ * Put the buffer at the end of the list; if the list is empty, point
+ * both the first and last free-buffer pointers at this buffer's id.
+ */
+ buf->freeNext = FREENEXT_END_OF_LIST;
+ if (StrategyControl->firstFreeBuffer < 0)
+ {
+ StrategyControl->firstFreeBuffer = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+ return freed;
+ }
+ BufferDescriptors[StrategyControl->lastFreeBuffer].freeNext = buf->buf_id;
+ StrategyControl->lastFreeBuffer = buf->buf_id;
+ }
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ return freed;
+}
+
+
+/*
* StrategySyncStart -- tell BufferSync where to start syncing
*
* The result is the buffer index of the best buffer to sync first.
@@ -274,20 +389,76 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
{
int result;
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->victimbuf_lck);
result = StrategyControl->nextVictimBuffer;
+
if (complete_passes)
*complete_passes = StrategyControl->completePasses;
+ SpinLockRelease(&StrategyControl->victimbuf_lck);
+
if (num_buf_alloc)
{
+ SpinLockAcquire(&StrategyControl->freelist_lck);
*num_buf_alloc = StrategyControl->numBufferAllocs;
StrategyControl->numBufferAllocs = 0;
+ SpinLockRelease(&StrategyControl->freelist_lck);
}
- LWLockRelease(BufFreelistLock);
return result;
}
/*
+ * StrategySyncStartAndEnd -- tell Bgreclaimer where to start looking
+ * for unused buffers.
+ *
+ * The results are the buffer index at which to start looking for unused
+ * buffers, the number of buffers that need to be moved to the freelist,
+ * and the count of recent buffer allocations.
+ */
+void
+StrategySyncStartAndEnd(uint32 *start, uint32 *end, uint32 *num_buf_alloc)
+{
+ int curfreebuffers;
+
+ SpinLockAcquire(&StrategyControl->victimbuf_lck);
+ *start = StrategyControl->nextVictimBuffer;
+ SpinLockRelease(&StrategyControl->victimbuf_lck);
+
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+ curfreebuffers = StrategyControl->numFreeListBuffers;
+ if (curfreebuffers < freelistHighThreshold)
+ *end = freelistHighThreshold - curfreebuffers;
+ else
+ *end = 0;
+
+ /*
+ * numBufferAllocs is needed only for statistics purposes, so reading
+ * and resetting it here under freelist_lck is sufficient.
+ */
+ if (num_buf_alloc)
+ {
+ *num_buf_alloc = StrategyControl->numBufferAllocs;
+ StrategyControl->numBufferAllocs = 0;
+ }
+ SpinLockRelease(&StrategyControl->freelist_lck);
+
+ return;
+}
+
+/*
+ * StrategySyncNextVictimBuffer -- advance the clock-sweep hand and tell
+ * bgreclaimer which buffer to examine next.
+ */
+void
+StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer)
+{
+ SpinLockAcquire(&StrategyControl->victimbuf_lck);
+ if (++StrategyControl->nextVictimBuffer >= NBuffers)
+ StrategyControl->nextVictimBuffer = 0;
+ *next_victim_buffer = StrategyControl->nextVictimBuffer;
+ SpinLockRelease(&StrategyControl->victimbuf_lck);
+}
+
+/*
* StrategyNotifyBgWriter -- set or clear allocation notification latch
*
* If bgwriterLatch isn't NULL, the next invocation of StrategyGetBuffer will
@@ -299,15 +470,27 @@ void
StrategyNotifyBgWriter(Latch *bgwriterLatch)
{
/*
- * We acquire the BufFreelistLock just to ensure that the store appears
+ * We acquire the freelist_lck just to ensure that the store appears
* atomic to StrategyGetBuffer. The bgwriter should call this rather
* infrequently, so there's no performance penalty from being safe.
*/
- LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&StrategyControl->freelist_lck);
StrategyControl->bgwriterLatch = bgwriterLatch;
- LWLockRelease(BufFreelistLock);
+ SpinLockRelease(&StrategyControl->freelist_lck);
}
+/*
+ * StrategyInitBgReclaimerLatch -- Initialize bgreclaimer latch.
+ * Backends set this latch to wake bgreclaimer when the freelist
+ * runs low.
+ */
+void
+StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch)
+{
+ SpinLockAcquire(&StrategyControl->freelist_lck);
+ StrategyControl->bgreclaimerLatch = bgreclaimerLatch;
+ SpinLockRelease(&StrategyControl->freelist_lck);
+}
/*
* StrategyShmemSize
@@ -376,6 +559,7 @@ StrategyInitialize(bool init)
*/
StrategyControl->firstFreeBuffer = 0;
StrategyControl->lastFreeBuffer = NBuffers - 1;
+ StrategyControl->numFreeListBuffers = NBuffers;
/* Initialize the clock sweep pointer */
StrategyControl->nextVictimBuffer = 0;
@@ -386,9 +570,31 @@ StrategyInitialize(bool init)
/* No pending notification */
StrategyControl->bgwriterLatch = NULL;
+ StrategyControl->bgreclaimerLatch = NULL;
+ SpinLockInit(&StrategyControl->freelist_lck);
+ SpinLockInit(&StrategyControl->victimbuf_lck);
}
else
Assert(!init);
+
+ /*
+ * Initialize the low and high threshold number of buffers for the
+ * freelist.  These are used to maintain enough buffers on the
+ * freelist that backends rarely need to run the clock sweep to find
+ * a buffer.  We need enough buffers that most requests can be
+ * satisfied from the freelist; if the threshold calculation yields
+ * more than 2000 or fewer than 5 buffers, the high threshold is
+ * clamped to those hard-coded values.  These numbers are based on
+ * benchmark results at various workloads.
+ */
+ freelistHighThreshold = HIGH_THRESHOLD_FREELIST_BUFFERS_PERCENT * NBuffers;
+ if (freelistHighThreshold < 5)
+ freelistHighThreshold = 5;
+ else if (freelistHighThreshold > 2000)
+ freelistHighThreshold = 2000;
+
+ freelistLowThreshold = LOW_THRESHOLD_FREELIST_BUFFERS_PERCENT *
+ freelistHighThreshold;
}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c2b786e..826af06 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -366,6 +366,7 @@ typedef enum
CheckpointerProcess,
WalWriterProcess,
WalReceiverProcess,
+ BgReclaimerProcess,
NUM_AUXPROCTYPES /* Must be last! */
} AuxProcType;
diff --git a/src/include/postmaster/bgreclaimer.h b/src/include/postmaster/bgreclaimer.h
new file mode 100644
index 0000000..bbd6943
--- /dev/null
+++ b/src/include/postmaster/bgreclaimer.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * bgreclaimer.h
+ * POSTGRES buffer reclaimer definitions.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/bgreclaimer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _BGRECLAIMER_H
+#define _BGRECLAIMER_H
+
+extern void BackgroundReclaimerMain(void) __attribute__((noreturn));
+
+
+#endif /* _BGRECLAIMER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c019013..f7a1631 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -115,9 +115,8 @@ typedef struct buftag
* Note: buf_hdr_lock must be held to examine or change the tag, flags,
* usage_count, refcount, or wait_backend_pid fields. buf_id field never
* changes after initialization, so does not need locking. freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock. The LWLocks can take
- * care of themselves. The buf_hdr_lock is *not* used to control access to
- * the data in the buffer!
+ * protected by the freelist_lck not buf_hdr_lock. The buf_hdr_lock is
+ * *not* used to control access to the data in the buffer!
*
* An exception is that if we have the buffer pinned, its tag can't change
* underneath us, so we can examine the tag without locking the spinlock.
@@ -185,14 +184,18 @@ extern BufferDesc *LocalBufferDescriptors;
*/
/* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
- bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
extern void StrategyFreeBuffer(volatile BufferDesc *buf);
+extern bool StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf);
extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
volatile BufferDesc *buf);
extern int StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc);
+extern void StrategySyncStartAndEnd(uint32 *start, uint32 *end,
+ uint32 *num_buf_alloc);
+extern void StrategySyncNextVictimBuffer(volatile uint32 *next_victim_buffer);
extern void StrategyNotifyBgWriter(Latch *bgwriterLatch);
+extern void StrategyInitBgReclaimerLatch(Latch *bgreclaimerLatch);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 89447d0..edb9c52 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -219,6 +219,7 @@ extern void AbortBufferIO(void);
extern void BufmgrCommit(void);
extern bool BgBufferSync(void);
+extern void BgMoveBuffersToFreelist(void);
extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 1d90b9f..46f6aeb 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -89,45 +89,44 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
* if you remove a lock, consider leaving a gap in the numbering sequence for
* the benefit of DTrace and other external debugging scripts.
*/
-#define BufFreelistLock (&MainLWLockArray[0].lock)
-#define ShmemIndexLock (&MainLWLockArray[1].lock)
-#define OidGenLock (&MainLWLockArray[2].lock)
-#define XidGenLock (&MainLWLockArray[3].lock)
-#define ProcArrayLock (&MainLWLockArray[4].lock)
-#define SInvalReadLock (&MainLWLockArray[5].lock)
-#define SInvalWriteLock (&MainLWLockArray[6].lock)
-#define WALBufMappingLock (&MainLWLockArray[7].lock)
-#define WALWriteLock (&MainLWLockArray[8].lock)
-#define ControlFileLock (&MainLWLockArray[9].lock)
-#define CheckpointLock (&MainLWLockArray[10].lock)
-#define CLogControlLock (&MainLWLockArray[11].lock)
-#define SubtransControlLock (&MainLWLockArray[12].lock)
-#define MultiXactGenLock (&MainLWLockArray[13].lock)
-#define MultiXactOffsetControlLock (&MainLWLockArray[14].lock)
-#define MultiXactMemberControlLock (&MainLWLockArray[15].lock)
-#define RelCacheInitLock (&MainLWLockArray[16].lock)
-#define CheckpointerCommLock (&MainLWLockArray[17].lock)
-#define TwoPhaseStateLock (&MainLWLockArray[18].lock)
-#define TablespaceCreateLock (&MainLWLockArray[19].lock)
-#define BtreeVacuumLock (&MainLWLockArray[20].lock)
-#define AddinShmemInitLock (&MainLWLockArray[21].lock)
-#define AutovacuumLock (&MainLWLockArray[22].lock)
-#define AutovacuumScheduleLock (&MainLWLockArray[23].lock)
-#define SyncScanLock (&MainLWLockArray[24].lock)
-#define RelationMappingLock (&MainLWLockArray[25].lock)
-#define AsyncCtlLock (&MainLWLockArray[26].lock)
-#define AsyncQueueLock (&MainLWLockArray[27].lock)
-#define SerializableXactHashLock (&MainLWLockArray[28].lock)
-#define SerializableFinishedListLock (&MainLWLockArray[29].lock)
-#define SerializablePredicateLockListLock (&MainLWLockArray[30].lock)
-#define OldSerXidLock (&MainLWLockArray[31].lock)
-#define SyncRepLock (&MainLWLockArray[32].lock)
-#define BackgroundWorkerLock (&MainLWLockArray[33].lock)
-#define DynamicSharedMemoryControlLock (&MainLWLockArray[34].lock)
-#define AutoFileLock (&MainLWLockArray[35].lock)
-#define ReplicationSlotAllocationLock (&MainLWLockArray[36].lock)
-#define ReplicationSlotControlLock (&MainLWLockArray[37].lock)
-#define NUM_INDIVIDUAL_LWLOCKS 38
+#define ShmemIndexLock (&MainLWLockArray[0].lock)
+#define OidGenLock (&MainLWLockArray[1].lock)
+#define XidGenLock (&MainLWLockArray[2].lock)
+#define ProcArrayLock (&MainLWLockArray[3].lock)
+#define SInvalReadLock (&MainLWLockArray[4].lock)
+#define SInvalWriteLock (&MainLWLockArray[5].lock)
+#define WALBufMappingLock (&MainLWLockArray[6].lock)
+#define WALWriteLock (&MainLWLockArray[7].lock)
+#define ControlFileLock (&MainLWLockArray[8].lock)
+#define CheckpointLock (&MainLWLockArray[9].lock)
+#define CLogControlLock (&MainLWLockArray[10].lock)
+#define SubtransControlLock (&MainLWLockArray[11].lock)
+#define MultiXactGenLock (&MainLWLockArray[12].lock)
+#define MultiXactOffsetControlLock (&MainLWLockArray[13].lock)
+#define MultiXactMemberControlLock (&MainLWLockArray[14].lock)
+#define RelCacheInitLock (&MainLWLockArray[15].lock)
+#define CheckpointerCommLock (&MainLWLockArray[16].lock)
+#define TwoPhaseStateLock (&MainLWLockArray[17].lock)
+#define TablespaceCreateLock (&MainLWLockArray[18].lock)
+#define BtreeVacuumLock (&MainLWLockArray[19].lock)
+#define AddinShmemInitLock (&MainLWLockArray[20].lock)
+#define AutovacuumLock (&MainLWLockArray[21].lock)
+#define AutovacuumScheduleLock (&MainLWLockArray[22].lock)
+#define SyncScanLock (&MainLWLockArray[23].lock)
+#define RelationMappingLock (&MainLWLockArray[24].lock)
+#define AsyncCtlLock (&MainLWLockArray[25].lock)
+#define AsyncQueueLock (&MainLWLockArray[26].lock)
+#define SerializableXactHashLock (&MainLWLockArray[27].lock)
+#define SerializableFinishedListLock (&MainLWLockArray[28].lock)
+#define SerializablePredicateLockListLock (&MainLWLockArray[29].lock)
+#define OldSerXidLock (&MainLWLockArray[30].lock)
+#define SyncRepLock (&MainLWLockArray[31].lock)
+#define BackgroundWorkerLock (&MainLWLockArray[32].lock)
+#define DynamicSharedMemoryControlLock (&MainLWLockArray[33].lock)
+#define AutoFileLock (&MainLWLockArray[34].lock)
+#define ReplicationSlotAllocationLock (&MainLWLockArray[35].lock)
+#define ReplicationSlotControlLock (&MainLWLockArray[36].lock)
+#define NUM_INDIVIDUAL_LWLOCKS 37
/*
* It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
@@ -136,7 +135,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
*/
/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 16
+#define NUM_BUFFER_PARTITIONS 128
/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS 4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..b0688a8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -215,11 +215,12 @@ extern PGPROC *PreparedXactProcs;
* We set aside some extra PGPROC structures for auxiliary processes,
* ie things that aren't full-fledged backends but need shmem access.
*
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, background reclaimer, checkpointer and WAL writer run
+ * during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
*/
-#define NUM_AUXILIARY_PROCS 4
+#define NUM_AUXILIARY_PROCS 5
/* configurable options */
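
To make the freelist maintenance protocol easier to follow without reading
the whole patch, here is a toy, single-threaded sketch (not part of the
patch) of the two-threshold scheme described in the README changes above.
The real structures, spinlocks and latch machinery are replaced by plain
variables; the threshold values mirror the clamped defaults chosen in
StrategyInitialize for a small NBuffers.

/*
 * Illustrative sketch only: a single-threaded toy of the two-threshold
 * freelist maintenance.  Real synchronization (freelist_lck,
 * victimbuf_lck, latches) is intentionally omitted.
 */
#include <stdio.h>
#include <stdbool.h>

#define NBUFFERS        100
#define HIGH_THRESHOLD  5	/* 0.5% of NBUFFERS, clamped to at least 5 */
#define LOW_THRESHOLD   1	/* 20% of the high threshold */

typedef struct
{
	int		usage_count;
	bool	pinned;
	bool	on_freelist;
} ToyBuffer;

static ToyBuffer buffers[NBUFFERS];
static int	freelist_count = 0;
static int	next_victim = 0;
static bool reclaimer_wakeup = false;	/* stands in for the bgreclaimer latch */

/* bgreclaimer side: run the clock sweep until the freelist is refilled. */
static void
toy_reclaim(void)
{
	while (freelist_count < HIGH_THRESHOLD)
	{
		ToyBuffer  *buf = &buffers[next_victim];

		next_victim = (next_victim + 1) % NBUFFERS;
		if (buf->pinned || buf->on_freelist)
			continue;
		if (buf->usage_count > 0)
			buf->usage_count--;		/* age the buffer; retry on a later pass */
		else
		{
			buf->on_freelist = true;	/* unpinned, zero usage: reclaim it */
			freelist_count++;
		}
	}
	reclaimer_wakeup = false;
}

/* backend side: take a buffer and wake the reclaimer when running low. */
static int
toy_alloc(void)
{
	int			i;

	for (i = 0; i < NBUFFERS; i++)
	{
		if (buffers[i].on_freelist)
		{
			buffers[i].on_freelist = false;
			buffers[i].pinned = true;
			freelist_count--;
			if (freelist_count < LOW_THRESHOLD)
				reclaimer_wakeup = true;	/* SetLatch() in the real patch */
			return i;
		}
	}
	return -1;					/* would fall back to the clock sweep */
}

int
main(void)
{
	int			i;

	toy_reclaim();				/* initial fill up to the high threshold */
	for (i = 0; i < 8; i++)
	{
		int			id = toy_alloc();

		printf("allocated buffer %d, freelist now %d\n", id, freelist_count);
		if (reclaimer_wakeup)
			toy_reclaim();
	}
	return 0;
}

Running this prints the freelist count after each allocation; the refill
kicks in as soon as the count drops below the low threshold, which is the
behaviour the patch aims for with bgreclaimer.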
[Attachment: perf_read_scalability_data_v5.ods (application/vnd.oasis.opendocument.spreadsheet)]