[Patch][WiP] Tweaked LRU for shared buffers
Hi, hackers!
We ran an educational project at the Sirius edu center (Sochi, Russia) with mentors from Yandex. A group of five students worked on improving the shared buffers eviction algorithm: Andrey Chausov, Yuriy Skakovskiy, Ivan Edigaryev, Arslan Gumerov, and Daria Filipetskaya. I was the mentor of the group. For two weeks we looked into known caching algorithms and tried to adapt some of them to the PostgreSQL codebase.
While many algorithms appeared too complex to be hacked together in two weeks, we decided to implement and test a working version of a tweaked LRU eviction algorithm.
===How it works===
Most of the buffers are organized into a linked list. Newly admitted pages jump into the queue at its 5/8 point. The last 1/8 of the queue is governed by a clock sweep algorithm to improve concurrency.
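To make the list mechanics concrete, here is a minimal, self-contained sketch of the midpoint splice (purely illustrative: Node, bump_to_separator() and the tiny NBUFFERS are invented for this example; the patch itself works on BufferDesc entries under buffer_strategy_lock):

#include <stdio.h>

#define NBUFFERS 8
#define NO_NEIGHBOUR (-1)

typedef struct
{
	int		id;
	int		next;			/* index of next node, or NO_NEIGHBOUR */
	int		prev;			/* index of previous node, or NO_NEIGHBOUR */
} Node;

static Node nodes[NBUFFERS];
static int	head;			/* most recently used end of the queue */
static int	sep;			/* node currently at the 5/8 position */

/* Unlink node b and re-insert it just before the separating node. */
static void
bump_to_separator(int b)
{
	if (b == sep || nodes[b].prev == sep)
		return;				/* already at (or right behind) the midpoint */

	/* unlink from the current position */
	if (nodes[b].prev != NO_NEIGHBOUR)
		nodes[nodes[b].prev].next = nodes[b].next;
	else
		head = nodes[b].next;
	if (nodes[b].next != NO_NEIGHBOUR)
		nodes[nodes[b].next].prev = nodes[b].prev;

	/* splice in just before the separator */
	nodes[b].prev = nodes[sep].prev;
	nodes[b].next = sep;
	if (nodes[sep].prev != NO_NEIGHBOUR)
		nodes[nodes[sep].prev].next = b;
	else
		head = b;
	nodes[sep].prev = b;

	sep = b;				/* the bumped node becomes the new 5/8 mark,
							 * as in the patch */
}

int
main(void)
{
	/* initially the logical order matches the physical order */
	for (int i = 0; i < NBUFFERS; i++)
	{
		nodes[i].id = i;
		nodes[i].prev = i - 1;
		nodes[i].next = (i + 1 < NBUFFERS) ? i + 1 : NO_NEIGHBOUR;
	}
	head = 0;
	sep = NBUFFERS * 5 / 8;

	bump_to_separator(7);	/* a hit on buffer 7 moves it to the 5/8 point */

	for (int i = head; i != NO_NEIGHBOUR; i = nodes[i].next)
		printf("%d ", nodes[i].id);
	printf("\n");			/* prints: 0 1 2 3 4 7 5 6 */
	return 0;
}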
===So how we tested the patch===
We used sampling on 4 Yandex.Cloud compute instances with 16 vCPU cores, 8GB of RAM, and a 200GB database, in 30-minute YCSB-like runs with a Zipf distribution. We found that on a read-only workload our algorithm shows a consistent improvement over the current master branch. On read-write workloads we haven't found performance improvements yet; there was too much noise from checkpoints and the bgwriter (more on this in the TODO section).
Charts are here: [0], [1]
We used this config: [2]
===TODO===
We have taken up some ideas expressed by Ben Manes on the pgsql-hackers list, but we could not implement all of them within the time of the program. For example, we tried to make LRU bumps less write-contentious by storing them in a circular buffer, but this feature was not stable enough.
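For the record, the shape of that idea was roughly as follows (a hedged sketch, not the unstable code itself: BumpRing, RecordBump() and DrainBumps() are illustrative names, and RemoveBufferOnStartLocked() is an assumed variant of the patch's RemoveBufferOnStart() that expects the caller to already hold the lock):

#define BUMP_RING_SIZE 64

typedef struct
{
	pg_atomic_uint32 head;				/* next slot to fill */
	uint32		tail;					/* next slot to drain, only
										 * advanced under the lock */
	int			slots[BUMP_RING_SIZE];	/* buffer ids awaiting a bump */
} BumpRing;

/*
 * Replay the recorded bumps in one batch under the strategy lock.
 * Races with concurrent producers can drop an occasional bump; that
 * is harmless, since a bump is only a usage hint.
 */
static void
DrainBumps(BumpRing *ring)
{
	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
	while (ring->tail != pg_atomic_read_u32(&ring->head))
	{
		int		buf_id = ring->slots[ring->tail++ % BUMP_RING_SIZE];

		/* the same splice RemoveBufferOnStart() does, minus the locking */
		RemoveBufferOnStartLocked(GetBufferDescriptor(buf_id));
	}
	SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}

/* Called on a buffer hit instead of taking the strategy lock directly. */
static void
RecordBump(BumpRing *ring, int buf_id)
{
	uint32		pos = pg_atomic_fetch_add_u32(&ring->head, 1);

	ring->slots[pos % BUMP_RING_SIZE] = buf_id;

	/* batch the expensive list manipulation */
	if (pos - ring->tail >= BUMP_RING_SIZE / 2)
		DrainBumps(ring);
}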
The patch in its current form also requires improvements. We still need to reduce the overall number of locks (buffering the bumps, described above, was one attempt at this, but not the only one). The "clock sweep" algorithm over the last 1/8 of the logical sequence should be improved too (see ClockSweepTickAtTheEnd() and its call sites).
Unfortunately, we didn't have enough time to bring CAR and W-TinyLFU to a testing-ready state.
We have a working implementation of a frequency sketch [3] to make the transition between the admission cycle and the LRU more concise with a TinyLFU filter. Most probably, work in this direction will be continued.
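For readers who do not want to dig through the commit, the core of such a sketch is small. The following is a standalone approximation (our illustration here, not the code from [3]): a count-min sketch with 4-bit saturating counters plus the periodic halving ("aging") step that gives TinyLFU its recency bias.

#include <stdint.h>

#define SKETCH_WIDTH 4096		/* counters per row; must be a power of two */
#define SKETCH_DEPTH 4

typedef struct
{
	uint8_t		counts[SKETCH_DEPTH][SKETCH_WIDTH];
	uint64_t	additions;		/* drives the periodic aging step */
} FreqSketch;

/* cheap per-row mixing of a 64-bit key (e.g. a hashed buffer tag) */
static uint32_t
row_index(uint64_t key, int row)
{
	uint64_t	h = key * 0x9E3779B97F4A7C15ULL + row * 0xBF58476D1CE4E5B9ULL;

	h ^= h >> 32;
	return (uint32_t) (h & (SKETCH_WIDTH - 1));
}

static void
sketch_increment(FreqSketch *s, uint64_t key)
{
	for (int row = 0; row < SKETCH_DEPTH; row++)
	{
		uint8_t    *c = &s->counts[row][row_index(key, row)];

		if (*c < 15)			/* saturate at 4 bits, as in TinyLFU */
			(*c)++;
	}

	/* age: halve every counter once the sample window fills up */
	if (++s->additions % (SKETCH_WIDTH * 10) == 0)
		for (int row = 0; row < SKETCH_DEPTH; row++)
			for (int i = 0; i < SKETCH_WIDTH; i++)
				s->counts[row][i] >>= 1;
}

static uint8_t
sketch_estimate(const FreqSketch *s, uint64_t key)
{
	uint8_t		min = 15;

	/* the estimate is the minimum over all rows */
	for (int row = 0; row < SKETCH_DEPTH; row++)
	{
		uint8_t		c = s->counts[row][row_index(key, row)];

		if (c < min)
			min = c;
	}
	return min;
}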
Also, the current patch does not change bgwriter behavior: with the knowledge available from the LRU, we can predict that some pages will not be changed in the near future. This information should be used to schedule the background writes better.
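A sketch of what we have in mind, as it could look inside bufmgr.c (illustrative only: BgWriterSyncLruTail() does not exist; SyncOneBuffer() and WritebackContextInit() are the existing bufmgr.c routines, and the list fields come from this patch):

static void
BgWriterSyncLruTail(int budget)
{
	int			buf_id;
	WritebackContext wb_context;

	WritebackContextInit(&wb_context, &bgwriter_flush_after);

	SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
	buf_id = StrategyControl->lastBufferLogical;
	SpinLockRelease(&StrategyControl->buffer_strategy_lock);

	/* walk from the eviction end of the logical list towards the middle */
	while (budget-- > 0 && buf_id != NO_LOGICAL_NEIGHBOUR)
	{
		BufferDesc *buf = GetBufferDescriptor(buf_id);

		/* writes the buffer out if it is dirty and not pinned */
		SyncOneBuffer(buf_id, false, &wb_context);

		buf_id = buf->id_of_prev;	/* unlocked read: a stale link only
									 * mis-targets a hint write */
	}
}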
We also think that, with a proper eviction algorithm, shared buffers should be used instead of the specialized buffer rings.
We will be happy to hear your feedback on our work! Thank you :)
[0]: LRU TPS https://yadi.sk/i/wNNmNw3id22nMQ
[1]: LRU hitrate https://yadi.sk/i/l1o710C3IX6BiA
[2]: Benchmark config https://yadi.sk/d/L-PIVq--ujw6NA
[3]: Frequency sketch code https://github.com/heyfaraday/postgres/commit/a684a139a72cd50dd0f9d031a8aa77f998607cf1
With best regards, almost serious cache group.
Attachments:
0001-Tweaked-LRU-for-shared-buffer-management.patch (application/octet-stream)
From 01087e0724c4184f21ad3b369b73a761f0b561a0 Mon Sep 17 00:00:00 2001
From: Skakovsky Yury <geymer_98@mail.ru>
Date: Wed, 6 Feb 2019 12:26:03 +0300
Subject: [PATCH] Tweaked LRU for shared buffer management
---
src/backend/storage/buffer/buf_init.c | 8 +
src/backend/storage/buffer/bufmgr.c | 4 +
src/backend/storage/buffer/freelist.c | 389 +++++++++++++++++++++-----
src/include/storage/buf_internals.h | 13 +
4 files changed, 350 insertions(+), 64 deletions(-)
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index b0ee3a26d6..7deb6bce66 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -126,6 +126,14 @@ InitBufferPool(void)
buf->buf_id = i;
+ buf->id_of_prev = i - 1;
+ buf->id_of_next = i + 1;
+ if (buf->id_of_next == NBuffers)
+ buf->id_of_next = NO_LOGICAL_NEIGHBOUR;
+
+ buf->beforeMid = (buf->buf_id < NBuffers * 5 / 8);
+ buf->inLiveZone = (buf->buf_id < NBuffers * 7 / 8);
+
/*
* Initially link all the buffers together as unused. Subsequent
* management of this list is done by freelist.c.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..de1a9f3037 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1051,6 +1051,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*foundPtr = false;
}
}
+ else
+ {
+ RemoveBufferOnStart(buf);
+ }
return buf;
}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 03caceaf7b..189bd34321 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -46,6 +46,35 @@ typedef struct
* when the list is empty)
*/
+ /*
+ * Index of 5/8-position buffer
+ */
+ int separatingBufferLogical;
+ /*
+ * Index of 7/8-position buffer
+ */
+ int leaderOfDeathZoneBufferLogical;
+ /*
+ * Index of 0/8-position buffer
+ */
+ int firstBufferLogical;
+ /*
+ * Index of last buffer
+ */
+ int lastBufferLogical;
+
+ /*
+ * NOTE: there is an invariant that id_of_next of
+ * lastBufferLogical == -1 == NO_LOGICAL_NEIGHBOUR.
+ * The same holds for id_of_prev of firstBufferLogical.
+ */
+
+ /*
+ * Index of the current buffer for the "clock sweep" algorithm over the
+ * last 1/8. This is a real index (not a running counter), because
+ * ClockSweepTickAtTheEnd() checks and wraps it itself.
+ */
+ pg_atomic_uint32 curVictim;
+
/*
* Statistics. These counters should be wide enough that they can't
* overflow during a single bgwriter cycle.
@@ -103,69 +132,248 @@ static BufferDesc *GetBufferFromRing(BufferAccessStrategy strategy,
static void AddBufferToRing(BufferAccessStrategy strategy,
BufferDesc *buf);
-/*
- * ClockSweepTick - Helper routine for StrategyGetBuffer()
- *
- * Move the clock hand one buffer ahead of its current position and return the
- * id of the buffer now under the hand.
- */
+/* Advance curVictim and return it */
static inline uint32
-ClockSweepTick(void)
+ClockSweepTickAtTheEnd(void)
{
- uint32 victim;
+ uint32 oldVictim;
+ uint32 newVictim;
+ bool success = false;
+
+ while (!success) {
+ oldVictim = pg_atomic_read_u32(&StrategyControl->curVictim);
+ newVictim = GetBufferDescriptor(oldVictim)->id_of_next;
+ if (newVictim == NO_LOGICAL_NEIGHBOUR) {
+ newVictim = StrategyControl->leaderOfDeathZoneBufferLogical;
+ }
+ success = pg_atomic_compare_exchange_u32(&StrategyControl->curVictim, &oldVictim, newVictim);
+ }
+ return newVictim;
+}
+/*
+ * Change the logical indices so that buf->buf_id becomes
+ * StrategyControl->firstBufferLogical, except in the cases
+ * handled below.
+ *
+ * StrategyControl->buffer_strategy_lock is held throughout.
+ */
+void
+RemoveBufferOnStart(BufferDesc* buf) {
+ BufferDesc* buf_next;
+ BufferDesc* buf_prev;
+ BufferDesc* currentMaster;
+ BufferDesc* currentLeaderOfDeathZone;
+ BufferDesc* currentSeparatingBuffer;
+
+ SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+ /*
+ * buf->id_of_prev == NO_LOGICAL_NEIGHBOUR:
+ * this is the case when buf is already the first buffer.
+ *
+ * buf->id_of_prev == StrategyControl->firstBufferLogical:
+ * treating this as a no-op avoids extra code for a one-position
+ * move (the gain would not justify the complexity).
+ */
+ if (
+ buf->id_of_prev == NO_LOGICAL_NEIGHBOUR
+ || buf->id_of_prev == StrategyControl->firstBufferLogical
+ )
+ {
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ return;
+ }
+
+ /*
+ * Initialization of all the necessary variables
+ */
+ currentLeaderOfDeathZone = GetBufferDescriptor(StrategyControl->leaderOfDeathZoneBufferLogical);
+ currentSeparatingBuffer = GetBufferDescriptor(StrategyControl->separatingBufferLogical);
+ currentMaster = GetBufferDescriptor(StrategyControl->firstBufferLogical);
+
/*
- * Atomically move hand ahead one buffer - if there's several processes
- * doing this, this can lead to buffers being returned slightly out of
- * apparent order.
+ * We must handle this case to avoid a segfault.
*/
- victim =
- pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
+ if (buf->id_of_next >= 0)
+ buf_next = GetBufferDescriptor(buf->id_of_next);
+ else
+ buf_next = NULL;
+ buf_prev = GetBufferDescriptor(buf->id_of_prev);
- if (victim >= NBuffers)
+ /*
+ * This check seems redundant, but this version proved the most
+ * stable, so we leave it in place just in case.
+ */
+ if (currentMaster == buf || currentMaster == buf_prev)
+ {
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ return;
+ }
+
+ /*
+ * update links at neighbors
+ */
+ if (buf_next != NULL)
+ {
+ buf_prev->id_of_next = buf->id_of_next;
+ buf_next->id_of_prev = buf->id_of_prev;
+ }
+ else
{
- uint32 originalVictim = victim;
+ buf_prev->id_of_next = NO_LOGICAL_NEIGHBOUR;
+ StrategyControl->lastBufferLogical = buf_prev->buf_id;
+ }
+
+ /*
+ * update other links
+ */
+ buf->id_of_prev = NO_LOGICAL_NEIGHBOUR;
+ buf->id_of_next = StrategyControl->firstBufferLogical;
+ currentMaster->id_of_prev = buf->buf_id;
- /* always wrap what we look up in BufferDescriptors */
- victim = victim % NBuffers;
+ /*
+ * update firstBufferLogical
+ */
+ StrategyControl->firstBufferLogical = buf->buf_id;
- /*
- * If we're the one that just caused a wraparound, force
- * completePasses to be incremented while holding the spinlock. We
- * need the spinlock so StrategySyncStart() can return a consistent
- * value consisting of nextVictimBuffer and completePasses.
- */
- if (victim == 0)
+ /*
+ * update boolean field 'inLiveZone' and leaderOfDeathZoneBufferLogical
+ * if necessary, i.e. if buf was situated in the last 1/8 of the
+ * logical sequence.
+ */
+ if (!buf->inLiveZone) {
+ buf->inLiveZone = true;
+ buf->beforeMid = true;
+
+ if (currentLeaderOfDeathZone == buf)
{
- uint32 expected;
- uint32 wrapped;
- bool success = false;
-
- expected = originalVictim + 1;
+ StrategyControl->leaderOfDeathZoneBufferLogical = buf_prev->buf_id;
+ buf_prev->inLiveZone = false;
+ }
+ else
+ {
+ StrategyControl->leaderOfDeathZoneBufferLogical = currentLeaderOfDeathZone->id_of_prev;
+ GetBufferDescriptor(StrategyControl->leaderOfDeathZoneBufferLogical)->inLiveZone = false;
+ }
+ }
- while (!success)
- {
- /*
- * Acquire the spinlock while increasing completePasses. That
- * allows other readers to read nextVictimBuffer and
- * completePasses in a consistent manner which is required for
- * StrategySyncStart(). In theory delaying the increment
- * could lead to an overflow of nextVictimBuffers, but that's
- * highly unlikely and wouldn't be particularly harmful.
- */
- SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ /*
+ * Similarly for the 5/8 position.
+ * Note that the (!buf->inLiveZone) branch above has already
+ * handled the (!buf->beforeMid) part for death-zone buffers.
+ */
+ if (!buf->beforeMid) {
+ buf->beforeMid = true;
+
+ if (currentSeparatingBuffer == buf)
+ {
+ StrategyControl->separatingBufferLogical = buf_prev->buf_id;
+ buf_prev->beforeMid = false;
+ }
+ else
+ {
+ StrategyControl->separatingBufferLogical = currentSeparatingBuffer->id_of_prev;
+ GetBufferDescriptor(StrategyControl->separatingBufferLogical)->beforeMid = false;
+ }
+ }
+
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+}
- wrapped = expected % NBuffers;
+/*
+ * All of this code is similar to RemoveBufferOnStart().
+ *
+ * P.S. It is copy-paste and should be merged into one function.
+ * P.P.S. We will lose some performance after the merge.
+ *
+ * TODO
+ */
+void
+RemoveBufferOnSeparatingPosition(BufferDesc* buf) {
+ BufferDesc* buf_next;
+ BufferDesc* buf_prev;
+ BufferDesc* currentMaster;
+ BufferDesc* master_prev;
+ BufferDesc* currentLeaderOfDeathZone;
+
+ SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+
+ if (
+ buf->buf_id == StrategyControl->separatingBufferLogical
+ || buf->id_of_prev == StrategyControl->separatingBufferLogical
+ || buf->beforeMid
+ )
+ {
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ return;
+ }
+
+ /*
+ * Init
+ */
+ currentLeaderOfDeathZone = GetBufferDescriptor(StrategyControl->leaderOfDeathZoneBufferLogical);
+ currentMaster = GetBufferDescriptor(StrategyControl->separatingBufferLogical);
+
+ if (buf->id_of_next >= 0)
+ buf_next = GetBufferDescriptor(buf->id_of_next);
+ else
+ buf_next = NULL;
+
+ buf_prev = GetBufferDescriptor(buf->id_of_prev);
+
+ master_prev = GetBufferDescriptor(currentMaster->id_of_prev);
- success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
- &expected, wrapped);
- if (success)
- StrategyControl->completePasses++;
- SpinLockRelease(&StrategyControl->buffer_strategy_lock);
- }
- }
+ /*
+ * It is more efficient to skip the bump when it would move the
+ * buffer by only one position. Moreover, the code is clearer this way.
+ */
+ if (currentMaster == buf_prev)
+ {
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+ return;
+ }
+
+ /*
+ * We do this here because none of the subsequent operations
+ * touch currentLeaderOfDeathZone.
+ */
+ if (currentLeaderOfDeathZone == buf) {
+ buf->inLiveZone = true;
+ buf_prev->inLiveZone = false;
+
+ StrategyControl->leaderOfDeathZoneBufferLogical = currentLeaderOfDeathZone->id_of_prev;
+ }
+
+ if (buf->id_of_next >= 0)
+ {
+ buf_next->id_of_prev = buf_prev->buf_id;
+ buf_prev->id_of_next = buf_next->buf_id;
}
- return victim;
+ else
+ {
+ buf_prev->id_of_next = NO_LOGICAL_NEIGHBOUR;
+
+ StrategyControl->lastBufferLogical = buf_prev->buf_id;
+ }
+
+ buf->id_of_next = StrategyControl->separatingBufferLogical;
+ buf->id_of_prev = master_prev->buf_id;
+
+ currentMaster->id_of_prev = buf->buf_id;
+ master_prev->id_of_next = buf->buf_id;
+
+ StrategyControl->separatingBufferLogical = buf->buf_id;
+
+ /*
+ * Only the case buf != currentLeaderOfDeathZone remains here;
+ * the other case was handled above.
+ */
+ if (!buf->inLiveZone)
+ {
+ GetBufferDescriptor(currentLeaderOfDeathZone->id_of_prev)->inLiveZone = false;
+ StrategyControl->leaderOfDeathZoneBufferLogical = currentLeaderOfDeathZone->id_of_prev;
+ }
+
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
/*
@@ -204,7 +412,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
int bgwprocno;
int trycounter;
uint32 local_buf_state; /* to avoid repeated (de-)referencing */
-
+ int victimCandidate;
+ BufferDesc *tempBuf;
+
/*
* If given a strategy object, see whether it can select a buffer. We
* assume strategy objects don't need buffer_strategy_lock.
@@ -284,7 +494,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
/* Unconditionally remove buffer from freelist */
StrategyControl->firstFreeBuffer = buf->freeNext;
buf->freeNext = FREENEXT_NOT_IN_LIST;
-
+
+ StrategyControl->lastBufferLogical = buf->buf_id;
+
/*
* Release the lock so someone else can access the freelist while
* we check out this buffer.
@@ -305,6 +517,13 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
+ /*
+ * There is a case for bumping the buffer here too,
+ * but we did not have time to test it.
+ * TODO
+ */
+
return buf;
}
UnlockBufHdr(buf, local_buf_state);
@@ -312,11 +531,14 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
}
}
- /* Nothing on the freelist, so run the "clock sweep" algorithm */
+ /* Nothing on the freelist, so run the "LRU5/8" algorithm */
trycounter = NBuffers;
+ SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ victimCandidate = StrategyControl->lastBufferLogical;
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
for (;;)
- {
- buf = GetBufferDescriptor(ClockSweepTick());
+ {
+ buf = GetBufferDescriptor(victimCandidate);
/*
* If the buffer is pinned or has a nonzero usage_count, we cannot use
@@ -326,11 +548,19 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
{
+ /*
+ * usage_count is of little use now that "clock sweep" has been replaced.
+ * TODO
+ */
if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
{
local_buf_state -= BUF_USAGECOUNT_ONE;
trycounter = NBuffers;
+
+ SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ victimCandidate = StrategyControl->lastBufferLogical;
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
else
{
@@ -338,20 +568,38 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
if (strategy != NULL)
AddBufferToRing(strategy, buf);
*buf_state = local_buf_state;
+
+ RemoveBufferOnSeparatingPosition(buf);
+
return buf;
}
}
- else if (--trycounter == 0)
+ else
{
+
+ if (--trycounter == 0)
+ {
+ /*
+ * We've scanned all the buffers without making any state changes,
+ * so all the buffers are pinned (or were when we looked at them).
+ * We could hope that someone will free one eventually, but it's
+ * probably better to fail than to risk getting stuck in an
+ * infinite loop.
+ */
+ UnlockBufHdr(buf, local_buf_state);
+
+ elog(ERROR, "no unpinned buffers available");
+ }
/*
- * We've scanned all the buffers without making any state changes,
- * so all the buffers are pinned (or were when we looked at them).
- * We could hope that someone will free one eventually, but it's
- * probably better to fail than to risk getting stuck in an
- * infinite loop.
- */
- UnlockBufHdr(buf, local_buf_state);
- elog(ERROR, "no unpinned buffers available");
+ * Ideally no lock is needed here, since we use atomic variables.
+ * TODO
+ * The "% NBuffers" is also redundant: ClockSweepTickAtTheEnd()
+ * appears to always return a valid buf_id.
+ * TODO
+ */
+ SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
+ victimCandidate = ClockSweepTickAtTheEnd() % NBuffers;
+ SpinLockRelease(&StrategyControl->buffer_strategy_lock);
}
UnlockBufHdr(buf, local_buf_state);
}
@@ -512,9 +760,22 @@ StrategyInitialize(bool init)
StrategyControl->firstFreeBuffer = 0;
StrategyControl->lastFreeBuffer = NBuffers - 1;
+ /*
+ * Initially first/lastBufferLogical coincide with the physical ones.
+ */
+ StrategyControl->firstBufferLogical = 0;
+ StrategyControl->lastBufferLogical = NBuffers - 1;
+
+ /*
+ * Initialization of separating buffers.
+ */
+ StrategyControl->separatingBufferLogical = NBuffers * 5 / 8;
+ StrategyControl->leaderOfDeathZoneBufferLogical = NBuffers * 7 / 8;
+
/* Initialize the clock sweep pointer */
pg_atomic_init_u32(&StrategyControl->nextVictimBuffer, 0);
-
+ pg_atomic_init_u32(&StrategyControl->curVictim, StrategyControl->leaderOfDeathZoneBufferLogical);
+
/* Clear statistics */
StrategyControl->completePasses = 0;
pg_atomic_init_u32(&StrategyControl->numBufferAllocs, 0);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ba1b5463fc..95b392c912 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -186,6 +186,14 @@ typedef struct BufferDesc
int wait_backend_pid; /* backend PID of pin-count waiter */
int freeNext; /* link in freelist chain */
+ /* Indices of next and previous buffers in the logical sequence */
+ int id_of_next;
+ int id_of_prev;
+
+ /* Not including separating buffers */
+ bool beforeMid;
+ bool inLiveZone;
+
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -236,6 +244,11 @@ extern PGDLLIMPORT LWLockMinimallyPadded *BufferIOLWLockArray;
#define FREENEXT_END_OF_LIST (-1)
#define FREENEXT_NOT_IN_LIST (-2)
+/*
+ * id_of_next/id_of_prev is either the index of the neighbouring buffer in the logical sequence, or:
+ */
+#define NO_LOGICAL_NEIGHBOUR (-1)
+
/*
* Functions for acquiring/releasing a shared buffer header's spinlock. Do
* not apply these to local buffers!
--
2.20.1
Hi,
On 2/13/19 3:37 PM, Andrey Borodin wrote:
Hi, hackers!
We ran an educational project at the Sirius edu center (Sochi, Russia)
with mentors from Yandex. A group of five students worked on improving
the shared buffers eviction algorithm: Andrey Chausov, Yuriy
Skakovskiy, Ivan Edigaryev, Arslan Gumerov, and Daria Filipetskaya. I
was the mentor of the group. For two weeks we looked into known caching
algorithms and tried to adapt some of them to the PostgreSQL codebase.

While many algorithms appeared too complex to be hacked together in two
weeks, we decided to implement and test a working version of a tweaked
LRU eviction algorithm.

===How it works===
Most of the buffers are organized into a linked list. Newly admitted
pages jump into the queue at its 5/8 point. The last 1/8 of the queue
is governed by a clock sweep algorithm to improve concurrency.
Interesting. Where do these numbers (5/8 and 1/8) come from?
===So how we tested the patch===
We used sampling on 4 Yandex.Cloud compute instances with 16 vCPU
cores, 8GB of RAM, and a 200GB database, in 30-minute YCSB-like runs
with a Zipf distribution. We found that on a read-only workload our
algorithm shows a consistent improvement over the current master
branch. On read-write workloads we haven't found performance
improvements yet; there was too much noise from checkpoints and the
bgwriter (more on this in the TODO section).
Charts are here: [0], [1]
That TPS chart looks a bit ... wild. How come the master jumps so much
up and down? That's a bit suspicious, IMHO.
How do I reproduce this benchmark? I'm aware of pg_ycsb, but maybe
you've used some other tools?
Also, have you tried some other benchmarks (like, regular TPC-B as
implemented by pgbench, or read-only pgbench)? We need such benchmarks
with a range of access patterns to check for regressions.
BTW what do you mean by "sampling"?
We used this config: [2]
That's only half the information - it doesn't say how many clients were
running the benchmark etc.
===TODO===
We have taken up some ideas expressed by Ben Manes on the pgsql-hackers
list, but we could not implement all of them within the time of the
program. For example, we tried to make LRU bumps less write-contentious
by storing them in a circular buffer, but this feature was not stable
enough.
Can you point us to the thread/email discussing those ideas? I have
tried searching through archives, but I haven't found anything :-(
This message does not really explain the algorithms, and combined with
the absolute lack of comments in the linked commit, it's somewhat
difficult to form an opinion.
The patch in its current form also requires improvements. We still
need to reduce the overall number of locks (buffering the bumps was
one attempt at this, but not the only one). The "clock sweep"
algorithm over the last 1/8 of the logical sequence should be improved
too (see ClockSweepTickAtTheEnd() and its call sites).
OK
Unfortunately, we didn't have enough time to bring CAR and W-TinyLFU
to a testing-ready state.
What is CAR? Did you mean ARC, perhaps?
We have a working implementation of a frequency sketch [3] to make
the transition between the admission cycle and the LRU more concise
with a TinyLFU filter. Most probably, work in this direction will be
continued.
OK
Also, the current patch does not change bgwriter behavior: with the
knowledge available from the LRU, we can predict that some pages will
not be changed in the near future. This information should be used to
schedule the background writes better.
Sounds interesting.
We also think that, with a proper eviction algorithm, shared buffers
should be used instead of the specialized buffer rings.
Are you suggesting to get rid of the buffer rings we use for sequential
scans, for example? IMHO that's going to be tricky, e.g. because of
synchronized sequential scans.
We will be happy to hear your feedback on our work! Thank you :)
[0] LRU TPS https://yadi.sk/i/wNNmNw3id22nMQ
[1] LRU hitrate https://yadi.sk/i/l1o710C3IX6BiA
[2] Benchmark config https://yadi.sk/d/L-PIVq--ujw6NA
[3] Frequency sketch code https://github.com/heyfaraday/postgres/commit/a684a139a72cd50dd0f9d031a8aa77f998607cf1
With best regards, almost serious cache group.
cheers
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 15, 2019 at 3:30 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
That TPS chart looks a bit ... wild. How come the master jumps so much
up and down? That's a bit suspicious, IMHO.
Somebody should write a patch to make buffer eviction completely
random, without aiming to get it committed. That isn't as bad of a
strategy as it sounds, and it would help with assessing improvements
in this area.
We know that the cache replacement algorithm behaves randomly when
there is extreme contention, while also slowing everything down due to
maintaining the clock. An unambiguously better caching algorithm would
at a minimum be able to beat our "cheap random replacement" prototype
as well as the existing clocksweep algorithm in most or all cases.
That seems like a reasonably good starting point, at least.
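For what it's worth, the core loop of such a throwaway patch could be as small as this (a hypothetical sketch of a replacement for the victim-search loop in StrategyGetBuffer(), not an actual patch):

	/* Probe buffers uniformly at random until we hit an unpinned one. */
	for (;;)
	{
		BufferDesc *buf = GetBufferDescriptor(random() % NBuffers);
		uint32		local_buf_state = LockBufHdr(buf);

		if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
		{
			/* as in StrategyGetBuffer(), return with the header lock held */
			*buf_state = local_buf_state;
			return buf;
		}
		UnlockBufHdr(buf, local_buf_state);
	}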
--
Peter Geoghegan
On 2/16/19 12:51 AM, Peter Geoghegan wrote:
On Fri, Feb 15, 2019 at 3:30 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

That TPS chart looks a bit ... wild. How come the master jumps so much
up and down? That's a bit suspicious, IMHO.

Somebody should write a patch to make buffer eviction completely
random, without aiming to get it committed. That isn't as bad of a
strategy as it sounds, and it would help with assessing improvements
in this area.

We know that the cache replacement algorithm behaves randomly when
there is extreme contention, while also slowing everything down due
to maintaining the clock.
Possibly, although I still find it strange that the throughput first
grows, then at shared_buffers 1GB it drops, and then at 3GB it starts
growing again. Considering this is on 200GB data set, I doubt the
pressure/contention is much different with 1GB and 3GB, but maybe it is.
An unambiguously better caching algorithm would at a minimum be able
to beat our "cheap random replacement" prototype as well as the
existing clocksweep algorithm in most or all cases. That seems like a
reasonably good starting point, at least.
Yes, comparison to cheap random replacement would be an interesting
experiment.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
I was not involved with Andrey and his team's work, which looks like a very
promising first pass. I can try to clarify a few minor details.
What is CAR? Did you mean ARC, perhaps?
CAR is the Clock variant of ARC: CAR: Clock with Adaptive Replacement
<https://www.usenix.org/legacy/publications/library/proceedings/fast04/tech/full_papers/bansal/bansal.pdf>
I believe the main interest in ARC is its claim of adaptability with high
hit rates. Unfortunately the reality is less impressive as it fails to
handle many frequency workloads, e.g. poor scan resistance, and the
adaptivity is poor due to the accesses being quite noisy. For W-TinyLFU, we
have recent improvements which show near perfect adaptivity
<https://user-images.githubusercontent.com/378614/52891655-2979e380-3141-11e9-91b3-00002a3cc80b.png>
in our stress case that results in double the hit rate of ARC and is less
than 1% from optimal.
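To spell out the admission idea behind TinyLFU in this thread's terms, here is a sketch (sketch_increment()/sketch_estimate() stand for any count-min-style frequency sketch over hashed page keys, such as the one posted earlier in the thread):

/*
 * On a miss that forces an eviction, the candidate page only displaces
 * the victim if the frequency sketch says it is the more valuable one.
 */
static bool
tinylfu_admit(FreqSketch *s, uint64_t candidate_key, uint64_t victim_key)
{
	sketch_increment(s, candidate_key);	/* record the access itself */

	/* keep the victim on a tie: churn without evidence is pure cost */
	return sketch_estimate(s, candidate_key) > sketch_estimate(s, victim_key);
}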
Can you point us to the thread/email discussing those ideas? I have tried
searching through archives, but I haven't found anything :-(
This thread
</messages/by-id/1526057854777-0.post@n3.nabble.com>
doesn't explain Andrey's work, but includes my minor contribution. The
longer thread discusses the interest in CAR, et al.
Are you suggesting to get rid of the buffer rings we use for sequential
scans, for example? IMHO that's going to be tricky, e.g. because of
synchronized sequential scans.
If you mean "synchronized" in terms of multi-threading and locks, then this
is a solved problem
<http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html>
in terms of caching. My limited understanding is that the buffer rings are
used to protect the cache from being polluted by scans which flush the
LRU-like algorithms. This allows those policies to capture more frequent
items. It also avoids lock contention on the cache due to writes caused by
misses, where Clock allows lock-free reads but uses a global lock on
writes. A smarter cache eviction policy and concurrency model can handle
this without needing buffer rings to compensate.
Somebody should write a patch to make buffer eviction completely random,
without aiming to get it committed. That isn't as bad of a strategy as it
sounds, and it would help with assessing improvements in this area.
A related and helpful patch would be to capture the access log and provide
anonymized traces. We have a simulator
<https://github.com/ben-manes/caffeine/wiki/Simulator> with dozens of
policies to quickly provide a breakdown. That would let you know the hit
rates before deciding on the policy to adopt.
Cheers.
On 2/16/19 1:48 AM, Benjamin Manes wrote:
Hi,
I was not involved with Andrey and his team's work, which looks like a
very promising first pass. I can try to clarify a few minor details.

What is CAR? Did you mean ARC, perhaps?

CAR is the Clock variant of ARC: CAR: Clock with Adaptive Replacement
<https://www.usenix.org/legacy/publications/library/proceedings/fast04/tech/full_papers/bansal/bansal.pdf>
Thanks, will check.
I believe the main interest in ARC is its claim of adaptability with
high hit rates. Unfortunately the reality is less impressive as it fails
to handle many frequency workloads, e.g. poor scan resistance, and the
adaptivity is poor due to the accesses being quite noisy. For W-TinyLFU,
we have recent improvements which show near perfect adaptivity
<https://user-images.githubusercontent.com/378614/52891655-2979e380-3141-11e9-91b3-00002a3cc80b.png> in
our stress case that results in double the hit rate of ARC and is less
than 1% from optimal.
Interesting.
Can you point us to the thread/email discussing those ideas? I have
tried searching through archives, but I haven't found anything :-(

This thread
</messages/by-id/1526057854777-0.post@n3.nabble.com>
doesn't explain Andrey's work, but includes my minor contribution. The
longer thread discusses the interest in CAR, et al.
Thanks.
Are you suggesting to get rid of the buffer rings we use for
sequential scans, for example? IMHO that's going to be tricky, e.g.
because of synchronized sequential scans.

If you mean "synchronized" in terms of multi-threading and locks, then
this is a solved problem
<http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html>
in terms of caching.
No, I "synchronized scans" are an optimization to reduce I/O when
multiple processes do sequential scan on the same table. Say one process
is half-way through the table, when another process starts another scan.
Instead of scanning it from block #0 we start at the position of the
first process (at which point they "synchronize") and then wrap around
to read the beginning.
I was under the impression that this somehow depends on the small
circular buffers, but I've now checked the code and I see that's bogus.
My limited understanding is that the buffer rings are
used to protect the cache from being polluted by scans which flush the
LRU-like algorithms. This allows those policies to capture more frequent
items. It also avoids lock contention on the cache due to writes caused
by misses, where Clock allows lock-free reads but uses a global lock on
writes. A smarter cache eviction policy and concurrency model can handle
this without needing buffer rings to compensate.
Right - that's the purpose of circular buffers.
Somebody should write a patch to make buffer eviction completely
random, without aiming to get it committed. That isn't as bad of a
strategy as it sounds, and it would help with assessing improvements
in this area.

A related and helpful patch would be to capture the access log and
provide anonymized traces. We have a simulator
<https://github.com/ben-manes/caffeine/wiki/Simulator> with dozens of
policies to quickly provide a breakdown. That would let you know the hit
rates before deciding on the policy to adopt.
Interesting. I assume the trace is essentially a log of which blocks
were requested? Is there some trace format specification somewhere?
cheers
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
No, I "synchronized scans" are an optimization to reduce I/O when multiple
processes do sequential scan on the same table.
Oh, very neat. Thanks!
Interesting. I assume the trace is essentially a log of which blocks were
requested? Is there some trace format specification somewhere?
Yes, whatever constitutes a cache key (block address, item hash, etc). We
write parsers for each trace so there isn't a format requirement. The
parsers produce a 64-bit int stream of keys, which are broadcasted to each
policy via an actor framework. The trace files can be text or binary,
optionally compressed (xz preferred). The ARC traces are block I/O and this
is their format description
<https://www.dropbox.com/sh/9ii9sc7spcgzrth/j1CJ72HiWa/Papers/ARCTraces/README.txt>,
so perhaps something similar? That parser is only 5 lines of custom code
<https://github.com/ben-manes/caffeine/blob/b752c586f7bf143f774a51a0a10593ae3b77802b/simulator/src/main/java/com/github/benmanes/caffeine/cache/simulator/parser/arc/ArcTraceReader.java#L36-L42>.
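On the PostgreSQL side, emitting such a key stream could be as simple as the following (a hypothetical sketch: TraceBufferAccess() does not exist anywhere; assume it would be called wherever a page is requested, e.g. from BufferAlloc(), and that TraceFile was opened at backend start):

#include <stdio.h>

static FILE *TraceFile;

static void
TraceBufferAccess(const BufferTag *tag)
{
	/*
	 * Fold the buffer tag (relation, fork, block) into one 64-bit key.
	 * A cheap multiplicative mix is fine here: a rare collision merely
	 * merges two pages in the trace.
	 */
	uint64		key = tag->rnode.relNode;

	key = key * UINT64CONST(0x9E3779B97F4A7C15) + tag->forkNum;
	key = key * UINT64CONST(0x9E3779B97F4A7C15) + tag->blockNum;

	if (TraceFile != NULL)
		fwrite(&key, sizeof(key), 1, TraceFile);
}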
Benjamin> A related and helpful patch would be to capture the access log and
Benjamin> provide anonymized traces.
The traces can be captured via DTrace scripts, so no patch is required here.
For instance:
/messages/by-id/CAB=Je-F_BhGfBu1sO1H7u_XMtvak=BQtuJFyv8cfjGBRp7Q_yA@mail.gmail.com
or
/messages/by-id/CAH2-WzmbUWKvCqjDycpCOSF==PEswVf6WtVutgm9efohH0NfHA@mail.gmail.com
The missing bit is a database with more-or-less relevant workload.
Vladimir
On 2/16/19 10:36 AM, Vladimir Sitnikov wrote:
Benjamin> A related and helpful patch would be to capture the access log and
Benjamin> provide anonymized traces.

The traces can be captured via DTrace scripts, so no patch is required here.

Right. Or BPF on reasonably new Linux kernels.
For instance:
/messages/by-id/CAB=Je-F_BhGfBu1sO1H7u_XMtvak=BQtuJFyv8cfjGBRp7Q_yA@mail.gmail.com
or
/messages/by-id/CAH2-WzmbUWKvCqjDycpCOSF==PEswVf6WtVutgm9efohH0NfHA@mail.gmail.com

The missing bit is a database with a more-or-less relevant workload.
I think it'd be sufficient (or at least reasonable first step) to get
traces from workloads regularly used for benchmarking (different flavors
of pgbench workload, YCSB, TPC-H/TPC-DS and perhaps something else).
A good algorithm has to perform well in those anyway, and applications
generally can be modeled as a mix of those simple workloads.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi there. I was responsible for the benchmarks, and I would be glad to
clarify that part for you.
On Sat, 16 Feb 2019 at 02:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Interesting. Where do these numbers (5/8 and 1/8) come from?
The first number came from MySQL's realization of the LRU algorithm
<https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html>,
and the second from simple tuning: we tried varying the 1/8 a little,
but it didn't change the metrics significantly.
That TPS chart looks a bit ... wild. How come the master jumps so much
up and down? That's a bit suspicious, IMHO.
Yes, it is. It would be great if someone would try to reproduce those results.
How do I reproduce this benchmark? I'm aware of pg_ycsb, but maybe
you've used some other tools?
Yes, we used pg_ycsb, but nothing more than that and pgbench; it's
just maybe too simple. I attach an example of the sh script that was
used to generate the database and measure each point on the chart.
The build was made without any additional debug flags in the
configuration. The database was made by initdb with --data-checksums
enabled and populated by the initialization step of pgbench with
--scale=11000.
Also, have you tried some other benchmarks (like, regular TPC-B as
implemented by pgbench, or read-only pgbench)? We need such benchmarks
with a range of access patterns to check for regressions.
Yes, we tried all the built-in pgbench benchmarks and YCSB-A, B, and C
from pg_ycsb with uniform and zipfian distributions. I also attach some
other charts that we made; they are not as statistically significant as
they could be, because we used shorter runs, but we hope they will help.
BTW what do you mean by "sampling"?
I meant that we measured tps and hit rate on several virtual machines
for both our build and master, in order to cancel out the influence of
differences between the machines.
We used this config: [2]
That's only half the information - it doesn't say how many clients were
running the benchmark etc.
Yes, sorry for omitting that. We used virtual machines with the
configuration mentioned in the initial letter, with 16 jobs and 16
clients in the pgbench configuration.
[0]: Scripts https://yadi.sk/d/PHICP0N6YrN5Cw
[1]: Measurements for other workloads https://yadi.sk/d/6G0e09Drf0ygag
I look forward to any further questions about the measurements or the
code. Please let me know if you have them.
Best regards.
--
Ivan Edigaryev
On 2/17/19 2:14 PM, Едигарьев, Иван Григорьевич wrote:
Hi there. I was responsible for the benchmarks, and I would be glad to
clarify that part for you.

On Sat, 16 Feb 2019 at 02:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Interesting. Where do these numbers (5/8 and 1/8) come from?

The first number came from MySQL's realization of the LRU algorithm
<https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html>,
and the second from simple tuning: we tried varying the 1/8 a little,
but it didn't change the metrics significantly.

That TPS chart looks a bit ... wild. How come the master jumps so much
up and down? That's a bit suspicious, IMHO.

Yes, it is. It would be great if someone would try to reproduce those results.
I'll try.
How do I reproduce this benchmark? I'm aware of pg_ycsb, but maybe
you've used some other tools?

Yes, we used pg_ycsb, but nothing more than that and pgbench; it's
just maybe too simple. I attach an example of the sh script that was
used to generate the database and measure each point on the chart.
OK. I see the measurement script runs a plain select-only test (so
with a uniform distribution). So how did you do the zipfian test?
Also, I see the test is resetting all stats regularly - that may not be
such a good thing, because it also resets the stats used by autovacuum,
for example. It's better to query the values before the run and compute the
delta. OTOH it should affect all tests about the same, I guess.
The build was made without any additional debug flags in the
configuration. The database was made by initdb with --data-checksums
enabled and populated by the initialization step of pgbench with
--scale=11000.
Sounds reasonable.
Also, have you tried some other benchmarks (like, regular TPC-B as
implemented by pgbench, or read-only pgbench)? We need such benchmarks
with a range of access patterns to check for regressions.

Yes, we tried all the built-in pgbench benchmarks and YCSB-A, B, and C
from pg_ycsb with uniform and zipfian distributions. I also attach some
other charts that we made; they are not as statistically significant as
they could be, because we used shorter runs, but we hope they will help.
OK. Notice that the random_zipfian function is quite expensive in terms
of CPU, and it may significantly skew the results particularly with
large data sets / short runs. I've posted a message earlier, as I ran
into the issue while trying to reproduce the behavior.
BTW what do you mean by "sampling"?
I meant that we measured tps and hit rate on several virtual machines
for both our build and master, in order to cancel out the influence of
differences between the machines.
OK
We used this config: [2]
That's only half the information - it doesn't say how many clients were
running the benchmark etc.

Yes, sorry for omitting that. We used virtual machines with the
configuration mentioned in the initial letter, with 16 jobs and 16
clients in the pgbench configuration.

[0] Scripts https://yadi.sk/d/PHICP0N6YrN5Cw
[1] Measurements for other workloads https://yadi.sk/d/6G0e09Drf0ygag

I look forward to any further questions about the measurements or the
code. Please let me know if you have them.
Thanks. I'll try running some tests on my machines, I'll see if I manage
to reproduce the suspicious behavior.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/17/19 2:53 PM, Tomas Vondra wrote:
On 2/17/19 2:14 PM, Едигарьев, Иван Григорьевич wrote:
Hi there. I was responsible for the benchmarks, and I would be glad to
clarify that part for you.

On Sat, 16 Feb 2019 at 02:30, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Interesting. Where do these numbers (5/8 and 1/8) come from?

The first number came from MySQL's realization of the LRU algorithm
<https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html>,
and the second from simple tuning: we tried varying the 1/8 a little,
but it didn't change the metrics significantly.

That TPS chart looks a bit ... wild. How come the master jumps so much
up and down? That's a bit suspicious, IMHO.

Yes, it is. It would be great if someone would try to reproduce those results.
I'll try.
I've tried to reproduce this behavior, and I've done a quite extensive
set of tests on two different (quite different) machines, but so far I
have not observed anything like that. The results are attached, along
with the test scripts used.
I wonder if this might be due to pg_ycsb using random_zipfian, which has
somewhat annoying behavior for some parameters (as I've mentioned in a
separate thread). But that should affect all the runs, not just some
shared_buffers sizes.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
ycsb.ods (application/vnd.oasis.opendocument.spreadsheet)