POC: Cache data in GetSnapshotData()
Hi,
For a while I've pondered whether we couldn't find an easier way than
CSN to make snapshots cheaper, as GetSnapshotData() very frequently is
one of the top profile entries; especially on bigger servers, where the
pretty much guaranteed cache misses are quite visible.
My idea is based on the observation that even in very write-heavy
environments the frequency of relevant PGXACT changes is noticeably
lower than the frequency of GetSnapshotData() calls.
My idea is to simply cache the result of GetSnapshotData() in
shared memory and invalidate it every time something happens that affects
the result. GetSnapshotData() can then do a couple of memcpy() calls to
get the snapshot - which will be significantly faster in a large number
of cases. For one, often enough there are many transactions without an xid
assigned (and thus xip/subxip are small); for another, even if that's
not the case it's linear copies instead of unpredictable random accesses
through PGXACT/PGPROC.
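To make that a bit more concrete, here's a minimal standalone sketch of
the shape of the thing - purely illustrative, with made-up names
(SnapCache, snap_cache_get) rather than the actual PGPROC/PGXACT
machinery; the attached POC is what really implements this under
ProcArrayLock:

/*
 * Standalone sketch of the caching idea (illustrative only; the names
 * here are invented, they are not the POC's names).  Readers copy the
 * cached snapshot with plain memcpy()s while holding the lock shared;
 * anything that changes visibility just clears the valid flag.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_XIP 64

typedef struct SnapCache
{
    pthread_rwlock_t lock;      /* stand-in for ProcArrayLock */
    bool        valid;          /* cleared on every relevant change */
    uint32_t    xmin;
    uint32_t    xmax;
    int         xcnt;
    uint32_t    xip[MAX_XIP];   /* in-progress xids when cached */
} SnapCache;

typedef struct SnapResult
{
    uint32_t    xmin;
    uint32_t    xmax;
    int         xcnt;
    uint32_t    xip[MAX_XIP];
} SnapResult;

/*
 * Fast path: if the cache is valid, the snapshot is a couple of linear
 * copies instead of a walk over per-backend state.  Returns false on a
 * miss, in which case the caller builds the snapshot the slow way and
 * refills the cache.
 */
static bool
snap_cache_get(SnapCache *cache, SnapResult *out)
{
    bool        hit = false;

    pthread_rwlock_rdlock(&cache->lock);
    if (cache->valid)
    {
        out->xmin = cache->xmin;
        out->xmax = cache->xmax;
        out->xcnt = cache->xcnt;
        memcpy(out->xip, cache->xip, sizeof(uint32_t) * cache->xcnt);
        hit = true;
    }
    pthread_rwlock_unlock(&cache->lock);

    return hit;
}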
Now, that idea is pretty handwavy. After talking about it with a couple
of people I've decided to write a quick POC to check whether it's
actually beneficial. That POC isn't anything close to being ready or
complete. I just wanted to evaluate whether the idea has some merit or
not. That said, it survives make installcheck-parallel.
Some very preliminary performance results indicate gains of roughly
25% (pgbench -cj 796 -M prepared -f 'SELECT 1'), 15% (pgbench -s 300 -S
-cj 796), and 2% (pgbench -cj 96 -s 300) on a 4 x E5-4620 system. Even on my
laptop I can measure benefits in a read-only, highly concurrent
workload, although unsurprisingly much smaller ones.
Now, these are all somewhat extreme workloads, but still. It's a nice
improvement for a quick POC.
So far the implemented idea is to just completely wipe the cached
snapshot every time somebody commits. Afterwards I've not been able to
see GetSnapshotData() in the profile at all - so possibly that is
actually sufficient?
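In terms of the sketch above, that is nothing more than clearing the
flag in the commit path; the POC does the equivalent by resetting
ProcGlobal->cached_snapshot_valid while ProcArrayEndTransaction() holds
ProcArrayLock exclusively:

/* Continuing the illustrative SnapCache sketch: invalidate on commit. */
static void
snap_cache_invalidate(SnapCache *cache)
{
    pthread_rwlock_wrlock(&cache->lock);
    cache->valid = false;       /* next snap_cache_get() will miss */
    pthread_rwlock_unlock(&cache->lock);
}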
This implementation probably has major holes. For example, it probably ends up
not really advancing the xmin horizon when a long-running read-only
transaction without an xid commits...
Comments about the idea?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Heavily-WIP-Cache-snapshots-in-GetSnapshotData.patch (text/x-patch; charset=us-ascii)
From 3f800e9363909d2fcf80cb5f9b4f68579a3cb328 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 1 Feb 2015 21:04:42 +0100
Subject: [PATCH] Heavily-WIP: Cache snapshots in GetSnapshotData()
---
src/backend/commands/cluster.c | 1 +
src/backend/storage/ipc/procarray.c | 67 ++++++++++++++++++++++++++++++++-----
src/backend/storage/lmgr/proc.c | 13 +++++++
src/include/storage/proc.h | 6 ++++
4 files changed, 78 insertions(+), 9 deletions(-)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index dc1b37c..3def86a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1558,6 +1558,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
elog(ERROR, "cache lookup failed for relation %u", OIDOldHeap);
relform = (Form_pg_class) GETSTRUCT(reltup);
+ Assert(TransactionIdIsNormal(frozenXid));
relform->relfrozenxid = frozenXid;
relform->relminmxid = cutoffMulti;
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index a1ebc72..66be489 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -421,6 +421,8 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+ ProcGlobal->cached_snapshot_valid = false;
+
LWLockRelease(ProcArrayLock);
}
else
@@ -1403,6 +1405,8 @@ GetSnapshotData(Snapshot snapshot)
errmsg("out of memory")));
}
+ snapshot->takenDuringRecovery = RecoveryInProgress();
+
/*
* It is sufficient to get shared lock on ProcArrayLock, even if we are
* going to set MyPgXact->xmin.
@@ -1417,9 +1421,32 @@ GetSnapshotData(Snapshot snapshot)
/* initialize xmin calculation with xmax */
globalxmin = xmin = xmax;
- snapshot->takenDuringRecovery = RecoveryInProgress();
+ if (!snapshot->takenDuringRecovery && ProcGlobal->cached_snapshot_valid)
+ {
+ Snapshot csnap = &ProcGlobal->cached_snapshot;
+ TransactionId *saved_xip;
+ TransactionId *saved_subxip;
+
+ saved_xip = snapshot->xip;
+ saved_subxip = snapshot->subxip;
+
+ memcpy(snapshot, csnap, sizeof(SnapshotData));
+
+ snapshot->xip = saved_xip;
+ snapshot->subxip = saved_subxip;
+
+ memcpy(snapshot->xip, csnap->xip,
+ sizeof(TransactionId) * csnap->xcnt);
+ memcpy(snapshot->subxip, csnap->subxip,
+ sizeof(TransactionId) * csnap->subxcnt);
- if (!snapshot->takenDuringRecovery)
+ globalxmin = ProcGlobal->cached_snapshot_globalxmin;
+ xmin = csnap->xmin;
+
+ Assert(TransactionIdIsValid(globalxmin));
+ Assert(TransactionIdIsValid(xmin));
+ }
+ else if (!snapshot->takenDuringRecovery)
{
int *pgprocnos = arrayP->pgprocnos;
int numProcs;
@@ -1437,14 +1464,11 @@ GetSnapshotData(Snapshot snapshot)
TransactionId xid;
/*
- * Backend is doing logical decoding which manages xmin
- * separately, check below.
+ * Ignore procs running LAZY VACUUM (which don't need to retain
+ * rows) and backends doing logical decoding (which manages xmin
+ * separately, check below).
*/
- if (pgxact->vacuumFlags & PROC_IN_LOGICAL_DECODING)
- continue;
-
- /* Ignore procs running LAZY VACUUM */
- if (pgxact->vacuumFlags & PROC_IN_VACUUM)
+ if (pgxact->vacuumFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
continue;
/* Update globalxmin to be the smallest valid xmin */
@@ -1513,6 +1537,31 @@ GetSnapshotData(Snapshot snapshot)
}
}
}
+
+ /* update cache */
+ {
+ Snapshot csnap = &ProcGlobal->cached_snapshot;
+ TransactionId *saved_xip;
+ TransactionId *saved_subxip;
+
+ ProcGlobal->cached_snapshot_globalxmin = globalxmin;
+
+ saved_xip = csnap->xip;
+ saved_subxip = csnap->subxip;
+ memcpy(csnap, snapshot, sizeof(SnapshotData));
+ csnap->xip = saved_xip;
+ csnap->subxip = saved_subxip;
+
+ /* not yet stored. Yuck */
+ csnap->xmin = xmin;
+
+ memcpy(csnap->xip, snapshot->xip,
+ sizeof(TransactionId) * csnap->xcnt);
+ memcpy(csnap->subxip, snapshot->subxip,
+ sizeof(TransactionId) * csnap->subxcnt);
+ ProcGlobal->cached_snapshot_valid = true;
+ }
+
}
else
{
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 65e8afe..a6ef687 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -112,6 +112,13 @@ ProcGlobalShmemSize(void)
size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGXACT)));
size = add_size(size, mul_size(max_prepared_xacts, sizeof(PGXACT)));
+ /* for the cached snapshot */
+#define PROCARRAY_MAXPROCS (MaxBackends + max_prepared_xacts)
+ size = add_size(size, mul_size(sizeof(TransactionId), PROCARRAY_MAXPROCS));
+#define TOTAL_MAX_CACHED_SUBXIDS \
+ ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
+ size = add_size(size, mul_size(sizeof(TransactionId), TOTAL_MAX_CACHED_SUBXIDS));
+
return size;
}
@@ -269,6 +276,12 @@ InitProcGlobal(void)
/* Create ProcStructLock spinlock, too */
ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
SpinLockInit(ProcStructLock);
+
+ /* cached snapshot */
+ ProcGlobal->cached_snapshot_valid = false;
+ ProcGlobal->cached_snapshot.xip = ShmemAlloc(PROCARRAY_MAXPROCS * sizeof(TransactionId));
+ ProcGlobal->cached_snapshot.subxip = ShmemAlloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId));
+ ProcGlobal->cached_snapshot_globalxmin = InvalidTransactionId;
}
/*
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index d194f38..f483d3b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -16,6 +16,7 @@
#include "access/xlogdefs.h"
#include "lib/ilist.h"
+#include "utils/snapshot.h"
#include "storage/latch.h"
#include "storage/lock.h"
#include "storage/pg_sema.h"
@@ -206,6 +207,11 @@ typedef struct PROC_HDR
int startupProcPid;
/* Buffer id of the buffer that Startup process waits for pin on, or -1 */
int startupBufferPinWaitBufId;
+
+ /* Cached snapshot */
+ bool cached_snapshot_valid;
+ SnapshotData cached_snapshot;
+ TransactionId cached_snapshot_globalxmin;
} PROC_HDR;
extern PROC_HDR *ProcGlobal;
--
2.2.1.212.gc5b9256
On Mon, Feb 2, 2015 at 8:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
I've, for a while, pondered whether we couldn't find an easier way than
CSN to make snapshots cheaper as GetSnapshotData() very frequently is
one of the top profile entries. Especially on bigger servers, where the
pretty much guaranteed cache misses are quite visible.
My idea is based on the observation that even in very write-heavy
environments the frequency of relevant PGXACT changes is noticeably
lower than the frequency of GetSnapshotData() calls.
Comments about the idea?
I have done some tests with this patch to see the benefit, and it
seems to me this patch helps in reducing the contention around
ProcArrayLock, though the increase in TPS (around 2~4% in TPC-B tests)
is not very high.
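(Counters like the ones below are what a server built with LWLOCK_STATS
defined prints for each backend at exit; a rough sketch of one way to
collect them - the build options and log paths are only examples, adjust
to your environment:)

# Example only: rebuild with per-backend lwlock counters enabled
./configure CPPFLAGS='-DLWLOCK_STATS' --prefix=$HOME/pg-lwstats
make -j && make install
# Run the workload, then pull the relevant lines from the server log.
# In this output format ProcArrayLock is "lwlock main 4" and
# CLogControlLock is "lwlock main 11".
grep -h 'lwlock main 4:' $PGDATA/pg_log/*.log
grep -h 'lwlock main 11:' $PGDATA/pg_log/*.log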
LWLock_Stats data
-----------------------------
Non-Default postgresql.conf settings
------------------------------------------------------
scale_factor = 3000
shared_buffers=8GB
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
autovacuum=off
synchronous_commit=off
Tests were done on a POWER8 machine.
pgbench (TPC-B test)
./pgbench -c 64 -j 64 -T 1200 -M prepared postgres
Without Patch (HEAD - e5f455f5) - the commit used is slightly old, but
I don't think that matters for this test.
ProcArrayLock
--------------
PID 68803 lwlock main 4: shacq 1278232 exacq 124646 blk 231405 spindelay
2904 dequeue self 63701
PID 68888 lwlock main 4: shacq 1325048 exacq 129176 blk 241605 spindelay
3457 dequeue self 65203
PID 68798 lwlock main 4: shacq 1308114 exacq 127462 blk 235331 spindelay
2829 dequeue self 64893
PID 68880 lwlock main 4: shacq 1306959 exacq 127348 blk 235041 spindelay
3007 dequeue self 64662
PID 68894 lwlock main 4: shacq 1307710 exacq 127375 blk 234356 spindelay
3474 dequeue self 64417
PID 68858 lwlock main 4: shacq 1331912 exacq 129671 blk 238083 spindelay
3043 dequeue self 65257
CLogControlLock
----------------
PID 68895 lwlock main 11: shacq 483080 exacq 226903 blk 38253 spindelay 12
dequeue self 37128
PID 68812 lwlock main 11: shacq 471646 exacq 223555 blk 37703 spindelay 15
dequeue self 36616
PID 68888 lwlock main 11: shacq 475769 exacq 226359 blk 38570 spindelay 6
dequeue self 35804
PID 68798 lwlock main 11: shacq 473370 exacq 222993 blk 36806 spindelay 7
dequeue self 37163
PID 68880 lwlock main 11: shacq 472101 exacq 223031 blk 36577 spindelay 5
dequeue self 37544
With Patch -
ProcArrayLock
--------------
PID 159124 lwlock main 4: shacq 1196432 exacq 118140 blk 128880 spindelay
4601 dequeue self 91197
PID 159171 lwlock main 4: shacq 1322517 exacq 130560 blk 141830 spindelay
5180 dequeue self 101283
PID 159139 lwlock main 4: shacq 1294249 exacq 127877 blk 139318 spindelay
5735 dequeue self 100740
PID 159199 lwlock main 4: shacq 1077223 exacq 106398 blk 115625 spindelay
3627 dequeue self 81980
PID 159193 lwlock main 4: shacq 1364230 exacq 134757 blk 146335 spindelay
5390 dequeue self 103907
CLogControlLock
----------------
PID 159124 lwlock main 11: shacq 443221 exacq 202970 blk 88076 spindelay
533 dequeue self 70673
PID 159171 lwlock main 11: shacq 488979 exacq 227730 blk 103233 spindelay
597 dequeue self 76776
PID 159139 lwlock main 11: shacq 469582 exacq 218877 blk 94736 spindelay
493 dequeue self 74813
PID 159199 lwlock main 11: shacq 391470 exacq 181381 blk 74061 spindelay
309 dequeue self 64393
PID 159193 lwlock main 11: shacq 499489 exacq 235390 blk 106459 spindelay
578 dequeue self 76922
We can clearly see that the *blk* count for ProcArrayLock has decreased
significantly with the patch, though the blk count for CLogControlLock
has increased; that is just the contention shifting.
+1 to proceed with this patch for 9.6, as I think it improves the
situation compared to the current code.
Also, I have seen a crash once in the below test scenario (scale factor
300, other settings same as above):
./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2015-05-20 19:56:39 +0530, Amit Kapila wrote:
I have done some tests with this patch to see the benefit, and it
seems to me this patch helps in reducing the contention around
ProcArrayLock, though the increase in TPS (around 2~4% in TPC-B tests)
is not very high.
pgbench (TPC-B test)
./pgbench -c 64 -j 64 -T 1200 -M prepared postgres
Hm, so it's a read-mostly test. It's probably not that badly contended on
the snapshot acquisition itself. I'd guess an 80/20 read/write mix or so
would be more interesting for the cases where we hit this really badly.
Without Patch (HEAD - e5f455f5) - Commit used is slightly old, but I
don't think that matters for this test.
Agreed, shouldn't make much of a difference.
+1 to proceed with this patch for 9.6, as I think it improves the
situation compared to the current code.
Yea, I think so too.
Also, I have seen a crash once in the below test scenario (scale factor
300, other settings same as above):
./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
The patch as-is really is just a proof of concept. I wrote it on the
flight back from FOSDEM...
Thanks for the look.
Greetings,
Andres Freund
On Tue, May 26, 2015 at 12:10 AM, Andres Freund <andres@anarazel.de> wrote:
On 2015-05-20 19:56:39 +0530, Amit Kapila wrote:
I have done some tests with this patch to see the benefit, and it
seems to me this patch helps in reducing the contention around
ProcArrayLock, though the increase in TPS (around 2~4% in TPC-B tests)
is not very high.
pgbench (TPC-B test)
./pgbench -c 64 -j 64 -T 1200 -M prepared postgres
Hm, so it's a read-mostly test.
Write mostly, not *read* mostly.
It's probably not that badly contended on
the snapshot acquisition itself. I'd guess an 80/20 read/write mix or so
would be more interesting for the cases where we hit this really badly.
Yes, an 80/20 read/write mix would be a good test for this patch, and I
think such a load is used by many applications (it's quite common in
telecom, especially billing-related applications); currently we don't
have such a test handy to measure performance.
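Something along these lines should approximate such a mix with stock
pgbench, though - just a sketch, and the file names as well as the 4:1
repetition of -f options are only one possible way to do it:

-- read.sql (pgbench custom script: select one random account)
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

-- write.sql (single-row update)
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom delta -5000 5000
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;

# pgbench picks one of the -f scripts at random for each transaction, so
# listing read.sql four times and write.sql once gives roughly 80/20:
./pgbench -c 64 -j 64 -T 1200 -M prepared \
  -f read.sql -f read.sql -f read.sql -f read.sql -f write.sql postgres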
On a side note, I think it would be good if we could add such a test to
pgbench, or maybe use some test which adheres to the TPC-C specification.
In fact, I remember people posting test results with such a workload
showing ProcArrayLock as a contention point [1].
[1]: /messages/by-id/E8870A2F6A4B1045B1C292B77EAB207C77069A80@SZXEMA501-MBX.china.huawei.com
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 2, 2015 at 8:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
My idea is to simply cache the result of GetSnapshotData() in shared
memory and invalidate it every time something happens that affects the
result. [...]
Comments about the idea?
FWIW I'd presented a somewhat similar idea, and also a patch, a few years
back, and from what I remember I'd seen good results with it. So +1 for
the idea.
/messages/by-id/CABOikdMsJ4OsxtA7XBV2quhKYUo_4105fJF4N+uyRoyBAzSuuQ@mail.gmail.com
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 5/25/15 10:04 PM, Amit Kapila wrote:
On Tue, May 26, 2015 at 12:10 AM, Andres Freund <andres@anarazel.de> wrote:
I'd guess an 80/20 read/write mix or so would be more interesting for the
cases where we hit this really badly.
Yes, an 80/20 read/write mix would be a good test for this patch [...]
On a side note, I think it would be good if we could add such a test to
pgbench, or maybe use some test which adheres to the TPC-C
specification. [...]
Anything happen with this?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Nov 1, 2015 at 8:46 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 5/25/15 10:04 PM, Amit Kapila wrote:
[...]
Anything happen with this?
No. I think one has to study the impact of this patch on the latest code,
especially after commit 0e141c0f.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com