Improving connection scalability: GetSnapshotData()
Hi,
I think postgres' issues with scaling to larger numbers of connections
are a serious problem in the field. While poolers can address some of
that, given the issues around prepared statements, transaction state,
etc., I don't think that's sufficient in many cases. It also adds
latency.
Nor do I think the argument that one shouldn't have more than a few
dozen connections holds particularly much water. As clients have think
time, and database results have to be sent/received (most clients don't
use pipelining), and as many applications have many application servers
with individual connection pools, it's very common to need more
connections than postgres can easily deal with.
The largest reason for that is GetSnapshotData(). It scales poorly to
larger connection counts. Part of that is obviously its O(connections)
nature, but I always thought it had to be more. I've seen production
workloads spending > 98% of the cpu time in GetSnapshotData().
After a lot of analysis and experimentation I figured out that the
primary reason for this is PGXACT->xmin. Even the simplest transaction
modifies MyPgXact->xmin several times during its lifetime (IIRC twice
(snapshot & release) for exec_bind_message(), same for
exec_execute_message(), then again as part of EOXact processing). That
means a backend doing GetSnapshotData() on a system with a number of
other connections active is very likely to hit PGXACT cachelines that
are owned by another cpu / set of cpus / socket. The larger the system
is, the worse the consequences of this are.
This problem is most prominent (and harder to fix) for xmin, but it also
exists for the other fields in PGXACT. We rarely have xid, nxids,
overflow, or vacuumFlags set, yet constantly write them, leading to
cross-node traffic.
The second biggest problem is that the indirection through pgprocnos
that GetSnapshotData() has to go through to get each backend's xmin is
very unfriendly for a pipelined CPU (i.e. all that postgres runs
on). There's basically a stall at the end of every loop iteration -
which is exacerbated by there being so many cache misses.
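For illustration, the pre-patch scan has roughly this shape (heavily
simplified from procarray.c, not the exact code): every iteration first
has to load pgprocnos[index] before the dependent load of the PGXACT it
points to can even start, so the loads can't be overlapped and each
cache miss stalls the loop.

    for (index = 0; index < numProcs; index++)
    {
        int         pgprocno = arrayP->pgprocnos[index];
        PGXACT     *pgxact = &allPgXact[pgprocno];
        TransactionId xid;

        /* dependent load: can only start once pgprocno is known */
        xid = UINT32_ACCESS_ONCE(pgxact->xid);

        /* ... per-backend snapshot bookkeeping elided ... */
    }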
It's fairly easy to avoid unnecessarily dirtying cachelines for all the
PGXACT fields except xmin, because xmin actually needs to be visible to
other backends.
While it sounds almost trivial in hindsight, it took me a long while to
grasp a solution to a big part of this problem: We don't actually need
to look at PGXACT->xmin to compute a snapshot. The only reason
GetSnapshotData() does so is that it also computes
RecentGlobal[Data]Xmin.
But we don't actually need them all that frequently. They're primarily
used as horizons for heap_page_prune_opt() etc. For one, while pruning
is really important, it doesn't happen *all* the time. More importantly,
a RecentGlobalXmin from an earlier transaction is actually sufficient
for most pruning requests, especially when there is a larger percentage
of reading than updating transactions (very common).
By having GetSnapshotData() compute an accurate upper bound after which
we are certain not to be able to prune (basically the transaction's
xmin, slot horizons, etc.), and a conservative lower bound below which
we are definitely able to prune, we can allow some pruning actions to
happen. If a pruning request (or something similar) encounters an xid
between those, an accurate lower bound can be computed.
That allows us to avoid looking at PGXACT->xmin.
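The shape of such a test might be roughly the following (a sketch only:
the struct fields and the recompute helper are names I made up here,
the actual patch's InvisibleToEveryoneTestFullXid() differs in detail):

    typedef struct InvisibleToEveryoneState
    {
        /* xids preceding this are definitely invisible to everyone */
        FullTransactionId definitely_invisible;
        /* xids at/after this may still be visible to someone */
        FullTransactionId maybe_visible;
    } InvisibleToEveryoneState;

    static bool
    InvisibleToEveryoneTestFullXid(InvisibleToEveryoneState *state,
                                   FullTransactionId fxid)
    {
        /* below the conservative lower bound: prunable, no locking needed */
        if (FullTransactionIdPrecedes(fxid, state->definitely_invisible))
            return true;

        /* at or after the accurate upper bound: certainly not prunable */
        if (!FullTransactionIdPrecedes(fxid, state->maybe_visible))
            return false;

        /* in between: compute an accurate horizon once, then re-test */
        InvisibleToEveryoneRecomputeHorizon(state); /* hypothetical helper */
        return FullTransactionIdPrecedes(fxid, state->definitely_invisible);
    }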
To address the second big problem (the indirection), we can instead pack
the contents of PGXACT tightly, just like we do for pgprocnos. In the
attached series, I introduced separate arrays for xids, vacuumFlags,
nsubxids.
The reason for splitting them is that they change at different rates
and have different sizes. In a read-mostly workload, most backends are
not going to have an xid, making the xids array almost constant. As
long as all xids are unassigned, GetSnapshotData() doesn't need to look
at anything else, which makes it sensible to check the xid first.
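A rough sketch of that fast path, inside GetSnapshotData() where xmax
and the result arrays are already set up (simplified, not the patch
code verbatim; the array and macro names follow the attached patches):

    for (int i = 0; i < numProcs; i++)
    {
        /* dense array: one entry per in-use backend, scanned sequentially */
        TransactionId xid = UINT32_ACCESS_ONCE(ProcGlobal->xids[i]);

        /*
         * In a read-mostly workload most entries are invalid, so most
         * iterations never touch the vacuumFlags/nsubxids arrays at all.
         */
        if (!TransactionIdIsNormal(xid))
            continue;

        /* xids at/after xmax are treated as running anyway, don't store them */
        if (!NormalTransactionIdPrecedes(xid, xmax))
            continue;

        snapshot->xip[count++] = xid;
    }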
Here are some numbers for the submitted patch series. I had to cull
some further improvements to make it more manageable, but I think the
numbers still are quite convincing.
The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.
conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371
I think these are pretty nice results.
Note that the patchset currently does not implement snapshot_too_old;
the rest of the regression tests do pass.
One further cool consequence of the fact that GetSnapshotData()'s
results can be made to depend only on the set of xids in progress is
that caching the results of GetSnapshotData() becomes almost trivial at
that point: We only need to recompute snapshots when a toplevel
transaction commits/aborts.
So we can avoid rebuilding snapshots when no commit has happened since
the last one was built. That amounts to assigning a current 'commit
sequence number' to the snapshot, and checking it against the current
number at the time of the next GetSnapshotData() call. Well, turns out
there's this "LSN" thing we assign to commits (there are some small
issues with that though). I've experimented with that, and it
considerably further improves the numbers above: it yields a higher
peak throughput, and more importantly it almost entirely removes the
throughput regression from 2000 connections onwards.
I'm still working on cleaning that part of the patch up, I'll post it in
a bit.
The series currently consists of:
0001-0005: Fixes and assert improvements that are independent of the patch, but
are hit by the new code (see also separate thread).
0006: Move delayChkpt from PGXACT to PGPROC; it's rarely checked & frequently modified
0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
This is the most crucial piece. Instead of code directly using
RecentOldestXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).
Those functions use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls in between.
There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?
To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?
This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old snapshot stuff a lot more precise with this - instead of
always triggering conflicts when a RecentGlobalXmin is too old, we
can do so only in the cases we actually remove a row. I ran out of
energy threading that through heap_page_prune() and
HeapTupleSatisfiesVacuum().
0008: Move PGXACT->xmin back to PGPROC.
Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.
0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access
These successively move the remaining PGXACT fields into separate
arrays in ProcGlobal, and adjust GetSnapshotData() to take
advantage. Those arrays are dense in the sense that they only
contain data for PGPROCs that are in use (i.e. when disconnecting,
the array is moved around).
I think xid, and vacuumFlags are pretty reasonable. But need
cleanup, obviously:
- The biggest cleanup would be to add a few helper functions for
accessing the values, rather than open coding that.
- Perhaps we should call the ProcGlobal ones 'cached', and name
the PGPROC ones as the one true source of truth?
For subxid I thought it'd be nice to have nxids and overflow be
only one number. But that probably was the wrong call? Now
TransactionIdInProgress() cannot look at the subxids that did
fit in PGPROC.subxid. I'm not sure that's important, given the
likelihood of misses? But I'd probably still have the subxid
array be one of {uint8 nsubxids; bool overflowed} instead.
To keep the arrays dense they copy the logic for pgprocnos, which
means that ProcArrayAdd/Remove move things around (see the sketch
after this list). Unfortunately that requires holding both
ProcArrayLock and XidGenLock currently (to avoid
GetNewTransactionId() having to hold ProcArrayLock). But that
doesn't seem too bad?
0012: Remove now unused PGXACT.
There's no reason to have it anymore.
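The compaction on backend exit, referenced above, could look roughly
like this. This is a deliberately simplified sketch, not the patch's
actual ProcArrayRemove(): here the last entry is simply moved into the
vacated slot, whereas the real code keeps the arrays ordered and
therefore shifts entries, and it also has to maintain
latestCompletedXid etc.

    static void
    ProcArrayCompactDenseArrays(ProcArrayStruct *arrayP, int removed_off)
    {
        int     last_off = arrayP->numProcs - 1;
        PGPROC *moved = &allProcs[arrayP->pgprocnos[last_off]];

        /* move the last entry of each dense array into the vacated slot */
        arrayP->pgprocnos[removed_off] = arrayP->pgprocnos[last_off];
        ProcGlobal->xids[removed_off] = ProcGlobal->xids[last_off];
        ProcGlobal->vacuumFlags[removed_off] = ProcGlobal->vacuumFlags[last_off];
        ProcGlobal->nsubxids[removed_off] = ProcGlobal->nsubxids[last_off];

        /* remember where that backend's entries now live */
        moved->pgxactoff = removed_off;

        arrayP->numProcs--;
    }

Both ProcArrayLock and XidGenLock would need to be held exclusively
around this, since GetNewTransactionId() writes the xid entry under
XidGenLock only.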
The patch series is also available at
https://github.com/anarazel/postgres/tree/pgxact-split
Greetings,
Andres Freund
Hi,
On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371

I think these are pretty nice results.
Attached as a graph as well.
On Sun, Mar 1, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371

I think these are pretty nice results.
Nice improvement. +1 for improving the scalability for higher connection count.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
Here are some numbers for the submitted patch series. I'd to cull some
further improvements to make it more manageable, but I think the numbers
still are quite convincing.

The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371

I think these are pretty nice results.
One further cool recognition of the fact that GetSnapshotData()'s
results can be made to only depend on the set of xids in progress, is
that caching the results of GetSnapshotData() is almost trivial at that
point: We only need to recompute snapshots when a toplevel transaction
commits/aborts.

So we can avoid rebuilding snapshots when no commt has happened since it
was last built. Which amounts to assigning a current 'commit sequence
number' to the snapshot, and checking that against the current number
at the time of the next GetSnapshotData() call. Well, turns out there's
this "LSN" thing we assign to commits (there are some small issues with
that though). I've experimented with that, and it considerably further
improves the numbers above. Both with a higher peak throughput, but more
importantly it almost entirely removes the throughput regression from
2000 connections onwards.

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.
I triggered a longer run on the same hardware, which also includes
numbers for the caching patch.
nclients master pgxact-split pgxact-split-cache
1 29742.805074 29086.874404 28120.709885
2 58653.005921 56610.432919 57343.937924
3 116580.383993 115102.94057 117512.656103
4 150821.023662 154130.354635 152053.714824
5 186679.754357 189585.156519 191095.841847
6 219013.756252 223053.409306 224480.026711
7 256861.673892 256709.57311 262427.179555
8 291495.547691 294311.524297 296245.219028
9 332835.641015 333223.666809 335460.280487
10 367883.74842 373562.206447 375682.894433
15 561008.204553 578601.577916 587542.061911
20 748000.911053 794048.140682 810964.700467
25 904581.660543 1037279.089703 1043615.577083
30 999231.007768 1251113.123461 1288276.726489
35 1001274.289847 1438640.653822 1438508.432425
40 991672.445199 1518100.079695 1573310.171868
45 994427.395069 1575758.31948 1649264.339117
50 1017561.371878 1654776.716703 1715762.303282
60 993943.210188 1720318.989894 1789698.632656
70 971379.995255 1729836.303817 1819477.25356
80 966276.137538 1744019.347399 1842248.57152
90 901175.211649 1768907.069263 1847823.970726
100 803175.74326 1784636.397822 1865795.782943
125 664438.039582 1806275.514545 1870983.64688
150 623562.201749 1796229.009658 1876529.428419
175 680683.150597 1809321.487338 1910694.40987
200 668413.988251 1833457.942035 1878391.674828
225 682786.299485 1816577.462613 1884587.77743
250 727308.562076 1825796.324814 1864692.025853
275 676295.999761 1843098.107926 1908698.584573
300 698831.398432 1832068.168744 1892735.290045
400 661534.639489 1859641.983234 1898606.247281
500 645149.788352 1851124.475202 1888589.134422
600 740636.323211 1875152.669115 1880653.747185
700 858645.363292 1833527.505826 1874627.969414
800 858287.957814 1841914.668668 1892106.319085
900 882204.933544 1850998.221969 1868260.041595
1000 910988.551206 1836336.091652 1862945.18557
1500 917727.92827 1808822.338465 1864150.00307
2000 982137.053108 1813070.209217 1877104.342864
3000 1013514.639108 1753026.733843 1870416.924248
4000 1025476.80688 1600598.543635 1859908.314496
5000 1019889.160511 1534501.389169 1870132.571895
7500 968558.864242 1352137.828569 1853825.376742
10000 887558.112017 1198321.352461 1867384.381886
15000 687766.593628 950788.434914 1710509.977169
The odd dip for master between 90 and 700 connections looks like it's
not directly related to GetSnapshotData(). It looks like it's related to
the linux scheduler and virtualization. When a pgbench thread and
postgres backend need to swap who gets executed, and both are on
different CPUs, the wakeup is more expensive when the target CPU is idle
or isn't going to reschedule soon. In the expensive path an
inter-processor interrupt (IPI) gets triggered, which requires exiting
the VM (which is really expensive on azure, apparently). I can
trigger similar behaviour for the other runs by renicing, albeit on a
slightly smaller scale.
I'll try to find a larger system that's not virtualized :/.
Greetings,
Andres Freund
Attachment: csn.diff
diff --git i/src/include/access/transam.h w/src/include/access/transam.h
index e37ecb52fa1..2b46d2b76a0 100644
--- i/src/include/access/transam.h
+++ w/src/include/access/transam.h
@@ -183,6 +183,8 @@ typedef struct VariableCacheData
TransactionId latestCompletedXid; /* newest XID that has committed or
* aborted */
+ uint64 csn;
+
/*
* These fields are protected by CLogTruncationLock
*/
diff --git i/src/include/utils/snapshot.h w/src/include/utils/snapshot.h
index 2bc415376ac..389f18cf8a5 100644
--- i/src/include/utils/snapshot.h
+++ w/src/include/utils/snapshot.h
@@ -207,6 +207,8 @@ typedef struct SnapshotData
TimestampTz whenTaken; /* timestamp when snapshot was taken */
XLogRecPtr lsn; /* position in the WAL stream when taken */
+
+ uint64 csn;
} SnapshotData;
#endif /* SNAPSHOT_H */
diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 617853f56fc..520a79c9f73 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -65,6 +65,7 @@
#include "utils/snapmgr.h"
#define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var))))
+#define UINT64_ACCESS_ONCE(var) ((uint64)(*((volatile uint64 *)&(var))))
/* Our shared memory area */
typedef struct ProcArrayStruct
@@ -266,6 +267,7 @@ CreateSharedProcArray(void)
procArray->lastOverflowedXid = InvalidTransactionId;
procArray->replication_slot_xmin = InvalidTransactionId;
procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+ ShmemVariableCache->csn = 1;
}
allProcs = ProcGlobal->allProcs;
@@ -404,6 +406,8 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+ /* Same with CSN */
+ ShmemVariableCache->csn++;
ProcGlobal->xids[proc->pgxactoff] = 0;
ProcGlobal->nsubxids[proc->pgxactoff] = 0;
@@ -531,6 +535,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
{
size_t pgxactoff = proc->pgxactoff;
+ Assert(LWLockHeldByMe(ProcArrayLock));
Assert(TransactionIdIsValid(ProcGlobal->xids[pgxactoff]));
Assert(ProcGlobal->xids[pgxactoff] == proc->xidCopy);
@@ -561,6 +566,9 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
latestXid))
ShmemVariableCache->latestCompletedXid = latestXid;
+
+ /* Same with CSN */
+ ShmemVariableCache->csn++;
}
/*
@@ -1637,7 +1645,7 @@ GetSnapshotData(Snapshot snapshot)
TransactionId oldestxid;
int mypgxactoff;
TransactionId myxid;
-
+ uint64 csn;
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
@@ -1675,6 +1683,26 @@ GetSnapshotData(Snapshot snapshot)
errmsg("out of memory")));
}
+#if 1
+ if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+ UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)
+ {
+ if (!TransactionIdIsValid(MyProc->xmin))
+ MyProc->xmin = TransactionXmin = snapshot->xmin;
+ RecentXmin = snapshot->xmin;
+ Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
+
+ snapshot->curcid = GetCurrentCommandId(false);
+ snapshot->active_count = 0;
+ snapshot->regd_count = 0;
+ snapshot->copied = false;
+ snapshot->lsn = InvalidXLogRecPtr;
+ snapshot->whenTaken = 0;
+
+ return snapshot;
+ }
+#endif
+
/*
* It is sufficient to get shared lock on ProcArrayLock, even if we are
* going to set MyProc->xmin.
@@ -1687,6 +1715,7 @@ GetSnapshotData(Snapshot snapshot)
nextfxid = ShmemVariableCache->nextFullXid;
oldestxid = ShmemVariableCache->oldestXid;
+ csn = ShmemVariableCache->csn;
/* xmax is always latestCompletedXid + 1 */
xmax = ShmemVariableCache->latestCompletedXid;
@@ -1941,6 +1970,8 @@ GetSnapshotData(Snapshot snapshot)
snapshot->regd_count = 0;
snapshot->copied = false;
+ snapshot->csn = csn;
+
if (old_snapshot_threshold < 0)
{
/*
diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 41914f1a6c7..9d8ebe51307 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -604,6 +604,8 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
CurrentSnapshot->takenDuringRecovery = sourcesnap->takenDuringRecovery;
/* NB: curcid should NOT be copied, it's a local matter */
+ CurrentSnapshot->csn = 0;
+
/*
* Now we have to fix what GetSnapshotData did with MyPgXact->xmin and
* TransactionXmin. There is a race condition: to make sure we are not
@@ -679,6 +681,7 @@ CopySnapshot(Snapshot snapshot)
newsnap->regd_count = 0;
newsnap->active_count = 0;
newsnap->copied = true;
+ newsnap->csn = 0;
/* setup XID array */
if (snapshot->xcnt > 0)
@@ -2180,6 +2183,7 @@ RestoreSnapshot(char *start_address)
snapshot->curcid = serialized_snapshot.curcid;
snapshot->whenTaken = serialized_snapshot.whenTaken;
snapshot->lsn = serialized_snapshot.lsn;
+ snapshot->csn = 0;
/* Copy XIDs, if present. */
if (serialized_snapshot.xcnt > 0)
On 3/1/20 3:36 AM, Andres Freund wrote:
I think these are pretty nice results.
Indeed they are.
Is the target version PG13 or PG14? It seems like a pretty big patch to
go in the last commitfest for PG13.
Regards,
--
-David
david@pgmasters.net
Hi,
Nice performance gains.
On Sun, 1 Mar 2020 at 21:36, Andres Freund <andres@anarazel.de> wrote:
The series currently consists out of:
0001-0005: Fixes and assert improvements that are independent of the patch, but
are hit by the new code (see also separate thread).

0006: Move delayChkpt from PGXACT to PGPROC it's rarely checked & frequently modified
0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
This is the most crucial piece. Instead of code directly using
RecentOldestXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).

Those function use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls inbetween.

There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?

To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?

This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old snapshot stuff a lot more precise with this - instead of
always triggering conflicts when a RecentGlobalXmin is too old, we
can do so only in the cases we actually remove a row. I ran out of
energy threading that through the heap_page_prune and
HeapTupleSatisfiesVacuum.

0008: Move PGXACT->xmin back to PGPROC.
Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
I've only looked at 0001-0009 so far. I'm not quite the expert in this
area, so the review feels a bit superficial. Here's what I noted down
during my pass.
0001
1. cant't -> can't
* snapshot cant't change in the midst of a relcache build, so there's no
0002
2. I don't quite understand your change in
UpdateSubscriptionRelState(). snap seems unused. Drilling down into
SearchSysCacheCopy2, in SearchCatCacheMiss() the systable_beginscan()
passes a NULL snapshot.
the whole patch does this. I guess I don't understand why 0002 does this.
0004
3. This comment seems to have the line order swapped in bt_check_every_level
/*
* RecentGlobalXmin/B-Tree page deletion.
* This assertion matches the one in index_getnext_tid(). See note on
*/
Assert(SnapshotSet());
0006
4. Did you consider the location of 'delayChkpt' in PGPROC? Couldn't
you slot it in somewhere it would fit in existing padding?
0007
5. GinPageIsRecyclable() has no comments at all. I know that
ginvacuum.c is not exactly the model citizen for function header
comments, but likely this patch is no good reason to continue the
trend.
6. The comment rearrangement in bt_check_every_level should be in the
0004 patch.
7. struct InvisibleToEveryoneState could do with some comments
explaining the fields.
8. The header comment in GetOldestXminInt needs to be updated. It
talks about "if rel = NULL and there are no transactions", but there's
no parameter by that name now. Maybe the whole comment should be moved
down to the external implementation of the function
9. I get the idea you don't intend to keep the debug message in
InvisibleToEveryoneTestFullXid(), but if you do, then shouldn't it be
using UINT64_FORMAT?
10. teh -> the
* which is based on teh value computed when getting the current snapshot.
11. InvisibleToEveryoneCheckXid and InvisibleToEveryoneCheckFullXid
seem to have their extern modifiers in the .c file.
0009
12. iare -> are
* These iare separate from the main PGPROC array so that the most heavily
13. is -> are
* accessed data is stored contiguously in memory in as few cache lines as
14. It doesn't seem to quite make sense to talk about "this proc" in:
/*
* TransactionId of top-level transaction currently being executed by this
* proc, if running and XID is assigned; else InvalidTransactionId.
*
* Each PGPROC has a copy of its value in PGPROC.xidCopy.
*/
TransactionId *xids;
maybe "this" can be replaced with "each"
I will try to continue with the remaining patches soon. However, it
would be good to get a more complete patchset. I feel there are quite
a few XXX comments remaining for things you need to think about later,
and ... it's getting late.
On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <andres@anarazel.de> wrote:
conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371
This will clearly be really big news for lots of PostgreSQL users.
One further cool recognition of the fact that GetSnapshotData()'s
results can be made to only depend on the set of xids in progress, is
that caching the results of GetSnapshotData() is almost trivial at that
point: We only need to recompute snapshots when a toplevel transaction
commits/aborts.

So we can avoid rebuilding snapshots when no commt has happened since it
was last built. Which amounts to assigning a current 'commit sequence
number' to the snapshot, and checking that against the current number
at the time of the next GetSnapshotData() call. Well, turns out there's
this "LSN" thing we assign to commits (there are some small issues with
that though). I've experimented with that, and it considerably further
improves the numbers above. Both with a higher peak throughput, but more
importantly it almost entirely removes the throughput regression from
2000 connections onwards.

I'm still working on cleaning that part of the patch up, I'll post it in
a bit.
I looked at that part on your public pgxact-split branch. In that
version you used "CSN" rather than something based on LSNs, which I
assume avoids complications relating to WAL locking or something like
that. We should probably be careful to avoid confusion with the
pre-existing use of the term "commit sequence number" (CommitSeqNo,
CSN) that appears in predicate.c. This also calls to mind the
2013-2016 work by Ants Aasma and others[1] on CSN-based snapshots,
which is obviously a much more radical change, but really means what
it says (commits). The CSN in your patch set is used purely as a
level-change for snapshot cache invalidation IIUC, and it advances
also for aborts -- so maybe it should be called something like
completed_xact_count, using existing terminology from procarray.c.
+ if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+ UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)
Why is it OK to read ShmemVariableCache->csn without at least a read
barrier? I suppose this allows a cached snapshot to be used very soon
after a transaction commits and should be visible to you, but ...
hmmmrkwjherkjhg... I guess it must be really hard to observe any
anomaly. Let's see... maybe it's possible on a relaxed memory system
like POWER or ARM, if you use a shm flag to say "hey I just committed
a transaction", and the other guy sees the flag but can't yet see the
new CSN, so an SPI query can't see the transaction?
Another theoretical problem is the non-atomic read of a uint64 on some
32 bit platforms.
0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
This is the most crucial piece. Instead of code directly using
RecentOldestXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).

Those function use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls inbetween.

There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?

To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?
+1, as long as we don't just move the wraparound danger to the places
where we convert xids to fxids!
+/*
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that, at the time of the call, xid is within 2
+ * billion xids of rel. That e.g. can be guaranteed if the the caller assures
+ * a snapshot is held by the backend, and xid is from a table (where
+ * vacuum/freezing ensures the xid has to be within that range).
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+ uint32 rel_epoch = EpochFromFullTransactionId(rel);
+ TransactionId rel_xid = XidFromFullTransactionId(rel);
+ uint32 epoch;
+
+ /*
+ * TODO: A function to easily write an assertion ensuring that xid is
+ * between [oldestXid, nextFullXid) woudl be useful here, and in plenty
+ * other places.
+ */
+
+ if (xid > rel_xid)
+ epoch = rel_epoch - 1;
+ else
+ epoch = rel_epoch;
+
+ return FullTransactionIdFromEpochAndXid(epoch, xid);
+}
I hate it, but I don't have a better concrete suggestion right now.
Whatever I come up with amounts to the same thing on some level,
though I feel like it might be better to use an infrequently updated
oldestFxid as the lower bound in a conversion. An upper bound would
also seem better, though requires much trickier interlocking. What
you have means "it's near here!"... isn't that too prone to bugs that
are hidden because of the ambient fuzziness? A lower bound seems like
it could move extremely infrequently and therefore it'd be OK for it
to be protected by both proc array and xid gen locks (ie it'd be
recomputed when nextFxid needs to move too far ahead of it, so every
~2 billion xacts). I haven't looked at this long enough to have a
strong opinion, though.
On a more constructive note:
GetOldestXminInt() does:
LWLockAcquire(ProcArrayLock, LW_SHARED);
+ nextfxid = ShmemVariableCache->nextFullXid;
+
...
LWLockRelease(ProcArrayLock);
...
+ return FullXidViaRelative(nextfxid, result);
But nextFullXid is protected by XidGenLock; maybe that's OK from a
data freshness point of view (I'm not sure), but from an atomicity
point of view, you can't do that can you?
This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old snapshot stuff a lot more precise with this - instead of
always triggering conflicts when a RecentGlobalXmin is too old, we
can do so only in the cases we actually remove a row. I ran out of
energy threading that through the heap_page_prune and
HeapTupleSatisfiesVacuum.
CCing Kevin as an FYI.
0008: Move PGXACT->xmin back to PGPROC.
Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access

These successively move the remaining PGXACT fields into separate
arrays in ProcGlobal, and adjust GetSnapshotData() to take
advantage. Those arrays are dense in the sense that they only
contain data for PGPROCs that are in use (i.e. when disconnecting,
the array is moved around)..

I think xid, and vacuumFlags are pretty reasonable. But need
cleanup, obviously:
- The biggest cleanup would be to add a few helper functions for
accessing the values, rather than open coding that.
- Perhaps we should call the ProcGlobal ones 'cached', and name
the PGPROC ones as the one true source of truth?

For subxid I thought it'd be nice to have nxids and overflow be
only one number. But that probably was the wrong call? Now
TransactionIdInProgress() cannot look at at the subxids that did
fit in PGPROC.subxid. I'm not sure that's important, given the
likelihood of misses? But I'd probably still have the subxid
array be one of {uint8 nsubxids; bool overflowed} instead.

To keep the arrays dense they copy the logic for pgprocnos. Which
means that ProcArrayAdd/Remove move things around. Unfortunately
that requires holding both ProcArrayLock and XidGenLock currently
(to avoid GetNewTransactionId() having to hold ProcArrayLock). But
that doesn't seem too bad?
In the places where you now acquire both, I guess you also need to
release both in the error path?
[1]: /messages/by-id/CA+CSw_tEpJ=md1zgxPkjH6CWDnTDft4gBi=+P9SnoC+Wy3pKdA@mail.gmail.com
Hi,
Thanks for looking!
On 2020-03-20 18:23:03 +1300, Thomas Munro wrote:
On Sun, Mar 1, 2020 at 9:36 PM Andres Freund <andres@anarazel.de> wrote:
I'm still working on cleaning that part of the patch up, I'll post it in
a bit.

I looked at that part on your public pgxact-split branch. In that
version you used "CSN" rather than something based on LSNs, which I
assume avoids complications relating to WAL locking or something like
that.
Right, I first tried to use LSNs, but after further tinkering found that
it's too hard to address the difference between visibility order and LSN
order. I don't think there's an easy way to address the difference.
We should probably be careful to avoid confusion with the
pre-existing use of the term "commit sequence number" (CommitSeqNo,
CSN) that appears in predicate.c.
I looked at that after you mentioned it on IM. But I find it hard to
grok how it's precisely defined. There are basically no comments
explaining what it's really supposed to do, and I find the relevant code
far from easy to grok :(.
This also calls to mind the 2013-2016 work by Ants Aasma and others[1]
on CSN-based snapshots, which is obviously a much more radical change,
but really means what it says (commits).
Well, I think you could actually build some form of more dense snapshots
ontop of "my" CSN, with a bit of effort (and lot of handwaving). I don't
think they're that different concepts.
The CSN in your patch set is used purely as a level-change for
snapshot cache invalidation IIUC, and it advances also for aborts --
so maybe it should be called something like completed_xact_count,
using existing terminology from procarray.c.
I expect it to be used outside of snapshots too, in the future, FWIW.
completed_xact_count sounds good to me.
+ if (snapshot->csn != 0 && MyProc->xidCopy == InvalidTransactionId &&
+ UINT64_ACCESS_ONCE(ShmemVariableCache->csn) == snapshot->csn)

Why is it OK to read ShmemVariableCache->csn without at least a read
barrier? I suppose this allows a cached snapshot to be used very soon
after a transaction commits and should be visible to you, but ...
hmmmrkwjherkjhg... I guess it must be really hard to observe any
anomaly. Let's see... maybe it's possible on a relaxed memory system
like POWER or ARM, if you use a shm flag to say "hey I just committed
a transaction", and the other guy sees the flag but can't yet see the
new CSN, so an SPI query can't see the transaction?
Yea, it does need more thought / comments. I can't really see an actual
correctness violation though. As far as I can tell you'd never be able
to get an "older" ShmemVariableCache->csn than one since *after* the
last lock acquired/released by the current backend - which then also
means a different "ordering" would have been possible allowing the
current backend to take the snapshot earlier.
Another theoretical problem is the non-atomic read of a uint64 on some
32 bit platforms.
Yea, it probably should be a pg_atomic_uint64 to address that. I don't
think it really would cause problems, because I think it'd always end up
causing an unnecessary snapshot build. But there's no need to go there.
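A minimal sketch of what that could look like (an assumption on my
part, not what the posted patch does), using the existing pg_atomic API
so 32 bit platforms can't see torn values either:

    /* in VariableCacheData, instead of a plain uint64 */
    pg_atomic_uint64 csn;

    /* at shared memory initialization */
    pg_atomic_init_u64(&ShmemVariableCache->csn, 1);

    /* when a toplevel xact completes, ProcArrayLock held exclusively */
    pg_atomic_fetch_add_u64(&ShmemVariableCache->csn, 1);

    /* lock-free check at the top of GetSnapshotData() */
    if (snapshot->csn != 0 &&
        MyProc->xidCopy == InvalidTransactionId &&
        pg_atomic_read_u64(&ShmemVariableCache->csn) == snapshot->csn)
    {
        /* reuse the previously built snapshot contents */
    }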
0007: WIP: Introduce abstraction layer for "is tuple invisible" tests.
This is the most crucial piece. Instead of code directly using
RecentOldestXmin, there's a new set of functions for testing
whether an xid is visible (InvisibleToEveryoneTestXid() et al).Those function use new horizon boundaries computed as part of
GetSnapshotData(), and recompute an accurate boundary when the
tested xid falls inbetween.There's a bit more infrastructure needed - we need to limit how
often an accurate snapshot is computed. Probably to once per
snapshot? Or once per transaction?To avoid issues with the lower boundary getting too old and
presenting a wraparound danger, I made all the xids be
FullTransactionIds. That imo is a good thing?

+1, as long as we don't just move the wraparound danger to the places
where we convert xids to fxids!

+/*
+ * Be very careful about when to use this function. It can only safely be used
+ * when there is a guarantee that, at the time of the call, xid is within 2
+ * billion xids of rel. That e.g. can be guaranteed if the the caller assures
+ * a snapshot is held by the backend, and xid is from a table (where
+ * vacuum/freezing ensures the xid has to be within that range).
+ */
+static inline FullTransactionId
+FullXidViaRelative(FullTransactionId rel, TransactionId xid)
+{
+ uint32 rel_epoch = EpochFromFullTransactionId(rel);
+ TransactionId rel_xid = XidFromFullTransactionId(rel);
+ uint32 epoch;
+
+ /*
+ * TODO: A function to easily write an assertion ensuring that xid is
+ * between [oldestXid, nextFullXid) woudl be useful here, and in plenty
+ * other places.
+ */
+
+ if (xid > rel_xid)
+ epoch = rel_epoch - 1;
+ else
+ epoch = rel_epoch;
+
+ return FullTransactionIdFromEpochAndXid(epoch, xid);
+}

I hate it, but I don't have a better concrete suggestion right now.
Whatever I come up with amounts to the same thing on some level,
though I feel like it might be better to use an infrequently updated
oldestFxid as the lower bound in a conversion.
I am not sure it's as clearly correct to use oldestFxid in as many
cases. Normally PGPROC->xmin (PGXACT->xmin currently) should prevent the
"system wide" xid horizon from moving too far relative to that, but I
think there are more plausible problems with the "oldest" xid horizon
moving concurrently with a backend inspecting values.
It shouldn't be a problem here since the values are taken under a lock
preventing both from being moved I think, and since we're only comparing
those two values without taking anything else into account, the "global"
horizon changing concurrently wouldn't matter.
But it seems easier to understand the correctness when comparing to
nextXid?
What's the benefit of looking at an "infrequently updated" value
instead? I guess you can argue that it'd be more likely to be in cache,
but since all of this lives in a single cacheline...
An upper bound would also seem better, though requires much trickier
interlocking. What you have means "it's near here!"... isn't that too
prone to bugs that are hidden because of the ambient fuzziness?
I can't follow the last sentence. Could you expand?
On a more constructive note:
GetOldestXminInt() does:
LWLockAcquire(ProcArrayLock, LW_SHARED);
+ nextfxid = ShmemVariableCache->nextFullXid;
+
...
LWLockRelease(ProcArrayLock);
...
+ return FullXidViaRelative(nextfxid, result);

But nextFullXid is protected by XidGenLock; maybe that's OK from a
data freshness point of view (I'm not sure), but from an atomicity
point of view, you can't do that can you?
Hm. Yea, I think it's not safe against torn 64bit reads, you're right.
This patch currently breaks old_snapshot_threshold, as I've not
yet integrated it with the new functions. I think we can make the
old snapshot stuff a lot more precise with this - instead of
always triggering conflicts when a RecentGlobalXmin is too old, we
can do so only in the cases we actually remove a row. I ran out of
energy threading that through the heap_page_prune and
HeapTupleSatisfiesVacuum.

CCing Kevin as an FYI.
If anybody has an opinion on this sketch I'd be interested. I've started
to implement it, so ...
0008: Move PGXACT->xmin back to PGPROC.
Now that GetSnapshotData() doesn't access xmin anymore, we can
make it a normal field in PGPROC again.

0009: Improve GetSnapshotData() performance by avoiding indirection for xid access.
0010: Improve GetSnapshotData() performance by avoiding indirection for vacuumFlags
0011: Improve GetSnapshotData() performance by avoiding indirection for nsubxids access

These successively move the remaining PGXACT fields into separate
arrays in ProcGlobal, and adjust GetSnapshotData() to take
advantage. Those arrays are dense in the sense that they only
contain data for PGPROCs that are in use (i.e. when disconnecting,
the array is moved around)..

I think xid, and vacuumFlags are pretty reasonable. But need
cleanup, obviously:
- The biggest cleanup would be to add a few helper functions for
accessing the values, rather than open coding that.
- Perhaps we should call the ProcGlobal ones 'cached', and name
the PGPROC ones as the one true source of truth?

For subxid I thought it'd be nice to have nxids and overflow be
only one number. But that probably was the wrong call? Now
TransactionIdInProgress() cannot look at at the subxids that did
fit in PGPROC.subxid. I'm not sure that's important, given the
likelihood of misses? But I'd probably still have the subxid
array be one of {uint8 nsubxids; bool overflowed} instead.

To keep the arrays dense they copy the logic for pgprocnos. Which
means that ProcArrayAdd/Remove move things around. Unfortunately
that requires holding both ProcArrayLock and XidGenLock currently
(to avoid GetNewTransactionId() having to hold ProcArrayLock). But
that doesn't seem too bad?

In the places where you now acquire both, I guess you also need to
release both in the error path?
Hm. I guess you mean:
if (arrayP->numProcs >= arrayP->maxProcs)
{
/*
* Oops, no room. (This really shouldn't happen, since there is a
* fixed supply of PGPROC structs too, and so we should have failed
* earlier.)
*/
LWLockRelease(ProcArrayLock);
ereport(FATAL,
(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
errmsg("sorry, too many clients already")));
}
I think we should just remove the LWLockRelease? At this point we
already have set up ProcKill(), which would release all lwlocks after
the error was thrown?
Greetings,
Andres Freund
Hi,
On 2020-03-17 23:59:14 +1300, David Rowley wrote:
Nice performance gains.
Thanks.
On Sun, 1 Mar 2020 at 21:36, Andres Freund <andres@anarazel.de> wrote:
2. I don't quite understand your change in
UpdateSubscriptionRelState(). snap seems unused. Drilling down into
SearchSysCacheCopy2, in SearchCatCacheMiss() the systable_beginscan()
passes a NULL snapshot.

the whole patch does this. I guess I don't understand why 0002 does this.
See the thread at /messages/by-id/20200229052459.wzhqnbhrriezg4v2@alap3.anarazel.de
Basically, the way catalog snapshots are handled right now, it's not
correct to do much without a snapshot held. Any concurrent invalidation can
cause the catalog snapshot to be released, which can reset the backend's
xmin. Which in turn can allow for pruning etc to remove required data.
This is part of this series only because I felt I needed to add stronger
asserts to be confident in what's happening. And they started to trigger
all over :( - and weren't related to the patchset :(.
4. Did you consider the location of 'delayChkpt' in PGPROC. Couldn't
you slot it in somewhere it would fit in existing padding?

0007
Hm, maybe. I'm not sure what the best thing to do here is - there are some
arguments to be made that we should keep the fields moved from PGXACT
together on their own cacheline. Compared to some of the other stuff in
PGPROC they're still accessed from other backends fairly frequently.
5. GinPageIsRecyclable() has no comments at all. I know that
ginvacuum.c is not exactly the modal citizen for function header
comments, but likely this patch is no good reason to continue the
trend.
Well, I basically just moved the code from the macro of the same
name... I'll add something.
9. I get the idea you don't intend to keep the debug message in
InvisibleToEveryoneTestFullXid(), but if you do, then shouldn't it be
using UINT64_FORMAT?
Yea, I don't intend to keep them - they're way too verbose, even for
DEBUG*. Note that there's some advantage in the long long cast approach
- it's easier to deal with for translations IIRC.
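For what it's worth, an assumed example of that idiom (the message text
and variable are made up here): the format string stays identical on
every platform, which is what translatable messages want, whereas
UINT64_FORMAT expands differently per platform.

    elog(DEBUG3, "xid %llu considered invisible to everyone",
         (unsigned long long) U64FromFullTransactionId(fxid));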
13. is -> are
* accessed data is stored contiguously in memory in as few cache lines as
Oh? 'data are stored' sounds wrong to me, somehow.
Greetings,
Andres Freund
On Sun, Mar 29, 2020 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
13. is -> are
* accessed data is stored contiguously in memory in as few cache lines as
Oh? 'data are stored' sounds wrong to me, somehow.
In computer contexts it seems pretty well established that we treat
"data" as an uncountable noun (like "air"), so I think "is" is right
here. In maths or science contexts it's usually treated as a plural
following Latin, which admittedly sounds cleverer, but it also has a
slightly different meaning, not bits and bytes but something more like
samples or (wince) datums.
On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:
The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371

I think these are pretty nice results.
This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.
I have heard quite a few complaints about the scalability of snapshot
acquisition in Postgres. Generally from very large users that are not
well represented on the mailing lists, for a variety of reasons. The
GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
for Postgres users go, I would probably rank it second behind issues
with VACUUM.)
--
Peter Geoghegan
On Sat, Mar 28, 2020 at 06:39:32PM -0700, Peter Geoghegan wrote:
On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:
The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371

I think these are pretty nice results.
This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.
+1
I have heard quite a few complaints about the scalability of snapshot
acquisition in Postgres. Generally from very large users that are not
well represented on the mailing lists, for a variety of reasons. The
GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
for Postgres users go, I would probably rank it second behind issues
with VACUUM.)
+1
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Hi,
On 2020-03-28 18:39:32 -0700, Peter Geoghegan wrote:
I have heard quite a few complaints about the scalability of snapshot
acquisition in Postgres. Generally from very large users that are not
well represented on the mailing lists, for a variety of reasons. The
GetSnapshotData() bottleneck is a *huge* problem for us. (As problems
for Postgres users go, I would probably rank it second behind issues
with VACUUM.)
Yea, I see it similarly. For busy databases, my experience is that
vacuum is the big problem for write heavy workloads (or the write
portion), and snapshot scalability the big problem for read heavy oltp
workloads.
This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.
I've done a fair bit of cleanup, but I'm still fighting with how to
implement old_snapshot_threshold in a good way. It's not hard to get it
back to kind of working, but it requires some changes that go into the
wrong direction.
The problem basically is that the current old_snapshot_threshold
implementation just reduces OldestXmin to whatever is indicated by
old_snapshot_threshold, even if not necessary for pruning to do the
specific cleanup that's about to be done. If OldestXmin < threshold,
it'll set shared state that fails all older accesses. But that doesn't
really work well with the approach in the patch of using a lower/upper
boundary for potentially valid xmin horizons.
I think the right approach would be to split
TransactionIdLimitedForOldSnapshots() into separate parts. One that
determines the most aggressive horizon that old_snapshot_threshold
allows, and a separate part that increases the threshold after which
accesses need to error out
(i.e. SetOldSnapshotThresholdTimestamp()). Then we can only call
SetOldSnapshotThresholdTimestamp() for exactly the xids that are
removed, not for the most aggressive interpretation.
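A purely illustrative sketch of that split (these function names and
signatures are placeholders I made up, not code from the patch):

    /*
     * Part 1: report the most aggressive horizon that
     * old_snapshot_threshold would allow for this relation, without any
     * side effects.
     */
    TransactionId GetOldSnapshotThresholdLimit(Relation relation,
                                               TransactionId recentXmin);

    /*
     * Part 2: called only once rows have actually been removed; advances
     * the point after which old snapshots have to raise "snapshot too
     * old" (i.e. what SetOldSnapshotThresholdTimestamp() does today).
     */
    void AdvanceOldSnapshotThresholdTimestamp(TimestampTz ts,
                                              TransactionId limit_xid);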
Unfortunately I think that basically requires changing
HeapTupleSatisfiesVacuum's signature, to take a more complex parameter
than OldestXmin (to take InvisibleToEveryoneState *), which quickly
increases the size of the patch.
I'm currently doing that and seeing how the result makes me feel about
the patch.
Alternatively we also can just be less efficient and call
GetOldestXmin() more aggressively when old_snapshot_threshold is
set. That'd be easier to implement - but seems like an ugly gotcha.
Greetings,
Andres Freund
On Sun, Mar 29, 2020 at 4:40 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Mar 1, 2020 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:
The workload is a pgbench readonly, with pgbench -M prepared -c $conns
-j $conns -S -n for each client count. This is on a machine with 2
Intel(R) Xeon(R) Platinum 8168, but virtualized.

conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371

I think these are pretty nice results.
This scalability improvement is clearly very significant. There is
little question that this is a strategically important enhancement for
the Postgres project in general. I hope that you will ultimately be
able to commit the patchset before feature freeze.
+1, these are really very cool results.

Although this patchset is expected to be clearly a big win on the majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
clearly a big win on majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.
Definitely. Which workloads are you thinking of? I can think of these affected facets: snapshot speed, commit speed with writes, connection establishment, prepared transaction speed. All in the small and large connection count cases.
I did measurements on all of those but prepared xacts, fwiw. That definitely needs to be measured, due to the locking changes around procarrayadd/remove.
I don't think regressions besides perhaps 2pc are likely - there's nothing really getting more expensive but procarray add/remove.
Andres
Regards,
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sun, Mar 29, 2020 at 11:50:10AM -0700, Andres Freund wrote:
Hi,
On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:clearly a big win on majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.

Definitely. Which workloads are you thinking of? I can think of those
affected facets: snapshot speed, commit speed with writes, connection
establishment, prepared transaction speed. All in the small and large
connection count cases.

I did measurements on all of those but prepared xacts, fwiw. That
definitely needs to be measured, due to the locking changes around
procarrayaddd/remove.

I don't think regressions besides perhaps 2pc are likely - there's
nothing really getting more expensive but procarray add/remove.
If I get some instructions on what tests to do, I can run a bunch of tests
on my machines (not the largest boxes, but at least something). I don't
have the bandwidth to come up with tests on my own, at the moment.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Mar 29, 2020 at 9:50 PM Andres Freund <andres@anarazel.de> wrote:
On March 29, 2020 11:24:32 AM PDT, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
clearly a big win on majority
of workloads, I think we still need to investigate different workloads
on different hardware to ensure there is no regression.Definitely. Which workloads are you thinking of? I can think of those affected facets: snapshot speed, commit speed with writes, connection establishment, prepared transaction speed. All in the small and large connection count cases.
The following pgbench scripts come to my mind first:
1) SELECT txid_current(); (artificial but good for checking corner case)
2) Single insert statement (as example of very short transaction)
3) Plain pgbench read-write (you already did it for sure)
4) pgbench read-write script with an increased number of SELECTs. Repeat
the select from pgbench_accounts, say, 10 times with different aids.
5) 10% pgbench read-write, 90% of pgbench read-only
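To make a couple of these concrete, here are rough sketches (file names,
constants and the exact weights are my own assumptions, nothing agreed in
this thread). Case 2 as a custom script, insert.sql, with tid/bid left as
constants for simplicity:

\set aid random(1, 100000 * :scale)
\set delta random(-5000, 5000)
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (1, 1, :aid, :delta, CURRENT_TIMESTAMP);

Case 1 is just a one-line script containing SELECT txid_current();. Both
can be run like the builtin scripts, e.g.
pgbench -n -M prepared -c $i -j $i -T 60 -f insert.sql postgres.
The mixed case 5 can use script weights with the builtin scripts:

pgbench -n -M prepared -c $i -j $i -T 60 -b tpcb-like@1 -b select-only@9 postgres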
I did measurements on all of those but prepared xacts, fwiw
Great, it would be nice to see the results in the thread.
That definitely needs to be measured, due to the locking changes around ProcArrayAdd()/Remove().
I don't think regressions besides perhaps 2pc are likely - there's nothing that really gets more expensive except ProcArray add/remove.
I agree that ProcArrayAdd()/Remove() should be the first subject of
investigation, but other cases should be checked as well IMHO.
Regarding 2pc, the following scenarios come to my mind (a sketch of the
first one follows the list):
1) pgbench read-write modified so that every transaction is prepared
first, then commit prepared.
2) 10% of 2pc pgbench read-write, 90% normal pgbench read-write
3) 10% of 2pc pgbench read-write, 90% normal pgbench read-only
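For the first one, a rough sketch of what the custom script could look like
(my own guesses: the gid relies on :client_id being substituted inside the
quoted literal, which I have not verified; it presumably has to run in
simple query mode rather than -M prepared, since PREPARE TRANSACTION can't
take a parameter; and max_prepared_transactions has to be at least the
client count):

\set aid random(1, 100000 * :scale)
\set bid random(1, :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
PREPARE TRANSACTION 'pgb_:client_id';
COMMIT PREPARED 'pgb_:client_id';

The mixes in 2) and 3) could then again use script weights, e.g.
-f twopc.sql@1 -b tpcb-like@9 or -f twopc.sql@1 -b select-only@9.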
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
I'm still fighting with snapshot_too_old. The feature is just badly
undertested, underdocumented, and there are lots of other oddities. I've
now spent about as much time on that feature as on the whole rest of
the patchset.
As an example of the under-documentation, here's a definitely non-trivial
block of code without a single comment explaining what it's doing.
if (oldSnapshotControl->count_used > 0 &&
    ts >= oldSnapshotControl->head_timestamp)
{
    int     offset;

    offset = ((ts - oldSnapshotControl->head_timestamp)
              / USECS_PER_MINUTE);
    if (offset > oldSnapshotControl->count_used - 1)
        offset = oldSnapshotControl->count_used - 1;
    offset = (oldSnapshotControl->head_offset + offset)
        % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
    xlimit = oldSnapshotControl->xid_by_minute[offset];

    if (NormalTransactionIdFollows(xlimit, recentXmin))
        SetOldSnapshotThresholdTimestamp(ts, xlimit);
}

LWLockRelease(OldSnapshotTimeMapLock);
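As far as I can tell - and this is my reconstruction, nothing in the file
says so - xid_by_minute[] is a circular per-minute map whose oldest entry
lives at head_offset and corresponds to head_timestamp, and the block maps
the snapshot timestamp to the bucket covering it:

    bucket = Min((ts - head_timestamp) / USECS_PER_MINUTE, count_used - 1);
    index  = (head_offset + bucket) % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
    xlimit = xid_by_minute[index];

and then uses that xid as the limit, if it is newer than recentXmin.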
Also, SetOldSnapshotThresholdTimestamp() acquires a separate spinlock -
not great to call that with the lwlock held.
Then there's this comment:
/*
 * Failsafe protection against vacuuming work of active transaction.
 *
 * This is not an assertion because we avoid the spinlock for
 * performance, leaving open the possibility that xlimit could advance
 * and be more current; but it seems prudent to apply this limit.  It
 * might make pruning a tiny bit less aggressive than it could be, but
 * protects against data loss bugs.
 */
if (TransactionIdIsNormal(latest_xmin)
    && TransactionIdPrecedes(latest_xmin, xlimit))
    xlimit = latest_xmin;

if (NormalTransactionIdFollows(xlimit, recentXmin))
    return xlimit;
So this is not using a lock, so the values aren't accurate, but it avoids
data loss bugs? I also don't know which spinlock is avoided on the path
here as mentioned - the acquisition is unconditional.
But more importantly - if this is about avoiding data loss bugs, how on
earth is it ok that we don't go through these checks in the
old_snapshot_threshold == 0 path?
/*
 * Zero threshold always overrides to latest xmin, if valid.  Without
 * some heuristic it will find its own snapshot too old on, for
 * example, a simple UPDATE -- which would make it useless for most
 * testing, but there is no principled way to ensure that it doesn't
 * fail in this way.  Use a five-second delay to try to get useful
 * testing behavior, but this may need adjustment.
 */
if (old_snapshot_threshold == 0)
{
    if (TransactionIdPrecedes(latest_xmin, MyProc->xmin)
        && TransactionIdFollows(latest_xmin, xlimit))
        xlimit = latest_xmin;

    ts -= 5 * USECS_PER_SEC;
    SetOldSnapshotThresholdTimestamp(ts, xlimit);
    return xlimit;
}
This feature looks like it was put together by applying force until
something gave, and then stopping just there.
Greetings,
Andres Freund
Hi,
On 2020-03-31 13:04:38 -0700, Andres Freund wrote:
I'm still fighting with snapshot_too_old. The feature is just badly
undertested, underdocumented, and there are lots of other oddities. I've
now spent about as much time on that feature as on the whole rest of
the patchset.
To expand on this being under-tested: The whole time mapping
infrastructure is not tested, because all of that is bypassed when
old_snapshot_threshold = 0. And old_snapshot_threshold = 0 basically
only exists for testing. The largest part of the complexity of this
feature is in TransactionIdLimitedForOldSnapshots() and
MaintainOldSnapshotTimeMapping() - and none of the complexity is tested
due to the tests running with old_snapshot_threshold = 0.
So we have test-only infrastructure that doesn't allow us to actually test
the feature.
And the tests that we do have don't have a single comment explaining
what the expected results are. Except for the newer
sto_using_hash_index.spec, they just run all permutations. I don't know
how those tests actually help, since it's not clear why any of the
results are the way they are, which of them are just the results of
bugs, or which are not affected by s_t_o at all.
Greetings,
Andres Freund
On Sun, 1 Mar 2020 at 21:47, Andres Freund <andres@anarazel.de> wrote:
On 2020-03-01 00:36:01 -0800, Andres Freund wrote:
conns tps master tps pgxact-split
1 26842.492845 26524.194821
10 246923.158682 249224.782661
50 695956.539704 709833.746374
100 1054727.043139 1903616.306028
200 964795.282957 1949200.338012
300 906029.377539 1927881.231478
400 845696.690912 1911065.369776
500 812295.222497 1926237.255856
600 888030.104213 1903047.236273
700 866896.532490 1886537.202142
800 863407.341506 1883768.592610
900 871386.608563 1874638.012128
1000 887668.277133 1876402.391502
1500 860051.361395 1815103.564241
2000 890900.098657 1775435.271018
3000 874184.980039 1653953.817997
4000 845023.080703 1582582.316043
5000 817100.195728 1512260.802371
I think these are pretty nice results.
FWIW, I took this for a spin on an AMD 3990x:
# setup
pgbench -i postgres
#benchmark
#!/bin/bash
for i in 1 10 50 100 200 300 400 500 600 700 800 900 1000 1500 2000 \
         3000 4000 5000;
do
    echo Testing with $i connections >> bench.log
    pgbench2 -M prepared -c $i -j $i -S -n -T 60 postgres >> bench.log
done
pgbench2 is your patched version of pgbench. I got some pretty strange
results with the unpatched version: up to about 50 million tps for the
number excluding connections establishing, which seems pretty far-fetched.
connections Unpatched (tps) Patched (tps)
1 49062.24413 49834.64983
10 428673.1027 453290.5985
50 1552413.084 1849233.821
100 2039675.027 2261437.1
200 3139648.845 3632008.991
300 3091248.316 3597748.942
400 3056453.5 3567888.293
500 3019571.47 3574009.053
600 2991052.393 3537518.903
700 2952484.763 3553252.603
800 2910976.875 3539404.865
900 2873929.989 3514353.776
1000 2846859.499 3490006.026
1500 2540003.038 3370093.934
2000 2361799.107 3197556.738
3000 2056973.778 2949740.692
4000 1751418.117 2627174.81
5000 1464786.461 2334586.042
Attached as a graph as well.
Likewise.
David
Attachments:
excluding_connections_est_3990x.png (image/png)