scalability bottlenecks with (many) partitions (and more)
Hi,
I happened to investigate a query involving a partitioned table, which
led me to a couple of bottlenecks severely affecting queries dealing
with multiple partitions (or relations in general). After a while I came
up with three WIP patches that improve the behavior by an order of
magnitude, and not just in some extreme cases.
Consider a partitioned pgbench with 100 partitions, say:
pgbench -i -s 100 --partitions 100 testdb
but let's modify the pgbench_accounts a little bit:
ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
UPDATE pgbench_accounts SET aid_parent = aid;
CREATE INDEX ON pgbench_accounts(aid_parent);
VACUUM FULL pgbench_accounts;
which simply adds an "aid_parent" column that is not part of the partition
key. And now let's run a query
SELECT * FROM pgbench_accounts pa JOIN pgbench_branches pb
ON (pa.bid = pb.bid) WHERE pa.aid_parent = :aid
so pretty much the regular "pgbench -S" query, except on a column that
does not allow partition elimination. Now, the plan looks like this:
QUERY PLAN
----------------------------------------------------------------------
Hash Join (cost=1.52..34.41 rows=10 width=465)
Hash Cond: (pa.bid = pb.bid)
-> Append (cost=0.29..33.15 rows=10 width=101)
-> Index Scan using pgbench_accounts_1_aid_parent_idx on
pgbench_accounts_1 pa_1 (cost=0.29..3.31 rows=1 width=101)
Index Cond: (aid_parent = 3489734)
-> Index Scan using pgbench_accounts_2_aid_parent_idx on
pgbench_accounts_2 pa_2 (cost=0.29..3.31 rows=1 width=101)
Index Cond: (aid_parent = 3489734)
-> Index Scan using pgbench_accounts_3_aid_parent_idx on
pgbench_accounts_3 pa_3 (cost=0.29..3.31 rows=1 width=101)
Index Cond: (aid_parent = 3489734)
-> Index Scan using pgbench_accounts_4_aid_parent_idx on
pgbench_accounts_4 pa_4 (cost=0.29..3.31 rows=1 width=101)
Index Cond: (aid_parent = 3489734)
-> ...
-> Hash (cost=1.10..1.10 rows=10 width=364)
-> Seq Scan on pgbench_branches pb (cost=0.00..1.10 rows=10
width=364)
So yeah, scanning all 100 partitions. Not great, but no partitioning
scheme is perfect for all queries. Anyway, let's see how this works on a
big AMD EPYC machine with 96/192 cores - with "-M simple" we get:
parts 1 8 16 32 64 96 160 224
-----------------------------------------------------------------------
0 13877 105732 210890 410452 709509 844683 1050658 1163026
100 653 3957 7120 12022 12707 11813 10349 9633
1000 20 142 270 474 757 808 567 427
These are transactions per second, for different numbers of clients
(numbers in the header). With -M prepared the story doesn't change - the
numbers are higher, but the overall behavior is pretty much the same.
Firstly, with no partitions (first row), the throughput grows by roughly
13k tps per client initially, then it gradually levels off. But it keeps
growing all the time.
But with 100 or 1000 partitions, the throughput peaks and then starts
dropping. Moreover, the throughput with 100 or 1000 partitions is just
a tiny fraction of the non-partitioned value. The difference is roughly
equal to the number of partitions - for example with 96 clients, the
difference between 0 and 1000 partitions is 844683/808 = 1045.
I could demonstrate the same behavior with fewer partitions - e.g. with
10 partitions you get ~10x difference, and so on.
Another thing I'd mention is that this is not just about partitioning.
Imagine a star schema with a fact table and dimensions - you'll get the
same behavior depending on the number of dimensions you need to join
with. With "-M simple" you may get this, for example:
dims 1 8 16 32 64 96 160 224
----------------------------------------------------------------------
1 11737 92925 183678 361497 636598 768956 958679 1042799
10 462 3558 7086 13889 25367 29503 25353 24030
100 4 31 61 122 231 292 292 288
So, similar story - significant slowdown as we're adding dimensions.
Now, what could be causing this? Clearly, there's a bottleneck of some
kind, and we're hitting it. Some of this may be simply due to execution
doing more stuff (more index scans, more initialization, ...) but maybe
not - one of the reasons why I started looking into this was that we were
not using all the CPU even at small scales - the CPU was maybe 60% utilized.
So I started poking at things. The first thing that I thought about was
locking, obviously. That's consistent with the limited CPU utilization
(waiting on a lock = not running), and it's somewhat expected when using
many partitions - we need to lock all of them, and if we have 100 or
1000 of them, that's potentially a lot of locks.
From past experiments I knew about two places where such a bottleneck
could be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to
give it a try, increase these values and see what happens.
For NUM_LOCK_PARTITIONS this is pretty simple (see the 0001 patch). The
shared lock table is split into 16 partitions by default - it's quite
possible that on a machine with many cores and/or many partitions we can
easily hit this. So I bumped this 4x, to 64 partitions.
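For context, the shared lock table is protected by one LWLock per
partition, and the lock tag's hashcode picks the partition - roughly like
this (paraphrasing the existing macros in storage/lock.h):

    #define LockHashPartition(hashcode) \
        ((hashcode) % NUM_LOCK_PARTITIONS)

    #define LockHashPartitionLock(hashcode) \
        (&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + \
                          LockHashPartition(hashcode)].lock)

With only 16 partitions, and every query locking 100+ relations across
many backends, collisions on the same partition LWLock become frequent;
going to 64 partitions simply spreads those collisions 4x thinner.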
For fast-path locking the changes are more complicated (see 0002). We
allow keeping 16 relation locks right in PGPROC, and only when those fill
up do we promote the locks to the actual lock table. But with enough
partitions we're guaranteed to fill these 16 slots, of course. And
increasing the number of slots is not simple - firstly, the information
is split between an array of 16 OIDs and a uint64 serving as a bitmap.
Increasing the size of the OID array is simple, but it's harder for the
auxiliary bitmap. There are more problems, too - with more OIDs a simple
linear search won't do, but a plain hash table is not a good idea either,
because of poor locality and the need to delete stuff ...
What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
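To make that concrete, here's a trimmed-down sketch of the lookup in the
0002 patch (the function name is made up, the constants match the patch;
the real code also checks the lock-mode bits stored in the per-group
fpLockBits bitmap):

    #define FP_LOCK_GROUPS_PER_BACKEND  64
    #define FP_LOCK_SLOTS_PER_GROUP     16

    /* map a relation OID to one of the 64 groups */
    #define FAST_PATH_LOCK_REL_GROUP(relid) \
        (((uint64) (relid) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

    static int
    fastpath_find_slot(PGPROC *proc, Oid relid)
    {
        int     group = FAST_PATH_LOCK_REL_GROUP(relid);

        /* linear search, but only within the 16 slots of this group */
        for (int i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
        {
            int     slot = group * FP_LOCK_SLOTS_PER_GROUP + i;

            if (proc->fpRelId[slot] == relid)
                return slot;
        }

        return -1;      /* not held via fast-path */
    }

So a lookup still touches at most 16 slots, it just picks which 16 based
on the relation OID.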
Unfortunately, for the pgbench join this does not make much difference.
But for the "star join" (with -M prepared) it does this:
1 8 16 32 64 96 160 224
------------------------------------------------------------------------
master 21610 137450 247541 300902 270932 229692 191454 189233
patched 21664 151695 301451 594615 1036424 1211716 1480953 1656203
speedup 1.0 1.1 1.2 2.0 3.8 5.3 7.7 8.8
That's a pretty nice speedup, I think.
However, why doesn't the partitioned join improve (at least not by very
much)? Well, the perf profile says stuff like this:
9.16% 0.77% postgres [kernel.kallsyms] [k] asm_exc_page_fault
|
--8.39%--asm_exc_page_fault
|
--7.52%--exc_page_fault
|
--7.13%--do_user_addr_fault
|
--6.64%--handle_mm_fault
|
--6.29%--__handle_mm_fault
|
|--2.17%--__mem_cgroup_charge
| |
| |--1.25%--charge_memcg
| | |
| | --0.57%-- ...
| |
| --0.67%-- ...
|
|--2.04%--vma_alloc_folio
After investigating this for a bit, I came to the conclusion this may be
some sort of scalability problem in glibc/malloc. I decided to check whether
the "memory pool" patch (which I've mentioned in the memory limit thread
as an alternative way to introduce backend-level accounting/limits) could
serve as a backend-level malloc cache, and how well that would work. So I
cleaned up the PoC patch I already had (see 0003), and gave it a try.
And with both patches applied, the results for the partitioned join with
100 partitions look like this:
-M simple
1 8 16 32 64 96 160 224
------------------------------------------------------------------------
master 653 3957 7120 12022 12707 11813 10349 9633
both patches 954 7356 14580 28259 51552 65278 70607 69598
speedup 1.5 1.9 2.0 2.4 4.1 5.5 6.8 7.2
-M prepared
1 8 16 32 64 96 160 224
------------------------------------------------------------------------
master 1639 8273 14138 14746 13446 14001 11129 10136
both patches 4792 30102 62208 122157 220984 267763 315632 323567
speedup 2.9 3.6 4.4 8.3 16.4 19.1 28.4 31.9
That's pretty nice, I think. And I've seen many such improvements, it's
not a cherry-picked example. For the star join, the improvements are
very similar.
I'm attaching PDF files with a table visualizing results for these two
benchmarks - there are results for different numbers of partitions/scales,
and different builds (master, one or both of the patches). There's also
a comparison to master, with color scale "red = slower, green = faster"
(but there's no red anywhere, not even for low client counts).
It's also interesting that with just the 0003 patch applied, the change
is much smaller. It's as if the two bottlenecks (locking and malloc) are
in balance - if you only address one, you don't get much. But if you
address both, it flies.
FWIW where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt
it's the only such allocation.
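For illustration, the relevant decision in AllocSetAlloc() looks roughly
like this (a simplified sketch with made-up helper names, not the actual
aset.c code):

    static void *
    alloc_sketch(AllocSet set, Size size)
    {
        if (size > set->allocChunkLimit)
        {
            /*
             * Oversized chunk: it gets its own dedicated block, malloc-ed
             * right away and free-d as soon as it's pfree-d or the context
             * is reset. It never enters the context freelists, so every
             * btbeginscan() pays for a fresh malloc().
             */
            return alloc_dedicated_block(set, size);    /* made-up helper */
        }

        /* normal chunk: carved out of a (possibly reused) context block */
        return alloc_from_block(set, size);             /* made-up helper */
    }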
I don't want to get into too much detail about the memory pool, but I
think it's something we should consider doing - I'm sure there's stuff
to improve, but caching the malloc-ed blocks can clearly be very beneficial. The
basic idea is to have a cache that is "adaptive" (i.e. adjusts to
caching blocks of sizes needed by the workload) but also cheap. The
patch is PoC/WIP and needs more work, but I think it works quite well.
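The core of it is just size-class rounding plus a per-class freelist of
returned blocks. The rounding mirrors MemoryPoolEntrySize() in the attached
0003 patch (this is a minimal sketch, not the patch itself):

    #define MEMPOOL_MIN_BLOCK   1024L               /* 1kB */
    #define MEMPOOL_MAX_BLOCK   (8 * 1024L * 1024L) /* 8MB */

    static Size
    mempool_round_size(Size request)
    {
        Size    size = MEMPOOL_MIN_BLOCK;

        /* too large to cache - pass through to malloc() unchanged */
        if (request > MEMPOOL_MAX_BLOCK)
            return request;

        /* round up to the next power-of-two class: 1kB, 2kB, ..., 8MB */
        while (size < request)
            size <<= 1;

        return size;
    }

So the ~27kB BTScanOpaqueData allocation becomes a 32kB block, and once
freed it can be handed out again to any later request in the 32kB class.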
If anyone wants to take a look or have a chat at FOSDEM, for example,
I'm available.
FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.
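(If someone wants to experiment with the glibc side, the knobs I'd start
with are the mallopt() parameters / the corresponding MALLOC_* environment
variables. This is just a guess at what might reduce the page-fault churn,
I haven't verified it helps here:)

    #include <malloc.h>

    static void
    tune_glibc_malloc(void)
    {
        /* keep freed memory around instead of trimming it eagerly */
        mallopt(M_TRIM_THRESHOLD, 32 * 1024 * 1024);
        mallopt(M_TOP_PAD, 4 * 1024 * 1024);

        /* serve blocks below 32MB from the heap rather than mmap() */
        mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);
    }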
FWIW there's another bottleneck people may not realize, and that's the
number of file descriptors. Once you get to >1000 relations, you can
easily get into a situation like this:
54.18% 0.48% postgres [kernel.kallsyms] [k]
entry_SYSCALL_64_after_hwframe
|
--53.70%--entry_SYSCALL_64_after_hwframe
|
--53.03%--do_syscall_64
|
|--28.29%--__x64_sys_openat
| |
| --28.14%--do_sys_openat2
| |
| |--23.14%--do_filp_open
| | |
| | --22.72%--path_openat
That's pretty bad, it means we're closing/opening file descriptors like
crazy, because every query needs the files. If I increase the number of
file descriptors (both in ulimit and max_files_per_process) to prevent
this thrashing, I can increase the throughput ~5x. Of course, this is not
a bottleneck that we can "fix" in code, it's simply a consequence of not
having enough file descriptors etc. But I wonder if we might make it
easier to monitor this, e.g. by tracking the fd cache hit ratio, or
something like that ...
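To illustrate what I mean, something as simple as a pair of counters around
the place in fd.c where a closed virtual FD has to be reopened would be
enough to compute a hit ratio (a purely hypothetical sketch, names made up;
the real FileAccess() also maintains the LRU ring):

    static uint64   vfd_access_count = 0;   /* all VFD accesses */
    static uint64   vfd_reopen_count = 0;   /* fd cache "misses" */

    static int
    FileAccess_sketch(File file)
    {
        vfd_access_count++;

        if (FileIsNotOpen(file))
        {
            /* kernel fd was closed to stay under max_files_per_process */
            vfd_reopen_count++;
            return LruInsert(file);
        }

        return 0;
    }

Exposing those two counters somewhere (a view, or log output) would make
this kind of fd thrashing much easier to spot.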
There's a more complete set of benchmarking scripts and results for
these and other tests, in various formats (PDF, ODS, ...) at
https://github.com/tvondra/scalability-patches
There are results from multiple machines - not just the big EPYC machine,
but also smaller Intel machines (4C and 16C), and even two rpi5s (yes, it
helps even on rpi5, quite a bit).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
v240118-0001-Increase-NUM_LOCK_PARTITIONS-to-64.patch
From 98a361f95c2c4969488c2286f8aa560b45f8c0a8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 8 Jan 2024 00:32:22 +0100
Subject: [PATCH v240118 2/4] Increase NUM_LOCK_PARTITIONS to 64
The shared lock table is split into 16 partitions by default, which may be
a bottleneck on systems with many cores - and those are becoming more and
more common. This increases the number of partitions to 64, to reduce the
contention.
This may affect cases that need to process the whole table and lock all
the partitions. But there are not too many of those cases, especially in
performance-sensitive paths, and the increase from 16 to 64 is not
significant enough to really matter.
---
src/include/storage/lwlock.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 167ae342088..2a10efce224 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -95,7 +95,7 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
#define NUM_BUFFER_PARTITIONS 128
/* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS 4
+#define LOG2_NUM_LOCK_PARTITIONS 6
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
/* Number of partitions the shared predicate lock tables are divided into */
--
2.43.0
v240118-0002-Increase-the-number-of-fastpath-locks.patch
From ef103606034b6bc883414a73b12f3134ebfe460b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 7 Jan 2024 22:48:22 +0100
Subject: [PATCH v240118 1/4] Increase the number of fastpath locks
The 16 fastpath locks as defined by FP_LOCK_SLOTS_PER_BACKEND may be a
bottleneck with many partitions (or relations in general - e.g. large
joins). This applies especially to many-core systems, but not only.
This increases the number of fastpath slots per backend to 1024 (from 16).
This is implemented as a hash table of 64 "groups", where each group is
essentially the current array of 16 fastpath slots. A relation is mapped
to one of the 64 groups by hash(relid), and within that group we follow
the existing lookup / eviction logic.
This provides better locality than open-addressing hash tables.
It's not clear if 1024 is the right trade-off. It does make the PGPROC
entry larger, but it's already quite large and no regressions were
observed during benchmarking.
---
src/backend/storage/lmgr/lock.c | 134 ++++++++++++++++++++++++++------
src/include/storage/proc.h | 9 ++-
2 files changed, 116 insertions(+), 27 deletions(-)
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c70a1adb9ad..3461626eaff 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -169,7 +169,8 @@ typedef struct TwoPhaseLockRecord
* our locks to the primary lock table, but it can never be lower than the
* real value, since only we can acquire locks on our own behalf.
*/
-static int FastPathLocalUseCount = 0;
+static bool FastPathLocalUseInitialized = false;
+static int FastPathLocalUseCounts[FP_LOCK_GROUPS_PER_BACKEND];
/*
* Flag to indicate if the relation extension lock is held by this backend.
@@ -189,20 +190,23 @@ static bool IsRelationExtensionLockHeld PG_USED_FOR_ASSERTS_ONLY = false;
/* Macros for manipulating proc->fpLockBits */
#define FAST_PATH_BITS_PER_SLOT 3
#define FAST_PATH_LOCKNUMBER_OFFSET 1
+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)
+#define FAST_PATH_LOCK_INDEX(n) ((n) % FP_LOCK_SLOTS_PER_GROUP)
+#define FAST_PATH_LOCK_GROUP(n) ((n) / FP_LOCK_SLOTS_PER_GROUP)
#define FAST_PATH_MASK ((1 << FAST_PATH_BITS_PER_SLOT) - 1)
#define FAST_PATH_GET_BITS(proc, n) \
- (((proc)->fpLockBits >> (FAST_PATH_BITS_PER_SLOT * n)) & FAST_PATH_MASK)
+ (((proc)->fpLockBits[(n)/16] >> (FAST_PATH_BITS_PER_SLOT * FAST_PATH_LOCK_INDEX(n))) & FAST_PATH_MASK)
#define FAST_PATH_BIT_POSITION(n, l) \
(AssertMacro((l) >= FAST_PATH_LOCKNUMBER_OFFSET), \
AssertMacro((l) < FAST_PATH_BITS_PER_SLOT+FAST_PATH_LOCKNUMBER_OFFSET), \
- AssertMacro((n) < FP_LOCK_SLOTS_PER_BACKEND), \
- ((l) - FAST_PATH_LOCKNUMBER_OFFSET + FAST_PATH_BITS_PER_SLOT * (n)))
+ AssertMacro((n) < FP_LOCKS_PER_BACKEND), \
+ ((l) - FAST_PATH_LOCKNUMBER_OFFSET + FAST_PATH_BITS_PER_SLOT * (FAST_PATH_LOCK_INDEX(n))))
#define FAST_PATH_SET_LOCKMODE(proc, n, l) \
- (proc)->fpLockBits |= UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)
+ (proc)->fpLockBits[FAST_PATH_LOCK_GROUP(n)] |= UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)
#define FAST_PATH_CLEAR_LOCKMODE(proc, n, l) \
- (proc)->fpLockBits &= ~(UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l))
+ (proc)->fpLockBits[FAST_PATH_LOCK_GROUP(n)] &= ~(UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l))
#define FAST_PATH_CHECK_LOCKMODE(proc, n, l) \
- ((proc)->fpLockBits & (UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)))
+ ((proc)->fpLockBits[FAST_PATH_LOCK_GROUP(n)] & (UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)))
/*
* The fast-path lock mechanism is concerned only with relation locks on
@@ -895,6 +899,12 @@ LockAcquireExtended(const LOCKTAG *locktag,
log_lock = true;
}
+ if (!FastPathLocalUseInitialized)
+ {
+ FastPathLocalUseInitialized = true;
+ memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));
+ }
+
/*
* Attempt to take lock via fast path, if eligible. But if we remember
* having filled up the fast path array, we don't attempt to make any
@@ -906,7 +916,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
* for now we don't worry about that case either.
*/
if (EligibleForRelationFastPath(locktag, lockmode) &&
- FastPathLocalUseCount < FP_LOCK_SLOTS_PER_BACKEND)
+ FastPathLocalUseCounts[FAST_PATH_LOCK_REL_GROUP(locktag->locktag_field2)] < FP_LOCK_SLOTS_PER_GROUP)
{
uint32 fasthashcode = FastPathStrongLockHashPartition(hashcode);
bool acquired;
@@ -1932,6 +1942,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
PROCLOCK *proclock;
LWLock *partitionLock;
bool wakeupNeeded;
+ int group;
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
@@ -2025,9 +2036,19 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
*/
locallock->lockCleared = false;
+ if (!FastPathLocalUseInitialized)
+ {
+ FastPathLocalUseInitialized = true;
+ memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));
+ }
+
+ group = FAST_PATH_LOCK_REL_GROUP(locktag->locktag_field2);
+
+ Assert(group >= 0 && group < FP_LOCK_GROUPS_PER_BACKEND);
+
/* Attempt fast release of any lock eligible for the fast path. */
if (EligibleForRelationFastPath(locktag, lockmode) &&
- FastPathLocalUseCount > 0)
+ FastPathLocalUseCounts[group] > 0)
{
bool released;
@@ -2595,12 +2616,27 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
static bool
FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
+ uint32 i;
uint32 f;
- uint32 unused_slot = FP_LOCK_SLOTS_PER_BACKEND;
+ uint32 unused_slot = FP_LOCKS_PER_BACKEND;
+
+ int group = FAST_PATH_LOCK_REL_GROUP(relid);
+
+ Assert(group >= 0 && group < FP_LOCK_GROUPS_PER_BACKEND);
+
+ if (!FastPathLocalUseInitialized)
+ {
+ FastPathLocalUseInitialized = true;
+ memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));
+ }
/* Scan for existing entry for this relid, remembering empty slot. */
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
{
+ f = group * FP_LOCK_SLOTS_PER_GROUP + i;
+
+ Assert(f >= 0 && f < FP_LOCKS_PER_BACKEND);
+
if (FAST_PATH_GET_BITS(MyProc, f) == 0)
unused_slot = f;
else if (MyProc->fpRelId[f] == relid)
@@ -2612,11 +2648,11 @@ FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
}
/* If no existing entry, use any empty slot. */
- if (unused_slot < FP_LOCK_SLOTS_PER_BACKEND)
+ if (unused_slot < FP_LOCKS_PER_BACKEND)
{
MyProc->fpRelId[unused_slot] = relid;
FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
- ++FastPathLocalUseCount;
+ ++FastPathLocalUseCounts[group];
return true;
}
@@ -2632,12 +2668,27 @@ FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
static bool
FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
+ uint32 i;
uint32 f;
bool result = false;
- FastPathLocalUseCount = 0;
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ int group = FAST_PATH_LOCK_REL_GROUP(relid);
+
+ Assert(group >= 0 && group < FP_LOCK_GROUPS_PER_BACKEND);
+
+ if (!FastPathLocalUseInitialized)
{
+ FastPathLocalUseInitialized = true;
+ memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));
+ }
+
+ FastPathLocalUseCounts[group] = 0;
+ for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
+ {
+ f = group * FP_LOCK_SLOTS_PER_GROUP + i;
+
+ Assert(f >= 0 && f < FP_LOCKS_PER_BACKEND);
+
if (MyProc->fpRelId[f] == relid
&& FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode))
{
@@ -2647,7 +2698,7 @@ FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode)
/* we continue iterating so as to update FastPathLocalUseCount */
}
if (FAST_PATH_GET_BITS(MyProc, f) != 0)
- ++FastPathLocalUseCount;
+ ++FastPathLocalUseCounts[group];
}
return result;
}
@@ -2665,7 +2716,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
{
LWLock *partitionLock = LockHashPartitionLock(hashcode);
Oid relid = locktag->locktag_field2;
- uint32 i;
+ uint32 i, j, group;
/*
* Every PGPROC that can potentially hold a fast-path lock is present in
@@ -2701,10 +2752,18 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
continue;
}
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ group = FAST_PATH_LOCK_REL_GROUP(relid);
+
+ Assert(group >= 0 && group < FP_LOCK_GROUPS_PER_BACKEND);
+
+ for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++)
{
uint32 lockmode;
+ f = group * FP_LOCK_SLOTS_PER_GROUP + j;
+
+ Assert(f >= 0 && f < FP_LOCKS_PER_BACKEND);
+
/* Look for an allocated slot matching the given relid. */
if (relid != proc->fpRelId[f] || FAST_PATH_GET_BITS(proc, f) == 0)
continue;
@@ -2735,6 +2794,7 @@ FastPathTransferRelationLocks(LockMethod lockMethodTable, const LOCKTAG *locktag
/* No need to examine remaining slots. */
break;
}
+
LWLockRelease(&proc->fpInfoLock);
}
return true;
@@ -2755,14 +2815,28 @@ FastPathGetRelationLockEntry(LOCALLOCK *locallock)
PROCLOCK *proclock = NULL;
LWLock *partitionLock = LockHashPartitionLock(locallock->hashcode);
Oid relid = locktag->locktag_field2;
- uint32 f;
+ uint32 f, i;
+
+ int group = FAST_PATH_LOCK_REL_GROUP(relid);
+
+ Assert(group >= 0 && group < FP_LOCK_GROUPS_PER_BACKEND);
+
+ if (!FastPathLocalUseInitialized)
+ {
+ FastPathLocalUseInitialized = true;
+ memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));
+ }
LWLockAcquire(&MyProc->fpInfoLock, LW_EXCLUSIVE);
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ for (i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
{
uint32 lockmode;
+ f = group * FP_LOCK_SLOTS_PER_GROUP + i;
+
+ Assert(f >= 0 && f < FP_LOCKS_PER_BACKEND);
+
/* Look for an allocated slot matching the given relid. */
if (relid != MyProc->fpRelId[f] || FAST_PATH_GET_BITS(MyProc, f) == 0)
continue;
@@ -2866,6 +2940,16 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
int count = 0;
int fast_count = 0;
+ int group = FAST_PATH_LOCK_REL_GROUP(locktag->locktag_field2);
+
+ Assert(group >= 0 && group < FP_LOCK_GROUPS_PER_BACKEND);
+
+ if (!FastPathLocalUseInitialized)
+ {
+ FastPathLocalUseInitialized = true;
+ memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));
+ }
+
if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
elog(ERROR, "unrecognized lock method: %d", lockmethodid);
lockMethodTable = LockMethods[lockmethodid];
@@ -2902,7 +2986,7 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
*/
if (ConflictsWithRelationFastPath(locktag, lockmode))
{
- int i;
+ int i, j;
Oid relid = locktag->locktag_field2;
VirtualTransactionId vxid;
@@ -2941,10 +3025,14 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *countp)
continue;
}
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ for (j = 0; j < FP_LOCK_SLOTS_PER_GROUP; j++)
{
uint32 lockmask;
+ f = group * FP_LOCK_SLOTS_PER_GROUP + j;
+
+ Assert(f >= 0 && f < FP_LOCKS_PER_BACKEND);
+
/* Look for an allocated slot matching the given relid. */
if (relid != proc->fpRelId[f])
continue;
@@ -3604,7 +3692,7 @@ GetLockStatusData(void)
LWLockAcquire(&proc->fpInfoLock, LW_SHARED);
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; ++f)
+ for (f = 0; f < FP_LOCKS_PER_BACKEND; ++f)
{
LockInstanceData *instance;
uint32 lockbits = FAST_PATH_GET_BITS(proc, f);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 4bc226e36cd..e5752db1faf 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -82,8 +82,9 @@ struct XidCache
* rather than the main lock table. This eases contention on the lock
* manager LWLocks. See storage/lmgr/README for additional details.
*/
-#define FP_LOCK_SLOTS_PER_BACKEND 16
-
+#define FP_LOCK_GROUPS_PER_BACKEND 64
+#define FP_LOCK_SLOTS_PER_GROUP 16 /* don't change */
+#define FP_LOCKS_PER_BACKEND (FP_LOCK_SLOTS_PER_GROUP * FP_LOCK_GROUPS_PER_BACKEND)
/*
* An invalid pgprocno. Must be larger than the maximum number of PGPROC
* structures we could possibly have. See comments for MAX_BACKENDS.
@@ -288,8 +289,8 @@ struct PGPROC
/* Lock manager data, recording fast-path locks taken by this backend. */
LWLock fpInfoLock; /* protects per-backend fast-path state */
- uint64 fpLockBits; /* lock modes held for each fast-path slot */
- Oid fpRelId[FP_LOCK_SLOTS_PER_BACKEND]; /* slots for rel oids */
+ uint64 fpLockBits[FP_LOCK_GROUPS_PER_BACKEND]; /* lock modes held for each fast-path slot */
+ Oid fpRelId[FP_LOCKS_PER_BACKEND]; /* slots for rel oids */
bool fpVXIDLock; /* are we holding a fast-path VXID lock? */
LocalTransactionId fpLocalTransactionId; /* lxid for fast-path VXID
* lock */
--
2.43.0
v240118-0003-Add-a-memory-pool-with-adaptive-rebalancing.patch
From 4a494f739e33fc170f3f4c35e82401389d3ae1ec Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Sun, 7 Jan 2024 20:40:44 +0100
Subject: [PATCH v240118 3/4] Add a memory pool with adaptive rebalancing
A memory pool handles memory allocations of blocks requested by memory
contexts, and serves as a global cache to reduce malloc overhead. The
pool uses similar doubling logic as memory contexts, allocating blocks
with size 1kB, 2kB, 4kB, ..., 8MB. This covers most requests, because
memory contexts use these block sizes.
Oversized chunks are matched to these sizes too. Memory contexts handle
them as a special case and allocate them separately (and do not cache
them), as it's not clear whether a chunk of the same size will be needed
again - and a cached chunk can't be freed / returned to the OS, so caching
it might just waste memory. But if the pool rounds them up to the regular
block sizes, it can still reuse them for some other block. And the memory
pool can actually free the memory, unlike memory contexts.
The memory pool is adaptive - we track the number of allocations needed
for different sizes, and adjust the capacity of the buckets accordingly.
This happens every ~25k allocations, which seems like a good trade-off.
The total amount of memory for the memory pool is not limited - it might
be (to some extent that was the initial motivation of the memory pool),
but by default there's only a "soft" limit of 128MB to restrict the size
of the cached blocks. If a backend needs less than 128MB of memory, the
difference (128MB - allocated) will be available for cached blocks. If
the backend allocates more than 128MB of memory, it won't fail, but the
cached blocks will be evicted/freed.
The 128MB limit is hardcoded, but it might be made a GUC if needed.
---
src/backend/access/nbtree/nbtree.c | 3 +
src/backend/utils/mmgr/aset.c | 26 +-
src/backend/utils/mmgr/mcxt.c | 1071 ++++++++++++++++++++++++++++
src/include/utils/memutils.h | 9 +
4 files changed, 1099 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 696d79c0852..6069d23cbfb 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -361,7 +361,10 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
BTScanPosInvalidate(so->currPos);
BTScanPosInvalidate(so->markPos);
if (scan->numberOfKeys > 0)
+ {
+ // elog(LOG, "btbeginscan alloc B %ld", scan->numberOfKeys * sizeof(ScanKeyData));
so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
+ }
else
so->keyData = NULL;
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 2f99fa9a2f6..ac7bf6dcadb 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -441,7 +441,7 @@ AllocSetContextCreateInternal(MemoryContext parent,
* Allocate the initial block. Unlike other aset.c blocks, it starts with
* the context header and its block header follows that.
*/
- set = (AllocSet) malloc(firstBlockSize);
+ set = (AllocSet) MemoryPoolAlloc(firstBlockSize);
if (set == NULL)
{
if (TopMemoryContext)
@@ -579,13 +579,15 @@ AllocSetReset(MemoryContext context)
}
else
{
+ Size size = block->endptr - ((char *) block);
+
/* Normal case, release the block */
context->mem_allocated -= block->endptr - ((char *) block);
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
- free(block);
+ MemoryPoolFree(block, size);
}
block = next;
}
@@ -649,7 +651,7 @@ AllocSetDelete(MemoryContext context)
freelist->num_free--;
/* All that remains is to free the header/initial block */
- free(oldset);
+ MemoryPoolFree(oldset, keepersize);
}
Assert(freelist->num_free == 0);
}
@@ -666,6 +668,7 @@ AllocSetDelete(MemoryContext context)
while (block != NULL)
{
AllocBlock next = block->next;
+ Size size = block->endptr - ((char *) block);
if (!IsKeeperBlock(set, block))
context->mem_allocated -= block->endptr - ((char *) block);
@@ -675,7 +678,9 @@ AllocSetDelete(MemoryContext context)
#endif
if (!IsKeeperBlock(set, block))
- free(block);
+ {
+ MemoryPoolFree(block, size);
+ }
block = next;
}
@@ -683,7 +688,7 @@ AllocSetDelete(MemoryContext context)
Assert(context->mem_allocated == keepersize);
/* Finally, free the context header, including the keeper block */
- free(set);
+ MemoryPoolFree(set, keepersize);
}
/*
@@ -725,7 +730,7 @@ AllocSetAlloc(MemoryContext context, Size size)
#endif
blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
- block = (AllocBlock) malloc(blksize);
+ block = (AllocBlock) MemoryPoolAlloc(blksize);
if (block == NULL)
return NULL;
@@ -925,7 +930,7 @@ AllocSetAlloc(MemoryContext context, Size size)
blksize <<= 1;
/* Try to allocate it */
- block = (AllocBlock) malloc(blksize);
+ block = (AllocBlock) MemoryPoolAlloc(blksize);
/*
* We could be asking for pretty big blocks here, so cope if malloc
@@ -936,7 +941,7 @@ AllocSetAlloc(MemoryContext context, Size size)
blksize >>= 1;
if (blksize < required_size)
break;
- block = (AllocBlock) malloc(blksize);
+ block = (AllocBlock) MemoryPoolAlloc(blksize);
}
if (block == NULL)
@@ -1011,6 +1016,7 @@ AllocSetFree(void *pointer)
{
/* Release single-chunk block. */
AllocBlock block = ExternalChunkGetBlock(chunk);
+ Size size = block->endptr - ((char *) block);
/*
* Try to verify that we have a sane block pointer: the block header
@@ -1044,7 +1050,7 @@ AllocSetFree(void *pointer)
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
- free(block);
+ MemoryPoolFree(block, size);
}
else
{
@@ -1160,7 +1166,7 @@ AllocSetRealloc(void *pointer, Size size)
blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
oldblksize = block->endptr - ((char *) block);
- block = (AllocBlock) realloc(block, blksize);
+ block = (AllocBlock) MemoryPoolRealloc(block, oldblksize, blksize);
if (block == NULL)
{
/* Disallow access to the chunk header. */
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 1336944084d..8364b46c875 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -1640,3 +1640,1074 @@ pchomp(const char *in)
n--;
return pnstrdup(in, n);
}
+
+/*
+ * Memory Pools
+ *
+ * Contexts may get memory either directly from the OS (libc) through malloc
+ * calls, but that has non-trivial overhead, depending on the allocation size
+ * and so on. And we tend to allocate fairly large amounts of memory, because
+ * contexts allocate blocks (starting with 1kB, quickly growing by doubling).
+ * A lot of hot paths also allocate pieces of memory exceeding the size limit
+ * and being allocated as a separate block.
+ *
+ * The contexts may cache the memory by keeping chunks, but it's limited to a
+ * single memory context (as AllocSet freelist), and only for the lifetime of
+ * a particular context instance. When the memory is reset/deleted, all the
+ * blocks are freed and returned to the OS (libc).
+ *
+ * There's a rudimentary cache of memory contexts blocks, but this only keeps
+ * the keeper blocks, not any other blocks that may be needed.
+ *
+ * Memory pools are an attempt to improve this by establishing a cache of blocks
+ * shared by all the memory contexts. A memory pool allocates blocks larger
+ * than 1kB, with doubling (1kB, 2kB, 4kB, ...). All the allocations come
+ * from memory contexts, and are either regular blocks (also starting at 1kB)
+ * or oversized chunks (a couple kB or larger). This means the lower limit
+ * is reasonable - there should be no smaller allocations.
+ *
+ * There's no explicit upper size limit - whatever could be used by palloc()
+ * can be requested from the pool. However, only blocks up to 8MB may be
+ * cached by the pool - larger allocations are not kept after pfree().
+ *
+ * To make the reuse possible, the blocks are grouped into size classes the
+ * same way AllocSet uses for chunks. There are 14 size classes, starting
+ * at 1kB and ending at 8MB.
+ *
+ * This "rouding" applies even to oversized chunks. So e.g. allocating 27kB
+ * will allocate a 32kB block. This wastes memory, but it means the block
+ * may be reused by "regular" allocations. The amount of wasted memory could
+ * be reduced by using size classes with smaller steps, but that reduces the
+ * likelihood of reusing the block.
+ */
+
+
+#define MEMPOOL_MIN_BLOCK 1024L /* smallest cached block */
+#define MEMPOOL_MAX_BLOCK (8*1024L*1024L) /* largest cached block */
+#define MEMPOOL_SIZES 14 /* 1kB -> 8MB */
+
+/*
+ * Maximum amount of memory to keep in cache for all size buckets. Sets a
+ * safety limit on the blocks kept in the *cached* part of the
+ * pool. Each bucket starts with the same amount of memory (1/14 of this)
+ * and then we adapt the cache depending on cache hits/misses.
+ */
+#define MEMPOOL_SIZE_MAX (128*1024L*1024L)
+
+/*
+ * Maximum number of blocks kept for the whole memory pool. This is used
+ * only to allocate the entries, so we assume all are in the smallest size
+ * bucket.
+ */
+#define MEMPOOL_MAX_BLOCKS (MEMPOOL_SIZE_MAX / MEMPOOL_MIN_BLOCK)
+
+/*
+ * How often to rebalance the memory pool buckets (number of allocations).
+ * This is a tradeoff between the pool being adaptive and more overhead.
+ */
+#define MEMPOOL_REBALANCE_DISTANCE 25000
+
+/*
+ * To enable debug logging for the memory pool code, build with -DMEMPOOL_DEBUG.
+ */
+#ifdef MEMPOOL_DEBUG
+
+#undef MEMPOOL_DEBUG
+#define MEMPOOL_RANDOMIZE(ptr, size) memset((ptr), 0x7f, (size))
+#define MEMPOOL_DEBUG(...) fprintf (stderr, __VA_ARGS__)
+
+#else
+
+#define MEMPOOL_DEBUG(...)
+#define MEMPOOL_RANDOMIZE(ptr, size)
+
+#endif /* MEMPOOL_DEBUG */
+
+
+/*
+ * Entries for a simple linked list of blocks to reuse.
+ */
+typedef struct MemPoolEntry
+{
+ void *ptr; /* allocated block (NULL in empty entries) */
+ struct MemPoolEntry *next;
+} MemPoolEntry;
+
+/*
+ * Information about allocations of blocks of a certain size. We track the number
+ * of currently cached blocks, and also the number of allocated blocks (still
+ * used by the memory context).
+ *
+ * maxcached is the maximum number of free blocks to keep in the cache
+ *
+ * maxallocated is the maximum number of concurrently allocated blocks (from the
+ * point of the memory context)
+ */
+typedef struct MemPoolBucket
+{
+ int nhits; /* allocation cache hits */
+ int nmisses; /* allocation cache misses */
+ int nallocated; /* number of currently allocated blocks */
+ int maxallocated; /* max number of allocated blocks */
+ int ncached; /* number of free blocks (entry list) */
+ int maxcached; /* max number of free blocks to cache */
+ MemPoolEntry *entry;
+} MemPoolBucket;
+
+/*
+ * MemPool - memory pool, caching allocations between memory contexts
+ *
+ * cache - stores free-d blocks that may be reused for future allocations,
+ * each slot is a list of MemPoolEntry elements using the "entries"
+ *
+ * entries - pre-allocated entries for the freelists, used by cache lists
+ *
+ * freelist - list of free cache entries (not used by the cache lists)
+ *
+ * The meaning of the freelist is somewhat inverse - when a block is freed
+ * by the memory context above, we need to add it to the cache. To do that
+ * we get an entry from the freelist, and add it to the cache. So free-ing
+ * a block removes an entry from the mempool freelist.
+ */
+typedef struct MemPool
+{
+ /* LIFO cache of free-d blocks of eligible sizes (1kB - 1MB, doubled) */
+ MemPoolBucket cache[MEMPOOL_SIZES];
+
+ /* pre-allocated entries for cache of free-d blocks */
+ MemPoolEntry entries[MEMPOOL_SIZES * MEMPOOL_MAX_BLOCKS];
+
+ /* head of freelist (entries from the array) */
+ MemPoolEntry *freelist;
+
+ /* memory limit / accounting */
+ int64 mem_allowed;
+ int64 mem_allocated;
+ int64 mem_cached;
+ int64 num_requests;
+} MemPool;
+
+static MemPool *pool = NULL;
+
+static void
+AssertCheckMemPool(MemPool *p)
+{
+#ifdef USE_ASSERT_CHECKING
+ MemPoolEntry *entry;
+ int nused = 0;
+ int nfree = 0;
+ int64 mem_cached = 0;
+ Size block_size = MEMPOOL_MIN_BLOCK;
+
+ Assert(p->mem_allocated >= 0);
+ Assert(p->mem_cached >= 0);
+
+ /* count the elements in the various cache buckets */
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ int count = 0;
+
+ Assert(p->cache[i].ncached >= 0);
+ Assert(p->cache[i].ncached <= p->cache[i].maxcached);
+
+ entry = p->cache[i].entry;
+
+ while (entry)
+ {
+ Assert(entry->ptr);
+
+ entry = entry->next;
+ count++;
+ }
+
+ Assert(count == p->cache[i].ncached);
+
+ nused += count;
+ mem_cached += (count * block_size);
+
+ block_size *= 2;
+ }
+
+ /* now count the elements in the freelist */
+ entry = p->freelist;
+ while (entry)
+ {
+ nfree++;
+ entry = entry->next;
+ }
+
+ Assert(nfree + nused == MEMPOOL_SIZES * MEMPOOL_MAX_BLOCKS);
+ Assert(mem_cached == p->mem_cached);
+#endif
+}
+
+static void MemoryPoolRebalanceBuckets(void);
+static void MemoryPoolEnforceSizeLimit(Size request_size, int index);
+
+/*
+ * MemoryPoolInit
+ * initialize the global memory pool
+ *
+ * Initialize the overall memory pool structure, and also link all entries
+ * into a freelist.
+ */
+static void
+MemoryPoolInit(void)
+{
+ Size size = MEMPOOL_MIN_BLOCK;
+
+ /* bail out if already initialized */
+ if (pool)
+ return;
+
+ /* allocate the basic structure */
+ pool = malloc(sizeof(MemPool));
+ memset(pool, 0, sizeof(MemPool));
+
+ /* initialize the freelist - put all entries on the list */
+ pool->freelist = &pool->entries[0];
+
+ for (int i = 0; i < (MEMPOOL_SIZES * MEMPOOL_MAX_BLOCKS - 1); i++)
+ {
+ if (i < (MEMPOOL_SIZES * MEMPOOL_MAX_BLOCKS - 1))
+ pool->entries[i].next = &pool->entries[i+1];
+ else
+ pool->entries[i].next = NULL;
+ }
+
+ /* set default maximum counts of entries for each size class */
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ pool->cache[i].maxcached = (MEMPOOL_SIZE_MAX / MEMPOOL_SIZES / size);
+ size *= 2;
+ }
+
+ AssertCheckMemPool(pool);
+}
+
+/*
+ * MemoryPoolEntrySize
+ * calculate the size of the block to allocate for a given request size
+ *
+ * The request sizes are grouped into pow(2,n) classes, starting at 1kB and
+ * ending at 8MB. Which means there are 14 size classes.
+ */
+static Size
+MemoryPoolEntrySize(Size size)
+{
+ Size result;
+
+ /*
+ * We shouldn't really get many malloc() for such small elements through
+ * memory contexts, so just use the smallest block.
+ */
+ if (size < MEMPOOL_MIN_BLOCK)
+ return MEMPOOL_MIN_BLOCK;
+
+ /*
+ * We can get various large allocations - we don't want to cache those,
+ * not waste space on doubling them, so just allocate them directly.
+ * Maybe the limit should be separate/lower, like 1MB.
+ */
+ if (size > MEMPOOL_MAX_BLOCK)
+ return size;
+
+ /*
+ * Otherwise just calculate the first block larger than the request.
+ *
+ * XXX Maybe there's a better way to calculate this? The number of loops
+ * should be very low, though (less than MEMPOOL_SIZES, i.e. 14).
+ */
+ result = MEMPOOL_MIN_BLOCK;
+ while (size > result)
+ result *= 2;
+
+ MEMPOOL_DEBUG("%d MempoolEntrySize %lu => %lu\n", getpid(), size, result);
+
+ /* the block size has to be sufficient for the requested size */
+ Assert(size <= result);
+
+ return result;
+}
+
+/*
+ * MemoryPoolEntryIndex
+ * Calculate the cache index for a given entry size.
+ *
+ * XXX Always called right after MemoryPoolEntrySize, so maybe it should be
+ * merged into a single function, so that the loop happens only once.
+ */
+static int
+MemoryPoolEntryIndex(Size size)
+{
+ int blockIndex = 0;
+ Size blockSize = MEMPOOL_MIN_BLOCK;
+
+ /* is size possibly in cache? */
+ if (size < MEMPOOL_MIN_BLOCK || size > MEMPOOL_MAX_BLOCK)
+ return -1;
+
+ /* calculate where to maybe cache the entry */
+ while (blockSize <= MEMPOOL_MAX_BLOCK)
+ {
+ Assert(size >= blockSize);
+
+ if (size == blockSize)
+ {
+ Assert(blockIndex < MEMPOOL_SIZES);
+ return blockIndex;
+ }
+
+ blockIndex++;
+ blockSize *= 2;
+ }
+
+ /* not eligible for caching after all */
+ return -1;
+}
+
+/*
+ * Check that the entry size is valid and matches the class index - if smaller
+ * than 8MB, it needs to be in one of the valid classes.
+ */
+static void
+AssertCheckEntrySize(Size size, int cacheIndex)
+{
+#ifdef USE_ASSERT_CHECKING
+ int blockSize = MEMPOOL_MIN_BLOCK;
+ int blockIndex = 0;
+
+ Assert(cacheIndex >= -1 && cacheIndex < MEMPOOL_SIZES);
+
+ /* all sizes in the valid range should be in one of the slots */
+ if (cacheIndex == -1)
+ Assert(size < MEMPOOL_MIN_BLOCK || size > MEMPOOL_MAX_BLOCK);
+ else
+ {
+ /* calculate the block size / index for the given size */
+ while (size > blockSize)
+ {
+ blockSize *= 2;
+ blockIndex++;
+ }
+
+ Assert(size == blockSize);
+ Assert(cacheIndex == blockIndex);
+ }
+#endif
+}
+
+/*
+ * MemoryPoolAlloc
+ * Allocate a block from the memory pool.
+ *
+ * The block may come either from cache - if available - or from malloc().
+ */
+void *
+MemoryPoolAlloc(Size size)
+{
+ int index;
+ void *ptr;
+
+ MemoryPoolInit();
+
+ pool->num_requests++;
+
+ MemoryPoolRebalanceBuckets();
+
+ /* maybe override the requested size */
+ size = MemoryPoolEntrySize(size);
+ index = MemoryPoolEntryIndex(size);
+
+ /* cross-check the size and index */
+ AssertCheckEntrySize(size, index);
+
+ /* try to enforce the memory limit */
+ MemoryPoolEnforceSizeLimit(size, index);
+
+ /* Is the block eligible to be in the cache? Or is it too large/small? */
+ if (index >= 0)
+ {
+ MemPoolEntry *entry = pool->cache[index].entry;
+
+ /*
+ * update the number of allocated chunks, and the high watermark
+ *
+ * We do this even if there's no entry in the cache.
+ */
+ pool->cache[index].nallocated++;
+ pool->cache[index].maxallocated = Max(pool->cache[index].nallocated,
+ pool->cache[index].maxallocated);
+
+ /*
+ * If we have a cached block for this size, we're done. Remove it
+ * from the cache and return the entry to the freelist.
+ */
+ if (entry != NULL)
+ {
+ /* remember the pointer (we'll reset the entry) */
+ ptr = entry->ptr;
+ entry->ptr = NULL;
+
+ /* remove the entry from the cache */
+ pool->cache[index].entry = entry->next;
+ pool->cache[index].ncached--;
+
+ /* return the entry to the freelist */
+ entry->next = pool->freelist;
+ pool->freelist = entry;
+
+ MEMPOOL_RANDOMIZE(ptr, size);
+ MEMPOOL_DEBUG("%d MemoryPoolAlloc %lu => %d %p HIT\n", getpid(), size, index, ptr);
+
+ /* update memory accounting */
+ Assert(pool->mem_cached >= size);
+
+ pool->mem_cached -= size;
+ pool->mem_allocated += size;
+
+ pool->cache[index].nhits++;
+
+ AssertCheckMemPool(pool);
+
+ return ptr;
+ }
+
+ pool->cache[index].nmisses++;
+ }
+
+ /*
+ * Either too small/large for the cache, or there's no available block of
+ * the right size.
+ */
+ ptr = malloc(size);
+
+ MEMPOOL_RANDOMIZE(ptr, size);
+ MEMPOOL_DEBUG("%d MemoryPoolAlloc %lu => %d %p MISS\n", getpid(), size, index, ptr);
+
+ /* update memory accounting */
+ pool->mem_allocated += size;
+
+ /* maybe we should track the number of over-sized allocations too? */
+ // pool->cache_misses++;
+
+ AssertCheckMemPool(pool);
+
+ return ptr;
+}
+
+/*
+ * MemoryPoolShouldCache
+ * Should we put the entry into cache at the given index?
+ */
+static bool
+MemoryPoolShouldCache(Size size, int index)
+{
+ MemPoolBucket *entry = &pool->cache[index];
+
+ /* not in any pool bucket */
+ if (index == -1)
+ return false;
+
+ /*
+ * Bail out if no freelist entries.
+ *
+ * XXX This shouldn't be possible, as we size the freelist as if all classes
+ * could have the maximum number of entries (but the actual number drops to
+ * 1/2 with each size class).
+ */
+ if (!pool->freelist)
+ return false;
+
+ /* Memory limit is set, and we'd exceed it? Don't cache. */
+ if ((pool->mem_allowed > 0) &&
+ (pool->mem_allocated + pool->mem_cached + size > pool->mem_allowed))
+ return false;
+
+ /* Did we already reach the maximum size of the size class? */
+ return (entry->ncached < entry->maxcached);
+}
+
+/*
+ * MemoryPoolFree
+ * Free a block, maybe add it to the memory pool cache.
+ */
+void
+MemoryPoolFree(void *pointer, Size size)
+{
+ int index = 0;
+
+ MemoryPoolInit();
+
+ /*
+ * Override the requested size (provided by the memory context), calculate
+ * the appropriate size class index.
+ */
+ size = MemoryPoolEntrySize(size);
+ index = MemoryPoolEntryIndex(size);
+
+ AssertCheckEntrySize(size, index);
+
+ /* check that we've correctly accounted for this block during allocation */
+ Assert(pool->mem_allocated >= size);
+
+ /*
+ * update the number of allocated blocks (if eligible for cache)
+ *
+ * XXX Needs to happen even if we don't add the block to the cache.
+ */
+ if (index != -1)
+ pool->cache[index].nallocated--;
+
+ /*
+ * Should we cache this entry? Do we have entries for the freelist, and
+ * do we have free space in the size class / memory pool as a whole?
+ */
+ if (MemoryPoolShouldCache(size, index))
+ {
+ MemPoolEntry *entry;
+
+ entry = pool->freelist;
+ pool->freelist = entry->next;
+
+ /* add the entry to the cache, update number of entries in this bucket */
+ entry->next = pool->cache[index].entry;
+ pool->cache[index].entry = entry;
+ pool->cache[index].ncached++;
+
+ entry->ptr = pointer;
+
+ MEMPOOL_RANDOMIZE(pointer, size);
+ MEMPOOL_DEBUG("%d MemoryPoolFree %lu => %d %p ADD\n", getpid(), size, index, pointer);
+
+ /* update accounting */
+ pool->mem_cached += size;
+ pool->mem_allocated -= size;
+
+ AssertCheckMemPool(pool);
+
+ return;
+ }
+
+ MEMPOOL_RANDOMIZE(pointer, size);
+ MEMPOOL_DEBUG("%d MemoryPoolFree %lu => %d FULL\n", getpid(), size, index);
+
+ /* update accounting */
+ pool->mem_allocated -= size;
+
+ AssertCheckMemPool(pool);
+
+ free(pointer);
+}
+
+/*
+ * MemoryPoolRealloc
+ * reallocate a previously allocated block
+ *
+ * XXX Maybe this should use the cache too. Right now we just call realloc()
+ * after updating the cache counters. And maybe it should enforce the memory
+ * limit, just like we do in MemoryPoolAlloc().
+ */
+void *
+MemoryPoolRealloc(void *pointer, Size oldsize, Size newsize)
+{
+ void *ptr;
+
+ int oldindex,
+ newindex;
+
+ MemoryPoolInit();
+
+ oldsize = MemoryPoolEntrySize(oldsize);
+ newsize = MemoryPoolEntrySize(newsize);
+
+ /* XXX Maybe if (oldsize >= newsize) we don't need to do anything? */
+
+ oldindex = MemoryPoolEntryIndex(oldsize);
+ newindex = MemoryPoolEntryIndex(newsize);
+
+ if (oldindex != -1)
+ pool->cache[oldindex].nallocated--;
+
+ if (newindex != -1)
+ {
+ pool->cache[newindex].nallocated++;
+ pool->cache[newindex].maxallocated = Max(pool->cache[newindex].nallocated,
+ pool->cache[newindex].maxallocated);
+ }
+
+ MEMPOOL_DEBUG("%d MemoryPoolRealloc old %lu => %p\n", getpid(), oldsize, pointer);
+
+ ptr = realloc(pointer, newsize);
+
+ MEMPOOL_DEBUG("%d MemoryPoolRealloc new %lu => %p\n", getpid(), newsize, ptr);
+
+ /* update accounting */
+ Assert(pool->mem_allocated >= oldsize);
+
+ pool->mem_allocated -= oldsize;
+ pool->mem_allocated += newsize;
+
+ AssertCheckMemPool(pool);
+
+ return ptr;
+}
+
+/*
+ * MemoryPoolRebalanceBuckets
+ * Rebalance the cache capacity for difference size classes.
+ *
+ * The goal of the rebalance is to adapt the cache capacity to changes in the
+ * workload - release blocks of sizes that are no longer needed, allow caching
+ * for new block sizes etc.
+ *
+ * The rebalance happens every MEMPOOL_REBALANCE_DISTANCE allocations - it needs
+ * to happen often enough to adapt to the workload changes, but not too often
+ * to cause significant overhead. The distance also needs to be sufficient to
+ * have a reasonable representation of the allocations.
+ *
+ * The rebalance happens in three phases:
+ *
+ * 1) shrink oversized buckets (maxallocated < maxcached)
+ *
+ * 2) enlarge undersized buckets (maxcached < maxallocated)
+ *
+ * 3) distribute remaining capacity (if any) uniformly
+ *
+ * The reduction in (1) is gradual, i.e. instead of setting maxcached to the
+ * maxallocated value (which may be seen as the minimum capacity needed), we
+ * only go halfway there. The intent is to dampen the transition in case the
+ * current counter is not entirely representative.
+ *
+ * The bucket enlarging in step (2) is proportional to the number of misses
+ * for each bucket (with respect to the total number of misses in the buckets
+ * that are too small). We however don't oversize the bucket - we assign at
+ * most (maxallocated - maxcached) entries, not more in this step.
+ *
+ * Finally, we simply take the remaining unallocated/unassigned memory (up to
+ * MEMPOOL_SIZE_MAX), and distribute it to all the buckets uniformly. That is,
+ * each bucket gets the same amount (rounded to entries of appropriate size).
+ *
+ * XXX Maybe we should have a parameter for the dampening factor in (1), and
+ * not just use 0.5. For example, maybe 0.75 would be better?
+ *
+ * XXX This assumes misses for different buckets are equally expensive, but
+ * that may not be the case. It's likely a miss is proportional to the size
+ * of the block, so maybe we should consider that and use the size as weight
+ * for the cache miss.
+ */
+static void
+MemoryPoolRebalanceBuckets(void)
+{
+ Size block_size;
+ int64 redistribute_bytes;
+ int64 assigned_bytes = 0;
+ int64 num_total_misses = 0;
+
+ /* only do this once every MEMPOOL_REBALANCE_DISTANCE allocations */
+ if (pool->num_requests < MEMPOOL_REBALANCE_DISTANCE)
+ return;
+
+#ifdef MEMPOOL_DEBUG
+ /* print info about the cache and individual size buckets before the rebalance */
+ MEMPOOL_DEBUG("%d mempool rebalance requests %ld allowed %ld allocated %ld cached %ld\n",
+ getpid(), pool->num_requests,
+ pool->mem_allowed, pool->mem_allocated, pool->mem_cached);
+
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ MEMPOOL_DEBUG("%d mempool rebalance bucket %d hit %d miss %d (%.1f%%) maxcached %d cached %d maxallocated %d allocated %d\n",
+ getpid(), i, pool->cache[i].nhits, pool->cache[i].nmisses,
+ pool->cache[i].nhits * 100.0 / Max(1, pool->cache[i].nhits + pool->cache[i].nmisses),
+ pool->cache[i].maxcached, pool->cache[i].ncached,
+ pool->cache[i].maxallocated, pool->cache[i].nallocated);
+ }
+#endif
+
+ /*
+ * Are there buckets with cache that is unnecessarily large? That is, with
+ * (ncached + nallocated > maxallocated). If yes, we release half of that
+ * and put that into a budget that we can redistribute.
+ *
+ * XXX We release half to somewhat dampen the changes over time.
+ */
+ block_size = MEMPOOL_MIN_BLOCK;
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ /*
+ * If the cache is large enough to serve all allocations, try making it
+ * a bit smaller and cut half the extra space (and maybe also free the
+ * unnecessary blocks).
+ */
+ if (pool->cache[i].maxcached > pool->cache[i].maxallocated)
+ {
+ int nentries;
+
+ pool->cache[i].maxcached
+ = (pool->cache[i].maxcached + pool->cache[i].maxallocated) / 2;
+
+ nentries = (pool->cache[i].ncached + pool->cache[i].nallocated);
+ nentries -= pool->cache[i].maxcached;
+
+ /* release enough entries from the cache */
+ while (nentries > 0)
+ {
+ MemPoolEntry *entry = pool->cache[i].entry;
+
+ pool->cache[i].entry = entry->next;
+ pool->cache[i].ncached--;
+
+ free(entry->ptr);
+ entry->ptr = NULL;
+
+ /* add the entry to the freelist */
+ entry->next = pool->freelist;
+ pool->freelist = entry;
+
+ Assert(pool->mem_cached >= block_size);
+
+ /* update accounting */
+ pool->mem_cached -= block_size;
+
+ nentries--;
+ }
+ }
+
+ /* remember how many misses we saw in the undersized buckets */
+ num_total_misses += pool->cache[i].nmisses;
+
+ /* remember how much space we already allocated to this bucket */
+ assigned_bytes += (pool->cache[i].maxcached * block_size);
+
+ /* double the block size */
+ block_size = (block_size << 1);
+ }
+
+ /*
+ * How much memory we can redistribute? Start with the memory limit,
+ * and subtract the space currently allocated and assigned to cache.
+ */
+ redistribute_bytes = Max(pool->mem_allowed, MEMPOOL_SIZE_MAX);
+ redistribute_bytes -= (pool->mem_allocated);
+ redistribute_bytes -= assigned_bytes;
+
+ /*
+ * Make sure it's not negative (might happen if there's a lot of
+ * allocated memory).
+ */
+ redistribute_bytes = Max(0, redistribute_bytes);
+
+ MEMPOOL_DEBUG("%d mempool rebalance can redistribute %ld bytes, allocated %ld bytes, assigned %ld bytes, total misses %ld\n",
+ getpid(), redistribute_bytes, pool->mem_allocated, assigned_bytes, num_total_misses);
+
+ /*
+ * Redistribute the memory based on the number of misses, and reset the
+ * various counters, so that the next round begins afresh.
+ */
+ if (redistribute_bytes > 0)
+ {
+ block_size = MEMPOOL_MIN_BLOCK;
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ int64 nbytes;
+ int nentries;
+
+ /* Are we missing entries in cache for this slot? */
+ if (pool->cache[i].maxcached < pool->cache[i].maxallocated)
+ {
+ int nmissing = (pool->cache[i].maxallocated - pool->cache[i].maxcached);
+
+ /*
+ * How many entries we can add to this size bucket, based on the number
+ * of cache misses?
+ */
+ nbytes = redistribute_bytes * pool->cache[i].nmisses / Max(1, num_total_misses);
+ nentries = (nbytes / block_size);
+
+ /* But don't add more than we need. */
+ nentries = Min(nentries, nmissing);
+
+ pool->cache[i].maxcached += nentries;
+ assigned_bytes += nentries * block_size;
+ }
+
+ /* double the block size */
+ block_size = (block_size << 1);
+ }
+ }
+
+ MEMPOOL_DEBUG("%d mempool rebalance done allocated %ld bytes, assigned %ld bytes\n",
+ getpid(), pool->mem_allocated, assigned_bytes);
+
+ /*
+ * If we still have some memory, redistribute it uniformly.
+ */
+ redistribute_bytes = Max(pool->mem_allowed, MEMPOOL_SIZE_MAX);
+ redistribute_bytes -= (pool->mem_allocated);
+ redistribute_bytes -= assigned_bytes;
+
+ /*
+ * Make sure it's not negative (might happen if there's a lot of
+ * allocated memory).
+ */
+ redistribute_bytes = Max(0, redistribute_bytes);
+
+ MEMPOOL_DEBUG("%d mempool rebalance remaining bytes %ld, allocated %ld bytes, assigned %ld bytes\n",
+ getpid(), redistribute_bytes, pool->mem_allocated, assigned_bytes);
+
+ block_size = MEMPOOL_MIN_BLOCK;
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ int nentries = (redistribute_bytes / MEMPOOL_SIZES / block_size);
+
+ pool->cache[i].maxcached += nentries;
+
+ /* also reset the various counters */
+ pool->cache[i].maxallocated = pool->cache[i].nallocated;
+ pool->cache[i].nhits = 0;
+ pool->cache[i].nmisses = 0;
+
+ /* double the block size */
+ block_size = (block_size << 1);
+ }
+
+ MEMPOOL_DEBUG("%d mempool rebalance done\n", getpid());
+
+#ifdef MEMPOOL_DEBUG
+ /* print some info about cache hit ratio, but only once in a while */
+ block_size = MEMPOOL_MIN_BLOCK;
+ assigned_bytes = 0;
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ MEMPOOL_DEBUG("%d mempool rebalance bucket %d maxcached %d cached %d maxallocated %d allocated %d\n",
+ getpid(), i,
+ pool->cache[i].maxcached, pool->cache[i].ncached,
+ pool->cache[i].maxallocated, pool->cache[i].nallocated);
+
+ assigned_bytes += (pool->cache[i].maxcached * block_size);
+
+ /* double the block size */
+ block_size = (block_size << 1);
+ }
+ MEMPOOL_DEBUG("%d mempool rebalance allocated %ld assigned %ld (total %ld kB)\n",
+ getpid(), pool->mem_allocated, assigned_bytes,
+ (pool->mem_allocated + assigned_bytes) / 1024L);
+#endif
+
+ /* start new rebalance period */
+ pool->num_requests = 0;
+}
+
+/*
+ * MemoryPoolEnforceMaxCounts
+ * release cached blocks exceeding the maxcached for a given bucket
+ *
+ * XXX This gets called only from MemoryPoolSetSizeLimit, which updates the
+ * maxcount based on the memory limit. Maybe it should be integrated into
+ * that directly?
+ *
+ * XXX Or maybe we should simply do the rebalancing for the new limit?
+ */
+static void
+MemoryPoolEnforceMaxCounts(void)
+{
+ Size block_size = MEMPOOL_MAX_BLOCK;
+
+ /* nothing cached, so can't release anything */
+ if (pool->mem_cached == 0)
+ return;
+
+ /*
+ * Walk through the buckets, make sure that no bucket has too many cached
+ * entries.
+ */
+ for (int i = MEMPOOL_SIZES - 1; i >= 0; i--)
+ {
+ while (pool->cache[i].entry)
+ {
+ MemPoolEntry *entry = pool->cache[i].entry;
+
+ /* we're within the limit, bail out */
+ if (pool->cache[i].ncached <= pool->cache[i].maxcached)
+ break;
+
+ pool->cache[i].entry = entry->next;
+ pool->cache[i].ncached--;
+
+ free(entry->ptr);
+ entry->ptr = NULL;
+
+ /* add the entry to the freelist */
+ entry->next = pool->freelist;
+ pool->freelist = entry;
+
+ Assert(pool->mem_cached >= block_size);
+
+ /* update accounting */
+ pool->mem_cached -= block_size;
+ }
+
+ /* halve the block size (we walk from the largest bucket down) */
+ block_size = (block_size >> 1);
+ }
+
+ MEMPOOL_DEBUG("%d MemoryPoolEnforceMaxCounts allocated %ld cached %ld\n",
+ getpid(), pool->mem_allocated, pool->mem_cached);
+
+ AssertCheckMemPool(pool);
+}
+
+/*
+ * MemoryPoolEnforceSizeLimit
+ * Release cached blocks to allow allocating a block of a given size.
+ *
+ * If actually freeing blocks is needed, we free more of them, so that we don't
+ * need to do that too often. We free at least 2x the amount of space we need,
+ * or 25% of the limit, whichever is larger.
+ *
+ * We free memory from the largest blocks, because that's likely to free memory
+ * the fastest. And we don't allocate those very often.
+ *
+ * XXX Maybe we should free memory in the smaller classes too, so that we don't
+ * end up keeping many unnecessary old blocks, while thrashing the large class.
+ */
+static void
+MemoryPoolEnforceSizeLimit(Size request_size, int index)
+{
+ int64 threshold,
+ needtofree;
+
+ Size block_size = MEMPOOL_MAX_BLOCK;
+
+ /* no memory limit set */
+ if (pool->mem_allowed == 0)
+ return;
+
+ /* nothing cached, so can't release anything */
+ if (pool->mem_cached == 0)
+ return;
+
+ /*
+ * With the new request, would we exceed the memory limit? We need
+ * to count both the allocated and cached memory.
+ *
+ * XXX In principle the block may be already available in cache, in which
+ * case we don't need to add it to the allocated + cached figure.
+ */
+ if (pool->mem_allocated + pool->mem_cached + request_size <= pool->mem_allowed)
+ return;
+
+ /*
+ * How much do we need to release? We don't want to free up just enough
+ * for the one request, but a bit more, to prevent thrashing.
+ */
+ threshold = Min(Max(0, pool->mem_allowed - 2 * request_size),
+ pool->mem_allowed * 0.75);
+
+ Assert((threshold >= 0) && (threshold < pool->mem_allowed));
+
+ /*
+ * How much do we need to free to get under the threshold? We can't free
+ * more than we have in the cache, though.
+ *
+ * XXX Once we free at least this amount of memory, we're done.
+ */
+ needtofree = (pool->mem_allocated + pool->mem_cached + request_size) - threshold;
+ needtofree = Min(needtofree, pool->mem_cached);
+
+ MEMPOOL_DEBUG("%d MemoryPoolMaybeShrink total %ld cached %ld threshold %ld needtofree %ld\n",
+ getpid(), pool->mem_allocated + pool->mem_cached, pool->mem_cached, threshold, needtofree);
+
+ /* Release cached blocks, walking from the largest bucket down. */
+ for (int i = MEMPOOL_SIZES - 1; i >= 0; i--)
+ {
+ /* did we free enough memory? */
+ if (needtofree <= 0)
+ break;
+
+ while (pool->cache[i].entry)
+ {
+ MemPoolEntry *entry = pool->cache[i].entry;
+
+ pool->cache[i].entry = entry->next;
+ pool->cache[i].ncached--;
+
+ free(entry->ptr);
+ entry->ptr = NULL;
+
+ /* add the entry to the freelist */
+ entry->next = pool->freelist;
+ pool->freelist = entry;
+
+ needtofree -= block_size;
+
+ /* did we free enough memory? */
+ if (needtofree <= 0)
+ break;
+ }
+
+ block_size = (block_size >> 1);
+ }
+
+ MEMPOOL_DEBUG("%d MemoryPoolEnforceMemoryLimit allocated %ld cached %ld needtofree %ld\n",
+ getpid(), pool->mem_allocated, pool->mem_cached, needtofree);
+
+ AssertCheckMemPool(pool);
+}
+
+/*
+ * MemoryPoolSetSizeLimit
+ * Set size limit for the memory pool.
+ */
+void
+MemoryPoolSetSizeLimit(int64 size)
+{
+ Size blksize = MEMPOOL_MIN_BLOCK;
+ Size maxsize;
+
+ Assert(pool);
+ Assert(size >= 0);
+
+ pool->mem_allowed = size;
+
+ /* also update the max number of entries for each class size */
+
+ if (size > 0)
+ maxsize = size / MEMPOOL_SIZES;
+ else
+ maxsize = MEMPOOL_SIZE_MAX;
+
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ pool->cache[i].maxcached = (maxsize / blksize);
+ blksize *= 2;
+ }
+
+ /* enforce the updated maxcached limit */
+ MemoryPoolEnforceMaxCounts();
+
+ /* also enforce the general memory limit */
+ MemoryPoolEnforceSizeLimit(0, -1);
+}
+
+/*
+ * MemoryPoolGetSizeAndCounts
+ */
+void
+MemoryPoolGetSizeAndCounts(int64 *mem_allowed, int64 *mem_allocated, int64 *mem_cached,
+ int64 *cache_hits, int64 *cache_misses)
+{
+ Assert(pool);
+
+ *mem_allowed = pool->mem_allowed;
+ *mem_allocated = pool->mem_allocated;
+ *mem_cached = pool->mem_cached;
+
+ *cache_hits = 0;
+ *cache_misses = 0;
+
+ for (int i = 0; i < MEMPOOL_SIZES; i++)
+ {
+ *cache_hits += pool->cache[i].nhits;
+ *cache_misses += pool->cache[i].nmisses;
+ }
+}
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index ca7858d6b66..db94e74ccd6 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -179,4 +179,13 @@ extern MemoryContext GenerationContextCreate(MemoryContext parent,
#define SLAB_DEFAULT_BLOCK_SIZE (8 * 1024)
#define SLAB_LARGE_BLOCK_SIZE (8 * 1024 * 1024)
+extern void *MemoryPoolAlloc(Size size);
+extern void *MemoryPoolRealloc(void *pointer, Size oldsize, Size size);
+extern void MemoryPoolFree(void *pointer, Size size);
+
+extern void MemoryPoolSetSizeLimit(int64 size);
+extern void MemoryPoolGetSizeAndCounts(int64 *mem_limit,
+ int64 *mem_allocated, int64 *mem_cached,
+ int64 *cache_hits, int64 *cache_misses);
+
#endif /* MEMUTILS_H */
--
2.43.0
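For readers who don't want to wade through the whole diff, the core of the mempool
is just a per-size-class cache of freed blocks. Below is a minimal standalone
sketch of the same idea (made-up names; it leaves out the accounting, the memory
limit, and the rebalancing done in the actual patch):

/*
 * Standalone sketch of the size-class caching idea (not the patch itself):
 * freed blocks go on a per-power-of-two freelist and are handed back to
 * later allocations of the same class instead of calling malloc() again.
 */
#include <stdio.h>
#include <stdlib.h>

#define MIN_BLOCK	1024	/* smallest cached class (1kB) */
#define NUM_CLASSES	8		/* 1kB .. 128kB */
#define MAX_CACHED	32		/* per-class cap on cached blocks */

typedef struct CachedBlock
{
	struct CachedBlock *next;
} CachedBlock;

static CachedBlock *cache[NUM_CLASSES];
static int	ncached[NUM_CLASSES];

/* smallest class that fits the request, or -1 if it's too large to cache */
static int
size_class(size_t size)
{
	size_t		block = MIN_BLOCK;

	for (int i = 0; i < NUM_CLASSES; i++, block <<= 1)
	{
		if (size <= block)
			return i;
	}
	return -1;
}

static void *
pool_alloc(size_t size)
{
	int			c = size_class(size);

	if (c >= 0 && cache[c] != NULL)
	{
		CachedBlock *b = cache[c];

		cache[c] = b->next;		/* cache hit: reuse a freed block */
		ncached[c]--;
		return b;
	}

	/* cache miss (or oversized request): round up to the class size */
	return malloc(c >= 0 ? ((size_t) MIN_BLOCK << c) : size);
}

static void
pool_free(void *ptr, size_t size)
{
	int			c = size_class(size);

	if (c >= 0 && ncached[c] < MAX_CACHED)
	{
		CachedBlock *b = (CachedBlock *) ptr;

		b->next = cache[c];		/* keep the block for the next allocation */
		cache[c] = b;
		ncached[c]++;
		return;
	}

	free(ptr);					/* over the cap: hand it back to libc */
}

int
main(void)
{
	void	   *p = pool_alloc(27 * 1024);	/* think BTScanOpaqueData */

	pool_free(p, 27 * 1024);
	printf("blocks cached in class %d: %d\n",
		   size_class(27 * 1024), ncached[size_class(27 * 1024)]);
	return 0;
}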
On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:
Hi Tomas!
I'll comment on glibc-malloc part as I studied that part last year, and
proposed some things here: https://www.postgresql.org/message-id/
3424675.QJadu78ljV%40aivenlaptop
FWIW where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt it's the only such allocation.
Did you try running strace on the process? That may give you some
insight into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.
An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.
FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.
GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" beheviour and allow you to control what it's doing.
By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using
sbrk isn't freed as easily, and you don't run into a pattern of moving the
sbrk pointer up and down repeatedly. The automatic trade-off between the mmap
and trim thresholds is supposed to prevent that, but the way it is incremented
means you can end up in a bad place depending on your particular allocation
pattern.
Best regards,
--
Ronan Dunklau
On 1/29/24 09:53, Ronan Dunklau wrote:
On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:
Hi Tomas!
I'll comment on glibc-malloc part as I studied that part last year, and
proposed some things here: https://www.postgresql.org/message-id/
3424675.QJadu78ljV%40aivenlaptop
Thanks for reminding me. I'll re-read that thread.
FWIW where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:
so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt it's the only such allocation.
Did you try running strace on the process? That may give you some
insight into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.
No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.
An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.
Thanks, I'll take a look at that.
FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.
GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.
By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using
sbrk isn't freed as easily, and you don't run into a pattern of moving the
sbrk pointer up and down repeatedly. The automatic trade-off between the mmap
and trim thresholds is supposed to prevent that, but the way it is incremented
means you can end up in a bad place depending on your particular allocation
pattern.
So, what values would you recommend for these parameters?
My concern is increasing those values would lead to (much) higher memory
usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Monday, January 29, 2024 at 13:17:07 CET, Tomas Vondra wrote:
Did you try running strace on the process? That may give you some
insight into what malloc is doing. A more sophisticated approach would
be using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.
No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.
It would tell you how malloc actually performs your allocations, and how often
they end up translated into syscalls. The main issue with glibc would be that
it releases the memory too aggressively to the OS, IMO.
An important part of glibc's malloc behaviour in that regard comes from
the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.
Thanks, I'll take a look at that.
FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.
GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.
By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
using sbrk isn't freed as easily, and you don't run into a pattern of
moving the sbrk pointer up and down repeatedly. The automatic trade-off
between the mmap and trim thresholds is supposed to prevent that, but the
way it is incremented means you can end up in a bad place depending on your
particular allocation pattern.
So, what values would you recommend for these parameters?
My concern is increasing those values would lead to (much) higher memory
usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.
Right now depending on your workload (especially if you use connection
pooling) you can end up with something like 32 or 64MB of dynamically adjusted
trim-threshold which will never be released back.
The first heuristic I had in mind was to set it to work_mem, up to a
"reasonable" limit I guess. One can argue that it is expected for a backend to
use work_mem frequently, and as such it shouldn't be released back. By setting
work_mem to a lower value, we could ask glibc at the same time to trim the
excess kept memory. That could be useful when a long-lived connection is
pooled, and sees a spike in memory usage only once. Currently that could well
end up with 32MB "wasted" permanently but tuning it ourselves could allow us
to release it back.
Since it was last year I worked on this, I'm a bit fuzzy on the details but I
hope this helps.
On 1/29/24 15:15, Ronan Dunklau wrote:
On Monday, January 29, 2024 at 13:17:07 CET, Tomas Vondra wrote:
Did you try running strace on the process? That may give you some
insight into what malloc is doing. A more sophisticated approach would
be using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.
No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.
It would tell you how malloc actually performs your allocations, and how often
they end up translated into syscalls. The main issue with glibc would be that
it releases the memory too aggressively to the OS, IMO.
An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.
Thanks, I'll take a look at that.
FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.
GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" behaviour and allow you to control what it's doing.
By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
using sbrk isn't freed as easily, and you don't run into a pattern of
moving the sbrk pointer up and down repeatedly. The automatic trade-off
between the mmap and trim thresholds is supposed to prevent that, but the
way it is incremented means you can end up in a bad place depending on your
particular allocation pattern.
So, what values would you recommend for these parameters?
My concern is increasing those values would lead to (much) higher memory
usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.
Right now depending on your workload (especially if you use connection
pooling) you can end up with something like 32 or 64MB of dynamically adjusted
trim-threshold which will never be released back.
OK, so let's say I expect each backend to use ~90MB of memory (allocated
at once through memory contexts). How would you set the two limits? By
default it's set to 128kB, which means blocks larger than 128kB are
mmap-ed and released immediately.
But there's very few such allocations - a vast majority of blocks in the
benchmark workloads is <= 8kB or ~27kB (those from btbeginscan).
So I'm thinking about leaving M_MMAP_THRESHOLD as is, but increasing the
M_TRIM_THRESHOLD value to a couple MBs. But I doubt that'll really help,
because what I expect to happen is we execute a query and it allocates
all memory up to a high watermark of ~90MB. And then the query
completes, and we release almost all of it. And even with trim threshold
set to e.g. 8MB we'll free almost all of it, no?
What we want to do is say - hey, we needed 90MB, and now we need 8MB. We
could free 82MB, but maybe let's wait a bit and see if we need that
memory again. And that's pretty much what the mempool does, but I don't
see how to do that using the mmap options.
The first heuristic I had in mind was to set it to work_mem, up to a
"reasonable" limit I guess. One can argue that it is expected for a backend to
use work_mem frequently, and as such it shouldn't be released back. By setting
work_mem to a lower value, we could ask glibc at the same time to trim the
excess kept memory. That could be useful when a long-lived connection is
pooled, and sees a spike in memory usage only once. Currently that could well
end up with 32MB "wasted" permanently but tuning it ourselves could allow us
to release it back.
I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.
The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.
But then there's the problem that the mmap parameters don't tell
us how much memory to keep, but how large chunks to release.
Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it takes just a little bit of extra memory to release all the
memory, or something.
Since it was last year I worked on this, I'm a bit fuzzy on the details but I
hope this helps.
Thanks for the feedback / insights!
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Monday, January 29, 2024 at 15:59:04 CET, Tomas Vondra wrote:
I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.
I understand your point, I was basing my previous observations on what a
backend typically does during the execution.
The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.
But then there's the problem that the mmap parameters don't tell
us how much memory to keep, but how large chunks to release.
Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it takes just a little bit of extra memory to release all the
memory, or something.
For doing this you can set M_TOP_PAD using glibc malloc, which makes sure a
certain amount of memory is always kept.
But the way the dynamic adjustment works makes it sort-of work like this.
MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't
expect to keep much memory around.
So even "small" memory allocations will be served using mmap at first. Once
mmaped memory is released, glibc considers it a benchmark for "normal"
allocations that can be routinely freed, and adjusts mmap_threshold to the
released mmaped region size, and trim threshold to two times that.
It means over time the two values will converge either to the max value (32MB
for MMAP_THRESHOLD, 64MB for trim threshold) or to something big enough to
accommodate your released memory, since anything bigger than half the trim
threshold will be allocated using mmap.
Setting either parameter disables that.
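For example, if I read that mechanism right: when a 4MB block that had been
served by mmap() gets freed, glibc bumps mmap_threshold to 4MB and
trim_threshold to 8MB, so later allocations up to 4MB are served from the
heap, and the heap top is only trimmed back once more than 8MB of contiguous
free space accumulates there.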
But I'm not arguing against the mempool, just chiming in with glibc's malloc
tuning possibilities :-)
On 1/29/24 16:42, Ronan Dunklau wrote:
On Monday, January 29, 2024 at 15:59:04 CET, Tomas Vondra wrote:
I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.
I understand your point, I was basing my previous observations on what a
backend typically does during the execution.
The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.
But then there's the problem that the mmap parameters don't tell
us how much memory to keep, but how large chunks to release.
Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it takes just a little bit of extra memory to release all the
memory, or something.
For doing this you can set M_TOP_PAD using glibc malloc, which makes sure a
certain amount of memory is always kept.
But the way the dynamic adjustment works makes it sort-of work like this.
MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't
expect to keep much memory around.
So even "small" memory allocations will be served using mmap at first. Once
mmaped memory is released, glibc considers it a benchmark for "normal"
allocations that can be routinely freed, and adjusts mmap_threshold to the
released mmaped region size, and trim threshold to two times that.
It means over time the two values will converge either to the max value (32MB
for MMAP_THRESHOLD, 64MB for trim threshold) or to something big enough to
accommodate your released memory, since anything bigger than half the trim
threshold will be allocated using mmap.
Setting either parameter disables that.
Thanks. I gave this a try, and I started the tests with this setting:
export MALLOC_TOP_PAD_=$((64*1024*1024))
export MALLOC_MMAP_THRESHOLD_=$((1024*1024))
export MALLOC_TRIM_THRESHOLD_=$((1024*1024))
which I believe means that:
1) we'll keep 64MB "extra" memory on top of heap, serving as a cache for
future allocations
2) everything below 1MB (so most of the blocks we allocate for contexts)
will be allocated on heap (hence from the cache)
3) we won't trim heap unless there's at least 1MB of free contiguous
space (I wonder if this should be the same as MALLOC_TOP_PAD)
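For the record, the same three knobs can also be set from inside a process
with mallopt() - the sketch below is plain glibc API usage (not something any
of the patches here do), shown just to make the settings above concrete:

#include <malloc.h>		/* mallopt(), M_TOP_PAD, M_MMAP_THRESHOLD, M_TRIM_THRESHOLD */
#include <stdio.h>

int
main(void)
{
	int			ok = 1;

	/* keep ~64MB of extra heap around as a cache for future allocations */
	ok &= mallopt(M_TOP_PAD, 64 * 1024 * 1024);

	/* serve requests below 1MB from the heap rather than via mmap() */
	ok &= mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

	/* only trim the heap once >= 1MB of contiguous free space sits on top */
	ok &= mallopt(M_TRIM_THRESHOLD, 1024 * 1024);

	if (!ok)
		fprintf(stderr, "mallopt() failed\n");

	/* ... allocations from here on use the fixed thresholds ... */
	return 0;
}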
Those are mostly arbitrary values / guesses, and I don't have complete
results yet. But from the results I have it seems this has almost the
same effect as the mempool thing - see the attached PDF, with results
for the "partitioned join" benchmark.
first column - "master" (17dev) with no patches, default glibc
second column - 17dev + locking + mempool, default glibc
third column - 17dev + locking, tuned glibc
The color scale on the right is throughput comparison (third/second), as
a percentage with e.g. 90% meaning tuned glibc is 10% slower than the
mempool results. Most of the time it's slower but very close to 100%,
sometimes it's a bit faster. So overall it's roughly the same.
The color scales below the results is a comparison of each branch to the
master (without patches), showing comparison to current performance.
It's almost the same, although the tuned glibc has a couple regressions
that the mempool does not have.
But I'm not arguing against the mempool, just chiming in with glibc's malloc
tuning possibilities :-)
Yeah. I think the main problem with the glibc parameters is that it's
very implementation-specific and also static - the mempool is more
adaptive, I think. But it's an interesting experiment.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.
I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.
What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
I think this is a reasonable approach. Some comments:
- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.
- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:
+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)
When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.
Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 6/24/24 17:05, Robert Haas wrote:
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.
I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.
Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not
affect that. Or if it does, it's likely a reasonable trade off as it
reduces the risk of lock contention.
That being said, I don't recall benchmarking this patch in isolation,
only with the other patches. Maybe I should do that ...
What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
I think this is a reasonable approach. Some comments:
- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.
OK. I didn't realize global variables start at zero.
- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:
+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)
Yeah, definitely needs comment explaining this.
I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.
When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.
I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me of how
set-associative caches work (e.g. with cachelines in CPU caches). It's a
16-way associative cache, assigning each entry into one of 16 slots.
I must have been reading some papers in this area shortly before the PoC
patch, and the idea came from there, probably. Which is good, because it
means it's a well-understood and widely-used approach.
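To make that concrete, here is a stripped-down standalone sketch of the
grouped fast-path idea (made-up names, no lock-mode bits, no eviction or
fallback handling beyond a return code):

/*
 * Standalone sketch: each relation OID is hashed to one of
 * FP_LOCK_GROUPS_PER_BACKEND groups and the linear search happens only
 * within that 16-slot group, exactly like a 16-way set-associative cache.
 */
#include <stdint.h>
#include <stdio.h>

#define FP_LOCK_GROUPS_PER_BACKEND	64
#define FP_LOCK_SLOTS_PER_GROUP		16

/* the same trivial hash as in the WIP patch */
#define FAST_PATH_LOCK_REL_GROUP(rel) \
	(((uint64_t) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

typedef struct FastPathGroup
{
	uint32_t	relid[FP_LOCK_SLOTS_PER_GROUP]; /* 0 means an empty slot */
	uint64_t	lockbits;						/* per-slot lock modes (unused here) */
} FastPathGroup;

static FastPathGroup groups[FP_LOCK_GROUPS_PER_BACKEND];

/* try to record a fast-path lock on "relid" in its group */
static int
fast_path_insert(uint32_t relid)
{
	FastPathGroup *g = &groups[FAST_PATH_LOCK_REL_GROUP(relid)];

	for (int i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
	{
		if (g->relid[i] == 0)
		{
			g->relid[i] = relid;
			return 1;
		}
	}
	return 0;			/* group is full: would go to the shared lock table */
}

int
main(void)
{
	int			overflows = 0;

	/* pretend we lock 1000 partitions with consecutive OIDs */
	for (uint32_t rel = 16384; rel < 16384 + 1000; rel++)
		overflows += !fast_path_insert(rel);

	printf("%d of 1000 relations did not fit their group\n", overflows);
	return 0;
}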
Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.
That's an excellent question. I don't know.
I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
even with a modest number of partitions. When I was thinking about using
a higher value, my main concern was that it'd made the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.
But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK it's shared by threads running on
the CPU.
I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 25, 2024 at 6:04 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Yeah, definitely needs comment explaining this.
I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.
Right. My guess is that if we try too hard to make the hash function
good, there will be a performance hit. Unlike, say, strings that come
from the user, we have no reason to believe that relfilenumbers will
have any particular structure or pattern to them, so a low-quality,
fast function seems like a good trade-off to me. But I'm *far* from a
hashing expert, so I'm prepared for someone who is to tell me that I'm
full of garbage.
I don't think I've heard the term "partially-associative cache" before
That's an excellent question. I don't know.
I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
even with a modest number of partitions. When I was thinking about using
a higher value, my main concern was that it'd made the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.
But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK it's shared by threads running on
the CPU.
I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.
Sounds good.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 6/25/24 12:04, Tomas Vondra wrote:
On 6/24/24 17:05, Robert Haas wrote:
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
LWLock table has 16 partitions by default - it's quite possible that on
machine with many cores and/or many partitions, we can easily hit this.
So I bumped this 4x to 64 partitions.
I think this probably makes sense. I'm a little worried that we're
just kicking the can down the road here where maybe we should be
solving the problem in some more fundamental way, and I'm also a
little worried that we might be reducing single-core performance. But
it's probably fine.
Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not
affect that. Or if it does, it's likely a reasonable trade off as it
reduces the risk of lock contention.
That being said, I don't recall benchmarking this patch in isolation,
only with the other patches. Maybe I should do that ...
What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
have now. Each OID is mapped to exactly one of these parts as if in a
hash table, and in each of those 16-element parts we do exactly the same
thing we do now (linear search, removal, etc.). This works great, the
locality is great, etc. The one disadvantage is this makes PGPROC
larger, but I did a lot of benchmarks and I haven't seen any regression
that I could attribute to this. (More about this later.)
I think this is a reasonable approach. Some comments:
- FastPathLocalUseInitialized seems unnecessary to me; the contents of
an uninitialized local variable are undefined, but an uninitialized
global variable always starts out zeroed.
OK. I didn't realize global variables start at zero.
I haven't fixed this yet, but it's pretty clear the "init" is not really
needed, because it did the memset() wrong:
memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseInitialized));
This only resets one byte of the array, yet it still worked correctly.
- You need comments in various places, including here, where someone
is certain to have questions about the algorithm and choice of
constants:
+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
% FP_LOCK_GROUPS_PER_BACKEND)
Yeah, definitely needs comment explaining this.
I admit those numbers are pretty arbitrary primes, to implement a
trivial hash function. That was good enough for a PoC patch, but maybe
for a "proper" version this should use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.
When I originally coded up the fast-path locking stuff, I supposed
that we couldn't make the number of slots too big because the
algorithm requires a linear search of the whole array. But with this
one trick (a partially-associative cache), there's no real reason that
I can think of why you can't make the number of slots almost
arbitrarily large. At some point you're going to use too much memory,
and probably before that point you're going to make the cache big
enough that it doesn't fit in the CPU cache of an individual core, at
which point possibly it will stop working as well. But honestly ... I
don't quite see why this approach couldn't be scaled quite far.
I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me how
set-associative caches work (e.g. with cachelines in CPU caches). It's a
16-way associative cache, assigning each entry into one of 16 slots.
I must have been reading some papers in this area shortly before the PoC
patch, and the idea came from there, probably. Which is good, because it
means it's a well-understood and widely-used approach.
Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
of 64 to say 65536, would that still perform well? I'm not saying we
should do that, because that's probably a silly amount of memory to
use for this, but the point is that when you start to have enough
partitions that you run out of lock slots, performance is going to
degrade, so you can imagine wanting to try to have enough lock groups
to make that unlikely. Which leads me to wonder if there's any
particular number of lock groups that is clearly "too many" or whether
it's just about how much memory we want to use.
That's an excellent question. I don't know.
I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
even with a modest number of partitions. When I was thinking about using
a higher value, my main concern was that it'd made the PGPROC entry
larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
felt like quite a bit.
But maybe it's fine and we could make it much larger - L3 caches tend to
be many MBs these days, although AFAIK it's shared by threads running on
the CPU.
I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.
I finally got to do those experiments. The scripts and results (both raw
and summarized) are too big to attach everything here, available at
https://github.com/tvondra/scalability-tests
The initial patch used 64 (which means 1024 fast-path slots), I ran the
tests with 0, 1, 8, 32, 128, 512, 1024 (so up to 16k locks). I thought
about testing with ~64k groups, but I didn't go with the extreme value
because I don't quite see the point.
It would only matter for cases with a truly extreme number of partitions
(64k groups is ~1M fast-path slots), and just creating enough partitions
would take a lot of time. Moreover, with that many partitions we seem
to have various other bottlenecks, and improving this does not make it
really practical. And it's so slow the benchmark results are somewhat
bogus too.
Because if we achieve 50 tps with 1000 partitions, does it really matter
whether a patch changes that to 25 or 100 tps? I doubt that, especially if going
to 100 partitions gives you 2000 tps. Now imagine you have 10k or 100k
partitions - how fast is that going to be?
So I think stopping at 1024 groups is sensible, and if there are some
inefficiencies / costs, I'd expect those to gradually show up even at
those lower sizes.
But if you look at results, for example from the "join" test:
https://github.com/tvondra/scalability-tests/blob/main/join.pdf
there's no such negative effect. The table shows results for different
combinations of parameters, with the first group of columns being on
regular glibc, the second one has glibc tuning (see [1] for details).
And the values are for different number of fast-path groups (0 means the
patch was not applied).
And the color scales show the impact of increasing the number of
groups. So for example when a column for "32 groups" says 150%, it means
going from 8 to 32 groups improved throughput to 1.5x. As usual, green
is "good" and red is "bad".
But if you look at the tables, there's very little change - most of the
values are close to 100%. This might seem a bit strange, considering the
promise of these patches is to improve throughput, and "no change" is an
absence of that. But that's because the charts illustrate the effect of
changing the group count with other parameters fixed. It never compares
runs with/without glibc tuning, and that's an important part of the
improvement. Doing the pivot table a bit differently would still show a
substantial 2-3x improvement.
There's a fair amount of noise - especially for the rpi5 machines (not
the right hw for sensitive benchmarks), but also on some i5/xeon runs. I
attribute this to only doing one short run (10s) for each combination
of parameters. I'll do more runs next time.
Anyway, I think these results show a couple things:
1) There's no systemic negative effect of increasing the number of
groups. We could go with 32k or 64k groups, and it doesn't seem like
there would be a problem.
2) But there's not much point in doing that, because we run into various
other bottlenecks well before having that many locks. By the results, it
doesn't seem going beyond 32 or 64 groups would give us much.
3) The memory allocation caching (be it the mempool patch, or the glibc
tuning like in this test round) is a crucial piece for this. Not doing
that means some tests get no improvement at all, or a much smaller one.
4) The increase of NUM_LOCK_PARTITIONS has very limited effect, or
perhaps even no effect at all.
Based on this, my plan is to polish the patch adding fast-path groups,
with either 32 or 64 groups, which seems to be reasonable values. Then
in the future, if/when the other bottlenecks get addressed, we can
rethink and increase this.
This however reminds me that all those machines are pretty small. Which
is good for showing it doesn't break existing/smaller systems, but the
initial goal of the patch was to improve behavior on big boxes. I don't
have access to the original box at the moment, so if someone could
provide access to one of those big epyc/xeon machines with 100+ cores
for a couple days, that would be helpful.
That being said, I think it's pretty clear how serious the issue with
memory allocation overhead can be, especially in cases when the existing
memory context caching is ineffective (like for the btree palloc). I'm
not sure what to do about that. The mempool patch shared in this thread
does the trick, but it's fairly complex/invasive. I still like it, but maybe
doing something with the glibc tuning would be enough - it's not as
effective, but 80% of the improvement is better than no improvement.
regards
[1]: /messages/by-id/0da51f67-c88b-497e-bb89-d5139309eb9c@enterprisedb.com
--
Tomas Vondra
Hi,
While discussing this patch with Robert off-list, one of the questions
he asked was whether there's some size threshold after which it starts to
have a negative impact. I didn't have a good answer to that - I did have
some intuition (that making it too large would not hurt), but I haven't
done any tests with "extreme" sizes of the fast-path structs.
So I ran some more tests, with up to 4096 "groups" (which means 64k
fast-path slots). And no matter how I slice the results, there's no
clear regression point beyond which the performance would start to
decline (even just slowly). It's the same for all benchmarks, client
counts, query mode, and so on.
I'm attaching two PDFs with results for the "join" benchmark I described
earlier (query with a join on many partitions) from EPYC 7763 (64/128c).
The first one is with "raw" data (throughput = tps), the second one is
relative throughput to the first column (which is pretty much current
master, with no optimizations applied).
The complete results including some nice .odp spreadsheets and scripts
are available here:
https://github.com/tvondra/pg-lock-scalability-results
There's often a very clear point where the performance significantly
improves - this is usually when all the relation locks start to fit into
the fast-path array. With 1000 relations that's ~64 groups, and so on.
But there's no point where it would start declining.
My explanation is that the PGPROC (where the fast-path array is) is so
large already (close to 1kB), that making it larger does not really cause
any additional cache misses, etc. And if it does, it's far outweighed
by the cost of accessing (or not having to access) the shared lock table.
So I don't think there's any point at which we'd start to regress,
at least not because of cache misses, CPU etc. It stops improving, but
that's just a sign that we've hit some other bottleneck - that's not a
fault of this patch.
But that's not the whole story, of course. Because if there were no
issues, why not just make the fast-path array insanely large? In
another off-list discussion Andres asked me about the memory this would
need, and after looking at the numbers I think that's a strong argument
to keep the numbers reasonable.
I did a quick experiment to see the per-connection memory requirements,
and how it would be affected by this patch. I simply logged the amount
of shared memory from CalculateShmemSize(), started the server with 100 and
1000 connections, and did a bit of math to calculate how much memory we
need "per connection".
For master and different numbers of fast-path groups I got this:
master 64 1024 32765
---------------------------------
47668 52201 121324 2406892
So by default we need ~48kB / connection, with 64 groups we need ~52kB
(which makes sense because that's 1024 x 4B slots), and then with 1024
groups we get to ~120kB, and with 32k groups to ~2.5MB.
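As a quick sanity check against the ~72B per group mentioned earlier (16 x 4B
OIDs plus an 8B bitmap), the per-connection increase over master is roughly
just the fast-path array itself:
64 groups: 64 x 72B = ~4.6kB (measured 52201 - 47668 = 4533)
1024 groups: 1024 x 72B = ~72kB (measured 121324 - 47668 = 73656)
32765 groups: 32765 x 72B = ~2.3MB (measured 2406892 - 47668 = 2359224)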
I guess those higher values seem a bit insane - we don't want to just
increase the per-connection memory requirements 50x for everyone, right?
But what about the people who actually want this many locks? Let's bump
the max_locks_per_transaction from 64 to 1024, and we get this:
master 64 1024 32765
-------------------------------------
419367 423909 493022 2778590
Suddenly, the differences are much smaller, especially for the 64
groups, which is roughly the same number of fast-path slots as the max
locks per transaction. That shrunk to a ~1% difference. But even for 1024
groups it's now just ~20%, which I think is well worth the benefits.
And likely something the system should have available - with 1000
connections that's ~80MB. And if you run with 1000 connections, 80MB
should be rounding error, IMO.
Of course, it does not seem great to force everyone to pay this price,
even if they don't need that many locks (and so there's no benefit). So
how would we improve that?
I don't think that's possible with a hard-coded size of the array - that
allocates the memory for everyone. We'd need to make it variable-length,
and while doing those benchmarks I think we actually already have a GUC
for that - max_locks_per_transaction tells us exactly what we need to
know, right? I mean, if I know I'll need ~1000 locks, why not make
the fast-path array large enough for that?
Of course, the consequence of this would be making PGPROC variable
length, or having to point to memory allocated separately (I prefer
the latter option, I think). I haven't done any experiments, but it
seems fairly doable - of course, not sure if it might be more expensive
compared to compile-time constants.
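Just to illustrate what I mean (a purely hypothetical sketch with made-up
names, not part of any of the patches here), the group count could be derived
from the GUC at startup along these lines:

/*
 * Hypothetical sizing sketch: derive the number of 16-slot fast-path groups
 * from max_locks_per_transaction, so a setup expecting ~1000 locks gets
 * ~64 groups instead of the fixed 16 slots we have today.
 */
#include <stdio.h>
#include <stdint.h>

#define FP_LOCK_SLOTS_PER_GROUP 16

static uint32_t
FastPathLockGroups(int max_locks_per_transaction)
{
	/* one slot per expected lock, rounded up to whole groups */
	uint32_t	ngroups = (max_locks_per_transaction + FP_LOCK_SLOTS_PER_GROUP - 1)
		/ FP_LOCK_SLOTS_PER_GROUP;

	/* never drop below the single 16-slot group we have today */
	return (ngroups > 0) ? ngroups : 1;
}

int
main(void)
{
	printf("max_locks_per_transaction=64   -> %u groups\n", FastPathLockGroups(64));
	printf("max_locks_per_transaction=1024 -> %u groups\n", FastPathLockGroups(1024));
	return 0;
}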
At this point I think it's fairly clear we have significant bottlenecks
when having to lock many relations - and that won't go away, thanks to
partitioning etc. We're already fixing various other bottlenecks for
these workloads, which will just increase pressure on locking.
Fundamentally, I think we'll need to either evolve the fast-path system
to handle more relations (the limit of 16 was always rather low),
or invent some entirely new thing that does something radical (say,
locking a "group" of relations instead of locking them one by one).
This patch is doing the first thing, and IMHO the increased memory
consumption is a sensible / acceptable trade-off. I'm not aware of any
proposal for the second approach, and I don't have any concrete idea how
it might work.
regards
--
Tomas Vondra
Attachments:
join-epyc-data.pdf (application/pdf)
O�����?�'��:�����1��X�4�J�%/��5��,� ] ���)�Pj��C��
SQ���K���SAG���T�H��bU�
�*�X\��b?*��
�Fk
n��Vy4qB
���������p�u� +/,N��+�s��ku`Gf��v�m��*���h
ZA��Pk������VTQ0Q���`u~��i��[1�:��j��\��K����]I��_�Ke+�e�^�������+[��`�#1~e3�a��
��ES�e��)����|�"c�@l�(�Q,���`,����M�����A���d���o0#�dyiM���?n(L*�^7,�V��]zF�Fg'������d)���y��#��M�%@�g�o+�hqFf��&�nH��
���N�������
�E�@�
k�#�\��������m0#<%��{�O����]#D�'�����U��M��LHzvI�5#��8�A#�����a��������_�&��8��M���kb�v���%@a����&�b�A|v��5� �(\�M����uvq����]:�a�]��bhvy��4?8t�%hvI������m�.��������&���iB�v�x��NF�Eqme���/P���l��A��|\��o�.� Mg2�n�h��&������������]���1�.��Oxv����o�.}���l�.-Z8������l����#�~������:�k��]����n�.ZMN1�n�w��M�
,�����7�0D��y�8#���R{�6a7��I��&�r�����Z(G~vy�?.�������a#��q�T�*�2v����/*�
;�����4y��Rr[��P�EK��P���.5l(5�YCX�?4HM�q(i���:��CA�w^(NF�*����4xTz�������C�����p�{���*��Z�N��j�K�g���Jc1.������g ��0�O��?U�#�?�z������v��Cq8����I]��u�"������^��e����?UP��
���$�?�q��E���P�����@Q��F����C��?����x�������TX�����jK�?�U������������p��w�?��8�?�U �z��t�r�E���o�����h�����C��.���F�^�^(������CAE����C���N�?4��J��PD"�P}���:P�)����`F�J���Cq]4�m�?���;���i��JD�L_!�^2�Cb��4��S����-hX-���R2C�8��2d��0y(���Q���ly*������G[�5���T�U� B����9��*6m�}���,C��-��A9��b+�v���[A�-�la8�|h�lu+��o���#[�6����-��v��-���l!U��i����[1�������.����v�1b�:�o���F��b�(L5P1�C�li8�//�{m�V���O�
�r+(���-��T�Es�u��r+��
��Q����lA���i�F�����#[���G�<�����l�L�|
��p��O���-��>G��E[3G����y����-�(���wN,���mp��'=G�xi\y"dKf�V�rd���B���-^&UW�kC�8#3�C��!����s}:�����l�8#�l��F
c�^#��|�l���ZR������+�����8UG1����3w�/}]8�������x����K����\_%�����Z<u��� N�m���'o�q}A.�������y��\l\_yH����8��Oi\_:ps��W]������n\_��U������n�/��]����&��J�e�$r�\_�����4tz�;���J����C����e#%����%q���\_�2��������~���
�r��%q-�\_�2���v��5�/}���l�/)
����Ge��7��un�/�w-n\_�De=Q\�w���� �z��07��+��5�/^���h�������8q�is}��j�q}����yy�5�3��g[���h�c��R+��q}{Z��3�G��3�+����6��6�
-��(Z�zso���\� ��7��0]_�R���Rf+�/6����}�6<7��Nb�����_�z$;1�f����PP���u�P��j�d�/�BQ�{}����V��,����c��+�B
����N�Z���Pe���b+qk��*:�;x���b��������5
�����`+�t����ge�m� �h�>Pj�*?�@qB+1���cAy��*U�������TlNT�
��~A}�`~:�}z}��!���l�<�>P4�Z��gd�a�^�����C(��y3����G�&K��T,�F��KW�t(�"S���
32Nu���@A�J�%���@�#�����@��$�j>P:;3�bx}�x�i�����3��x}��S3`��
���u�������w*q����$���>PlE'O��@��W5C�����}mR5�waY!;�,��aY)���$��XV��H��=�=�XVkXL_�
������R3���5,,����FXV��<] �����`Pa���+5%2,*j�%�eIE�Cf���L:��
����R�FX����9�� ����e����{ �����b/�����(���XU ���,�x������+�*��<����,%�`
��8��,R�a�`(��X�U`Q���K�:���wF����aYP�J�X���M���E+����`�V�?�5�X�;2�.�aY��3������,����F���b,�cY4��b�����8C,��/�eIE]��#,�����
�����aY��}EFX�;B�F�������X����Dt�e�V�G�e�b�B�#,M�v�8�%+�%�T�����Q=�`8�b�cYN����#����F;U�N����-�� ��q����J}�Q_��v:45���0��^�}|*<z*��
��gj5����u�^�/K\�K:���������3@�(�Hy�9J��]��$)j���e(�:j=��>�t���N=SnZ�u��N���1?oIGaa���-���.���y�)B�K�8�}��c%�����e����������z�,]��H��}������������Ra��}�[�0��GO��I�����i�����{��aJap\5���!���!�Q~R@�0��>o�,������������������C��q�#�=>YH����$x`a
.����ly^a�I�������/ ���?o�,�=����p�0�n��
$���z^wQ���y�%aX$gy^tYr�>o��g�q����>�''���O6m���y�������?��������R9k���k~�����Q��",��b>HL���'����^6�����OF�o���4�I��C�.���U{�+���" i�)[�}% =���c]�����*s��� �� ��tvV�<�z�t�r���H����������f!i��?j����������Pb���[+������)_� o�������O;�tR���k0�^G��`?^�#(��Eh���i�p�+��c��F[NG�PV Z�������U@^#�~��fVx������U�]8�T�@��.�%5�����inrK���B<p/=�G� b�z=��"�L-8r*�.+�H����
)x�8}���d��5�}�r�V�u
d���J�E
�R�����|��E��8��x�
�!�p��t�@�VG;�8�Y�\���5_V��LW�1j�U+�E4;h�G���t:r����-h������
��|
�JN��]�����t��,hl�z*��`�=<S�<��m A����a��FM�?���F
C�8[�k� ���=R����M�H3�\�����W��_W�Y{�&��T�����$��0�
5e�3�;8���H�w���+��:�� S�R��~��;����H���a�j������b�wF|�4��l�$q��LP@>�f�.���$���}�0B��>zp ���RZ�SO�BH!��
*%�y!&(���BLXv����]����7�F�/��'��%.aH4���v!)pxI-���/u&�1��jW����:��
V_/a'
�sxI��|�H�����>^�A�b//�*D�?��� �_x)pA�^�f*T���%) N�/A� g�/�`�^x�ga�Ip��R*��C�%
^�i���/������d� ��/AA���K��T�����S�%M#�%�F��_x����,
^�: R�/��A���Km\1��(�cL��W��H�RZ"��2�A�����%����9v>��#�>�<��\?��K�4/@Am��O��S�q-��)g����;�N���h/����#����9R;>������w���(����t��DkO%�j�R�g�7.DM4����Fu�4��D������5R����W��G�G��h����F�L��*tB��3���h�S�?=Kj�G���������=�l4$;2���}������W[M�}���H�~dZ{�,$
��+��3���|&:��}$:rO���h9����;]�3����W&�O��j�{�$��+�N�s�Q>�s�2�jQ���?-=��3�z�v~���������K:�g�����K�=G*�3��J�~$z��:�)����v~yI���'D�]����0!��_^���ia�2��zJ��/x��&"�������[��m�����@X��,\O�64����0�7Z��\�T�������>�G�2����}�<=\�#a�����g���J3�=\O�m�:�p=m�l�����X�x�v�CI����!\OO>�p=�U�<i�zz����'�!����`�t�-\����p=)��]�a������.;\�M�'�������>���#a���p=)V����������
/,\@���p���-\�s�G�]���z [�������]w��G[�g�a���p=���p=�P%q����fEZ��aPc��}b�,\��;�������TyV�+�����rw���D�t8��az����L�pL4����pL|�Sp��1-����>�����&Jc0Q��h`LO�v0&]����1!������O���������'���0y���T�J���Ti����h���crO�C�c0`����R{J#�bzOi@�P�0s
��_�bj� Y�P�����,N��2(&�\`7�PL +����,"����3��m���_(��1J��������PL
�����/��v(&��~~���U:u������v����0�bzmka�b0Ua��*�'�S� �iK��
�����S@�e��gNa��s(�,��PLSp���TAjO�bp��� 7��Jd�T�yz���x��l��J�3�~|k�N�y~+c�F��a��o��_��������o���2�����L)�5��[�Q�������o���c#������|>�[��|��n�ZX�u��sU��r3�RC���XD�������mA���A�3v �7peX ���fx`� ��;�(�a�2�����B�A$����8������e��{ `���;�O��f�|3�J�i���5��`@�`�
�����?<��@GMx0�������.P@����A��
����4�!��B�������-v@�nA�$�H��7*�h�x����H�F�; ��4p'Z@ *���5�< 0������lX�y�������4������t���@:�, �Pp��^�K����m�F
�|3��< �<�X@`p�j�����+v@ ����h�`��4"���|���D�i�*����U*v@ �:��#����EP%QW��`�\�=���98���������,#�\�
+G*`�Uq����b�^PV�EP0�OA����GM��Z�c�]�IW��5q.�J��\�&�6Z�ER�T���5q,���t�\d�`��������yYT�U�X�����R�JLD�W��"�X%����.RG:W���tZ��[^w�[����]��s�
*����
�T�C��G��"�C���"/�A�W���|����t�^��+TbC����J�z��HG�C4���V{�O��p�{�E6����k/����#^{12�������X@Q8��H*2����h�g���f�^d�X �x�E�r�y�Ej��ua��,��L5Z�EV��r�X�E���^{12�����"�"cV{��Xvpz�E�f�3u�^.T$�j/��L>{��D���k/�X��r�ew�E��w�zx���Q���^�u��q�XX�EV��T��^�����"^QVQ������+`M������NaD���t��F)�o����g*)y���>��;��JV�n���,+�[2�����������
j��}%��,'�v��k��1�F^���J��d�����u� @%�(�h��Hc�?8���4v���dF�����N��Z�zL��� �bk��n+� ���(�![�+� ���N�$�s% %; ��m������y
k`���c�8�+������g�>��^ �f�f9;���(YY�4���y%*���nN�����<��X���OF��Fk����z<�
h�v��
���'KW�v.|}8zI��K���8R����I$��BF�����<N���J��l��vA��F����{��:X�IZI�)7V����O��)Y�f5:|��PI����d����L���-�w�N���J����0��d��JH�|�r��u�b�Qrj�����u6�Qk�=l�B�: ���#P�* �<��X�.Q�c8��Y����T���i<9�J4&������aL2��{���c��3���"%�x�]���������V2�}�*LG::)Y�V�V��G+�i�5��W�Z��rU���^V����+�p�TK4�+�P�����1��QG����������(@I��f��&M������,���0�b3��$�163�_?:�:0�F!���_��I
:���
���D����-�����z�k��&o63�64�f�} ������(����IA�����y)C@���y �����"#�9�r���c )g3��j663���563���?�lf� �f�3�r8�����f�VbX��f��:�����l���j����q)SN'c3��V`l��P���<:G����H��"�f�Y����f����f�������,HHhlfT@I����G�Rc3:��f��-���8T����l�����Y$�263+��)g3G'��o63���6����K�7�[@
����4`t��qu63�^H�f���[�H���O��]d�����\\f�`���Q�����5�\Lf� �;������`3�%�\<f�`�]�i�R�����k�,���� )zv���$�d�=�9�R�����,5<��(��`F
D�s����
���T�u;���4�h���L���������A�m�����\�m5�2�b� yn��>��3d���Z��2J�h�e8�2f9q����Q+��f-��P�ivV����Y��l���)���*&Rr�2h��J�`#,�q��=}�s�2�CY/������&3�
:[lv�<�NV�Y�]�����2�f�I�Te�����I$�����f�8Qv�r�����S����R�i�)��G�^7�2����I��7'��������� �(���,��P&��)���v���K�l~2�$�n�'�3j�c���t�<{K"#'� S�Z���M�#�R`��d��3�G�L�����nb2�@�g_��4^2��y�*����s��3F���J�A�,��9�<����@J���uG�I��d������dT2!dJ2R2wg,��D����t�~���`#%sK:�9)Z�tdqE%|��d�N��KFJ����"a�&%sK�u�I��dU
=���I�t�NJ���K���������rv���m�����;�'%SK�2�)9����"�&%���-�C3�����RE���yL:�&���c�fa9&FJ����j:��&%cwH��I��i_(����YI���NJf%eEJ��I�8�;�#%���H�p#%SK�
��\�����J�
���JV@7$���d�� �I�AK�D"�MJ,��<��j���dm�I�8���RB�MJ�����NJ��L1NJV��!��&%Gw6rU9)9��d��NJ�Z2 R�I�8&�e�NJ����;FJ�jMwsR2*�G��68)�-`^�}$v��d�� ������I���MJ��X�����tg�/]�u���+(e�2Rrt?)@BrRrt?����d���b���F:��}"rKI���T$.W��wj]�/O
�X.�?�[��!��-���1�rOw%��t$��R),P���Z��+#nLG�3>aD�,�������,������-��NG�1h�����
(Q���qCo:2�� ���#c���tdV �K��M�Nh���8��5:2�D�t:r� N��+�����-�$kNGQl�#��j8���
����c�xx��y ���qi3:��:��q�4,��u�U�wM����JX�{���1Y���A�+DZ��;-�C���Feq��N@�NA�/5rt
�p�M@�H�n���i�e��J�ed>�l���A�N� ���x�)q2)��N@�3H'�prd
C2u���L)C���]���F@��V`���J�kdn�~rr4�:Tv�a0����r����)[:�a�1YH>h|=� �\d�m#�k'�q�U��MJ4�MB&
#s��L�
{�e�6Y+�$fpiYN�$P\�1�I��K����J�j���r�E�G��IE����FZA*���cQ �f�$���d��k#I�f�d���
{nJ2���+�9����FJ�u�(����y]�_I�L�������A����I�FL�����Ht3�������65T���>���L��>p99�8%
pv2������MOf?(����l�x/��L**|�PF+�P��MQF���0���c��n�$e2�3A��,e��
��;M����@�����J�a#*����'�n�2o3*��TeV1��q����u��*��*V�
����
4�*���`_��_�G�����r�2�T�m��e����s�Q�lIHZ�I-��w�2�E�PF[�I��0�2ud�%���e�H�R��ks*���\b.od����#9��!��F�������m�N9�
���6h�l�
����l�
�.Gr#[��[�����\�[!�����`���~���z,ud�T�� `���Tl "[P����<�^d4(���V�XY��K�����b�P�)G��bE�KV�#[�\/��VT����&�g�,�"[���z�E�4��P�"[=A%���VL�����I]Y���j�&�b����P2�E�0#�� ��V��Gi��#��i���l������U����l���>E�Je�b��WG���18vd,�_��E�����B+8���lA�{^dK�h���������"[�b3$����>;�-G��#[9�m�h�l�����V�b�$sd�:�_&D�����^dK;uU� �-.�A1��l��]�SR�����"[�.6]]C����b���#]_T�������W�� ���z��86��X ��1�uW�/k�a���������i�I;k�Z��b�����/�����H�BiJG�F�N�^$��S�u60{%����okd��q���f�z
�a��g_V�FBd+9���#M�$kM5��!��#��M��c�~c����J��-�\[gN��g��7����J��i\�����������ISX�Q�tV��p*��[[s�ILz���&�m���o�7�;]��vH���_2���m����K� �o�/Jk����hA%@3��G�������k0Ux��������
�+hFB����
g"