LWLock optimization for multicore Power machines
Hi everybody!
During the FOSDEM/PGDay 2017 developer meeting I said that I have a special
assembly optimization for multicore Power machines. From the answers of
other hackers I realized the following.
1. There are some big Power machines running PostgreSQL in production. Not
as many as on Intel, but a fair number.
2. The community could be interested in a special assembly optimization for
Power machines despite the cost of maintaining it.
Power processors use a specific implementation of atomic operations, which
is a kind of optimistic locking. The 'lwarx' instruction reserves a memory
location, but that reservation can be broken before the matching 'stwcx.',
and then we have to retry. So, for instance, a CAS operation on a Power
processor is a loop, and a loop of CAS operations is a two-level nested
loop. Benchmarks showed that this becomes a real problem for
LWLockAttemptLock(). However, one can actually put arbitrary logic between
'lwarx' and 'stwcx.' and make it a single loop. The downside is that this
logic has to be implemented in assembly. See [1] for experiment details.
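For illustration, here is a minimal sketch (using GCC's __atomic builtins,
not the actual PostgreSQL code) of such a CAS loop; on Power the builtin
CAS itself compiles to an inner lwarx/stwcx. retry loop, so the while loop
below becomes the outer level of a two-level nested loop:

#include <stdbool.h>
#include <stdint.h>

/* Add 'add' to *ptr only if the bits in 'mask' are clear; return old value. */
static uint32_t
cas_loop_add_if_clear(volatile uint32_t *ptr, uint32_t mask, uint32_t add)
{
	uint32_t	old = *ptr;

	while (true)				/* outer loop: the algorithm's retry */
	{
		uint32_t	desired = ((old & mask) == 0) ? old + add : old;

		/* on Power, the inner lwarx/stwcx. retry loop hides in here */
		if (__atomic_compare_exchange_n(ptr, &old, desired, false,
										__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
			return old;
	}
}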
Results in [1] contain a lot of junk which isn't relevant anymore. This is
why I drew a separate graph.
power8-lwlock-asm-ro.png – results of a read-only pgbench test on an IBM
E880 which has 32 physical cores and 256 virtual threads via SMT. The
curves have the following meaning.
* 9.5: unpatched PostgreSQL 9.5
* pinunpin-cas: PostgreSQL 9.5 + earlier version of 48354581
* pinunpin-lwlock-asm: PostgreSQL 9.5 + earlier version of 48354581 +
LWLock implementation in assembly.
lwlock-power-1.patch – the patch with the assembly implementation of LWLock
which I used at that time, rebased onto current master.
Using assembly in lwlock.c looks rough. This is why I refactored it by
introducing a new atomic operation, pg_atomic_fetch_mask_add_u32 (see
lwlock-power-2.patch). It checks that all masked bits are clear and then
adds to the variable. This atomic has a special assembly implementation for
Power, and a generic implementation for other platforms using a loop of
CAS. We will probably have implementations for other architectures in the
future. This level of abstraction is the best I managed to invent.
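To make the contract concrete, here is a sketch of how both lock modes
reduce to a single call of the new primitive (it mirrors the
LWLockAttemptLock() change in lwlock-power-2.patch below; the prototype and
the LW_* constants are the ones from the patch, while try_acquire is a
hypothetical condensation that, unlike LWLockAttemptLock, returns true when
the lock was acquired):

extern uint32 pg_atomic_fetch_mask_add_u32(volatile pg_atomic_uint32 *ptr,
										   uint32 mask_, uint32 add_);

static bool
try_acquire(LWLock *lock, LWLockMode mode)
{
	/* exclusive: all lock bits must be clear; shared: only the exclusive bit */
	uint32		mask = (mode == LW_EXCLUSIVE) ? LW_LOCK_MASK : LW_VAL_EXCLUSIVE;
	uint32		add = (mode == LW_EXCLUSIVE) ? LW_VAL_EXCLUSIVE : LW_VAL_SHARED;

	/* one atomic checks the mask and adds; the old value tells the outcome */
	return (pg_atomic_fetch_mask_add_u32(&lock->state, mask, add) & mask) == 0;
}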
Unfortunately, I have no big enough Power machine at hand to reproduce those
results. Actually, I have no Power machine at hand at all. So,
lwlock-power-2.patch was written "blindly". I would very much appreciate it
if someone could help me with testing and benchmarking.
1. /messages/by-id/CAPpHfdsogj38HTDhNMLE56uJy9N8-=gYa2nNuWbPujGp2n1ffQ@mail.gmail.com
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
lwlock-power-1.patch (application/octet-stream)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
new file mode 100644
index c196bb8..548518c
*** a/src/backend/storage/lmgr/lwlock.c
--- b/src/backend/storage/lmgr/lwlock.c
*************** GetLWLockIdentifier(uint32 classId, uint
*** 715,720 ****
--- 715,770 ----
return LWLockTrancheArray[eventId];
}
+ #if (defined(__GNUC__) || defined(__INTEL_COMPILER)) && (defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__))
+
+ /*
+ * Special optimization for PowerPC processors: put the logic dealing with
+ * the LWLock state between the lwarx/stwcx. operations.
+ */
+ static bool
+ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
+ {
+ uint32 mask, increment;
+ bool result;
+
+ AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+
+ if (mode == LW_EXCLUSIVE)
+ {
+ mask = LW_LOCK_MASK;
+ increment = LW_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ mask = LW_VAL_EXCLUSIVE;
+ increment = LW_VAL_SHARED;
+ }
+
+ __asm__ __volatile__(
+ "0: lwarx 3,0,%4 \n"
+ " and 4,3,%2 \n"
+ " cmpwi 4,0 \n"
+ " bne- 1f \n"
+ " add 3,3,%3 \n"
+ " stwcx. 3,0,%4 \n"
+ " bne- 0b \n"
+ " li %0,0 \n"
+ " b 2f \n"
+ "1: li %0,1 \n"
+ #ifdef USE_PPC_LWSYNC
+ "2: lwsync \n"
+ #else
+ "2: isync \n"
+ #endif
+ : "=&r"(result), "+m"(lock->state)
+ : "r"(mask), "r"(increment), "r"(&lock->state)
+ : "memory", "cc", "r3", "r4");
+
+ return result;
+ }
+
+ #else
+
/*
* Internal function that tries to atomically acquire the lwlock in the passed
* in mode.
*************** LWLockAttemptLock(LWLock *lock, LWLockMo
*** 787,792 ****
--- 837,844 ----
pg_unreachable();
}
+ #endif
+
/*
* Lock the LWLock's wait list against concurrent activity.
*
lwlock-power-2.patchapplication/octet-stream; name=lwlock-power-2.patchDownload
diff --git a/src/backend/port/atomics.c b/src/backend/port/atomics.c
new file mode 100644
index 86b5308..533f252
*** a/src/backend/port/atomics.c
--- b/src/backend/port/atomics.c
*************** pg_atomic_fetch_add_u32_impl(volatile pg
*** 158,160 ****
--- 158,243 ----
}
#endif /* PG_HAVE_ATOMIC_U32_SIMULATION */
+
+ #if (defined(__GNUC__) || defined(__INTEL_COMPILER)) && (defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__))
+
+ /*
+ * Optimized implementation for Power processors. Atomic operations on Power
+ * processors are implemented using optimistic locking. The 'lwarx'
+ * instruction reserves a memory location, but that reservation can be broken
+ * by a conflicting store before the matching 'stwcx.', and then we have to
+ * retry. Thus, each CAS operation is a loop, and a loop of CAS operations is
+ * a two-level nested loop. Experiments on multicore Power machines show that
+ * we can get a huge benefit from doing this operation in a single loop in
+ * assembly.
+ */
+ uint32
+ pg_atomic_fetch_mask_add_u32(volatile pg_atomic_uint32 *ptr,
+ uint32 mask, uint32 increment)
+ {
+ uint32 result;
+
+ __asm__ __volatile__(
+ "0: lwarx %0,0,%4 \n" /* read *ptr and reserve index */
+ " and 3,%0,%2 \n" /* calculate '*ptr & mask" */
+ " cmpwi 3,0 \n" /* compare '*ptr & mark' vs 0 */
+ " bne- 1f \n" /* exit on '*ptr & mark != 0' */
+ " add 3,%0,%3 \n" /* calculate '*ptr + increment' */
+ " stwcx. 3,0,%4 \n" /* try to store '*ptr + increment' into *ptr */
+ " bne- 0b \n" /* retry if index reservation is broken */
+ #ifdef USE_PPC_LWSYNC
+ "1: lwsync \n"
+ #else
+ "1: isync \n"
+ #endif
+ : "=&r"(result), "+m"(*ptr)
+ : "r"(mask), "r"(increment), "r"(ptr)
+ : "memory", "cc", "r3");
+ return result;
+ }
+
+ #else
+
+ /*
+ * Generic implementation via loop of compare & exchange.
+ */
+ uint32
+ pg_atomic_fetch_mask_add_u32(volatile pg_atomic_uint32 *ptr,
+ uint32 mask_, uint32 add_)
+ {
+ uint32 old_value;
+
+ /*
+ * Read once outside the loop, later iterations will get the newer value
+ * via compare & exchange.
+ */
+ old_value = pg_atomic_read_u32(ptr);
+
+ /* loop until we've determined whether we could make an increment or not */
+ while (true)
+ {
+ uint32 desired_value;
+ bool free;
+
+ desired_value = old_value;
+ free = (old_value & mask_) == 0;
+ if (free)
+ desired_value += add_;
+
+ /*
+ * Attempt to swap in the value we are expecting. If we didn't see
+ * masked bits to be clear, that's just the old value. If we saw them
+ * as clear, we'll attempt to make an increment. The reason that we
+ * always swap in the value is that this doubles as a memory barrier.
+ * We could try to be smarter and only swap in values if we saw the
+ * masked bits as clear, but benchmarks haven't shown it as beneficial
+ * so far.
+ *
+ * Retry if the value changed since we last looked at it.
+ */
+ if (pg_atomic_compare_exchange_u32(ptr, &old_value, desired_value))
+ return old_value;
+ }
+ pg_unreachable();
+ }
+
+ #endif
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
new file mode 100644
index c196bb8..ec3bbc3
*** a/src/backend/storage/lmgr/lwlock.c
--- b/src/backend/storage/lmgr/lwlock.c
*************** GetLWLockIdentifier(uint32 classId, uint
*** 727,790 ****
static bool
LWLockAttemptLock(LWLock *lock, LWLockMode mode)
{
! uint32 old_state;
AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
/*
! * Read once outside the loop, later iterations will get the newer value
! * via compare & exchange.
*/
! old_state = pg_atomic_read_u32(&lock->state);
! /* loop until we've determined whether we could acquire the lock or not */
! while (true)
{
! uint32 desired_state;
! bool lock_free;
!
! desired_state = old_state;
!
! if (mode == LW_EXCLUSIVE)
! {
! lock_free = (old_state & LW_LOCK_MASK) == 0;
! if (lock_free)
! desired_state += LW_VAL_EXCLUSIVE;
! }
! else
! {
! lock_free = (old_state & LW_VAL_EXCLUSIVE) == 0;
! if (lock_free)
! desired_state += LW_VAL_SHARED;
! }
!
! /*
! * Attempt to swap in the state we are expecting. If we didn't see
! * lock to be free, that's just the old value. If we saw it as free,
! * we'll attempt to mark it acquired. The reason that we always swap
! * in the value is that this doubles as a memory barrier. We could try
! * to be smarter and only swap in values if we saw the lock as free,
! * but benchmark haven't shown it as beneficial so far.
! *
! * Retry if the value changed since we last looked at it.
! */
! if (pg_atomic_compare_exchange_u32(&lock->state,
! &old_state, desired_state))
! {
! if (lock_free)
! {
! /* Great! Got the lock. */
#ifdef LOCK_DEBUG
! if (mode == LW_EXCLUSIVE)
! lock->owner = MyProc;
#endif
! return false;
! }
! else
! return true; /* somebody else has the lock */
! }
}
- pg_unreachable();
}
/*
--- 727,772 ----
static bool
LWLockAttemptLock(LWLock *lock, LWLockMode mode)
{
! uint32 old_state,
! mask,
! increment;
AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+ if (mode == LW_EXCLUSIVE)
+ {
+ mask = LW_LOCK_MASK;
+ increment = LW_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ mask = LW_VAL_EXCLUSIVE;
+ increment = LW_VAL_SHARED;
+ }
+
/*
! * Use the 'check mask then add' atomic, which actually does all the
! * useful work for us.
*/
! old_state = pg_atomic_fetch_mask_add_u32(&lock->state, mask, increment);
! /*
! * If the state was free according to the mask, we assume that the
! * operation was successful.
! */
! if ((old_state & mask) == 0)
{
! /* Great! Got the lock. */
#ifdef LOCK_DEBUG
! if (mode == LW_EXCLUSIVE)
! lock->owner = MyProc;
#endif
! return false;
! }
! else
! {
! return true; /* somebody else has the lock */
}
}
/*
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
new file mode 100644
index 2e2ec27..4ec0219
*** a/src/include/port/atomics.h
--- b/src/include/port/atomics.h
*************** pg_atomic_sub_fetch_u32(volatile pg_atom
*** 415,420 ****
--- 415,433 ----
return pg_atomic_sub_fetch_u32_impl(ptr, sub_);
}
+ /*
+ * pg_atomic_fetch_mask_add_u32 - atomically check that the masked bits in
+ * the variable are clear, and if so add to the variable.
+ *
+ * Returns the value of ptr before the atomic operation.
+ *
+ * Full barrier semantics.
+ */
+ extern uint32
+ pg_atomic_fetch_mask_add_u32(volatile pg_atomic_uint32 *ptr,
+ uint32 mask_, uint32 add_);
+
+
/* ----
* The 64 bit operations have the same semantics as their 32bit counterparts
* if they are available. Check the corresponding 32bit function for
On Fri, Feb 3, 2017 at 8:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Unfortunately, I have no big enough Power machine at hand to reproduce
those results. Actually, I have no Power machine at hand at all. So,
lwlock-power-2.patch was written "blindly". I would very much appreciate it
if someone could help me with testing and benchmarking.
UPD: It appears that Postgres Pro has access to a big Power machine now.
So, I can do the testing/benchmarking myself.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, 2017-02-03 at 20:11 +0300, Alexander Korotkov wrote:
On Fri, Feb 3, 2017 at 8:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Unfortunately, I have no big enough Power machine at hand to reproduce
those results. Actually, I have no Power machine at hand at all. So,
lwlock-power-2.patch was written "blindly". I would very much appreciate it
if someone could help me with testing and benchmarking.
UPD: It appears that Postgres Pro has access to a big Power machine now.
So, I can do the testing/benchmarking myself.
Hi Alexander,
We currently also have access to an LPAR on an E850 machine with 4
POWER8 sockets, running Ubuntu 16.04 LTS Server ppc64el. I can do
some tests next week, if you need to verify your findings.
Thanks,
Bernd
On Fri, Feb 3, 2017 at 12:01 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
Hi everybody!
During the FOSDEM/PGDay 2017 developer meeting I said that I have a special
assembly optimization for multicore Power machines. From the answers of
other hackers I realized the following.
1. There are some big Power machines running PostgreSQL in production. Not
as many as on Intel, but a fair number.
2. The community could be interested in a special assembly optimization for
Power machines despite the cost of maintaining it.
Power processors use a specific implementation of atomic operations, which
is a kind of optimistic locking. The 'lwarx' instruction reserves a memory
location, but that reservation can be broken before the matching 'stwcx.',
and then we have to retry. So, for instance, a CAS operation on a Power
processor is a loop, and a loop of CAS operations is a two-level nested
loop. Benchmarks showed that this becomes a real problem for
LWLockAttemptLock(). However, one can actually put arbitrary logic between
'lwarx' and 'stwcx.' and make it a single loop. The downside is that this
logic has to be implemented in assembly. See [1] for experiment details.
Results in [1] contain a lot of junk which isn't relevant anymore. This is
why I drew a separate graph.
power8-lwlock-asm-ro.png – results of a read-only pgbench test on an IBM
E880 which has 32 physical cores and 256 virtual threads via SMT. The
curves have the following meaning.
* 9.5: unpatched PostgreSQL 9.5
* pinunpin-cas: PostgreSQL 9.5 + earlier version of 48354581
* pinunpin-lwlock-asm: PostgreSQL 9.5 + earlier version of 48354581 +
LWLock implementation in assembly.
Cool work. Obviously there's some work to do before we can merge this
-- vetting the abstraction, performance testing -- but it seems pretty
neat.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Feb 3, 2017 at 11:31 PM, Bernd Helmle <mailings@oopsware.de> wrote:
UPD: It appears that Postgres Pro has access to a big Power machine now.
So, I can do the testing/benchmarking myself.
We currently also have access to an LPAR on an E850 machine with 4
POWER8 sockets, running Ubuntu 16.04 LTS Server ppc64el. I can do
some tests next week, if you need to verify your findings.
Very good, thank you!
I tried lwlock-power-2.patch on the multicore Power machine we have in
Postgres Pro.
I realized that using labels in assembly isn't safe. Thus, I removed the
labels and used relative jumps instead (lwlock-power-3.patch).
Unfortunately, I didn't manage to run any reasonable benchmarks. This
machine runs AIX, and there are a lot of problems which prevent PostgreSQL
from showing high TPS. Installing Linux there is not an option either,
because that machine is used for attempts to make Postgres work properly on
AIX. So, benchmarking help is very relevant. I would very much appreciate it.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
lwlock-power-3.patch (application/octet-stream)
diff --git a/src/backend/port/atomics.c b/src/backend/port/atomics.c
new file mode 100644
index 86b5308..55a9910
*** a/src/backend/port/atomics.c
--- b/src/backend/port/atomics.c
*************** pg_atomic_fetch_add_u32_impl(volatile pg
*** 158,160 ****
--- 158,243 ----
}
#endif /* PG_HAVE_ATOMIC_U32_SIMULATION */
+
+ #if (defined(__GNUC__) || defined(__INTEL_COMPILER)) && (defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__))
+
+ /*
+ * Optimized implementation for Power processors. Atomic operations on Power
+ * processors are implemented using optimistic locking. The 'lwarx'
+ * instruction reserves a memory location, but that reservation can be broken
+ * by a conflicting store before the matching 'stwcx.', and then we have to
+ * retry. Thus, each CAS operation is a loop, and a loop of CAS operations is
+ * a two-level nested loop. Experiments on multicore Power machines show that
+ * we can get a huge benefit from doing this operation in a single loop in
+ * assembly.
+ */
+ uint32
+ pg_atomic_fetch_mask_add_u32(volatile pg_atomic_uint32 *ptr,
+ uint32 mask, uint32 increment)
+ {
+ uint32 result;
+
+ __asm__ __volatile__(
+ " lwarx %0,0,%4 \n" /* read *ptr and reserve index */
+ " and 3,%0,%2 \n" /* calculate '*ptr & mask" */
+ " cmpwi 3,0 \n" /* compare '*ptr & mark' vs 0 */
+ " bne- $+16 \n" /* exit on '*ptr & mark != 0' */
+ " add 3,%0,%3 \n" /* calculate '*ptr + increment' */
+ " stwcx. 3,0,%4 \n" /* try to store '*ptr + increment' into *ptr */
+ " bne- $-24 \n" /* retry if index reservation is broken */
+ #ifdef USE_PPC_LWSYNC
+ " lwsync \n"
+ #else
+ " isync \n"
+ #endif
+ : "=&r"(result), "+m"(*ptr)
+ : "r"(mask), "r"(increment), "r"(ptr)
+ : "memory", "cc", "r3");
+ return result;
+ }
+
+ #else
+
+ /*
+ * Generic implementation via loop of compare & exchange.
+ */
+ uint32
+ pg_atomic_fetch_mask_add_u32(volatile pg_atomic_uint32 *ptr,
+ uint32 mask_, uint32 add_)
+ {
+ uint32 old_value;
+
+ /*
+ * Read once outside the loop, later iterations will get the newer value
+ * via compare & exchange.
+ */
+ old_value = pg_atomic_read_u32(ptr);
+
+ /* loop until we've determined whether we could make an increment or not */
+ while (true)
+ {
+ uint32 desired_value;
+ bool free;
+
+ desired_value = old_value;
+ free = (old_value & mask_) == 0;
+ if (free)
+ desired_value += add_;
+
+ /*
+ * Attempt to swap in the value we are expecting. If we didn't see
+ * masked bits to be clear, that's just the old value. If we saw them
+ * as clear, we'll attempt to make an increment. The reason that we
+ * always swap in the value is that this doubles as a memory barrier.
+ * We could try to be smarter and only swap in values if we saw the
+ * masked bits as clear, but benchmarks haven't shown it as beneficial
+ * so far.
+ *
+ * Retry if the value changed since we last looked at it.
+ */
+ if (pg_atomic_compare_exchange_u32(ptr, &old_value, desired_value))
+ return old_value;
+ }
+ pg_unreachable();
+ }
+
+ #endif
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
new file mode 100644
index ab81d94..766e3de
*** a/src/backend/storage/lmgr/lwlock.c
--- b/src/backend/storage/lmgr/lwlock.c
*************** GetLWLockIdentifier(uint32 classId, uint
*** 727,790 ****
static bool
LWLockAttemptLock(LWLock *lock, LWLockMode mode)
{
! uint32 old_state;
AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
/*
! * Read once outside the loop, later iterations will get the newer value
! * via compare & exchange.
*/
! old_state = pg_atomic_read_u32(&lock->state);
! /* loop until we've determined whether we could acquire the lock or not */
! while (true)
{
! uint32 desired_state;
! bool lock_free;
!
! desired_state = old_state;
!
! if (mode == LW_EXCLUSIVE)
! {
! lock_free = (old_state & LW_LOCK_MASK) == 0;
! if (lock_free)
! desired_state += LW_VAL_EXCLUSIVE;
! }
! else
! {
! lock_free = (old_state & LW_VAL_EXCLUSIVE) == 0;
! if (lock_free)
! desired_state += LW_VAL_SHARED;
! }
!
! /*
! * Attempt to swap in the state we are expecting. If we didn't see
! * lock to be free, that's just the old value. If we saw it as free,
! * we'll attempt to mark it acquired. The reason that we always swap
! * in the value is that this doubles as a memory barrier. We could try
! * to be smarter and only swap in values if we saw the lock as free,
! * but benchmark haven't shown it as beneficial so far.
! *
! * Retry if the value changed since we last looked at it.
! */
! if (pg_atomic_compare_exchange_u32(&lock->state,
! &old_state, desired_state))
! {
! if (lock_free)
! {
! /* Great! Got the lock. */
#ifdef LOCK_DEBUG
! if (mode == LW_EXCLUSIVE)
! lock->owner = MyProc;
#endif
! return false;
! }
! else
! return true; /* somebody else has the lock */
! }
}
- pg_unreachable();
}
/*
--- 727,772 ----
static bool
LWLockAttemptLock(LWLock *lock, LWLockMode mode)
{
! uint32 old_state,
! mask,
! increment;
AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
+ if (mode == LW_EXCLUSIVE)
+ {
+ mask = LW_LOCK_MASK;
+ increment = LW_VAL_EXCLUSIVE;
+ }
+ else
+ {
+ mask = LW_VAL_EXCLUSIVE;
+ increment = LW_VAL_SHARED;
+ }
+
/*
! * Use the 'check mask then add' atomic, which actually does all the
! * useful work for us.
*/
! old_state = pg_atomic_fetch_mask_add_u32(&lock->state, mask, increment);
! /*
! * If the state was free according to the mask, we assume that the
! * operation was successful.
! */
! if ((old_state & mask) == 0)
{
! /* Great! Got the lock. */
#ifdef LOCK_DEBUG
! if (mode == LW_EXCLUSIVE)
! lock->owner = MyProc;
#endif
! return false;
! }
! else
! {
! return true; /* somebody else has the lock */
}
}
/*
diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
new file mode 100644
index 2e2ec27..4ec0219
*** a/src/include/port/atomics.h
--- b/src/include/port/atomics.h
*************** pg_atomic_sub_fetch_u32(volatile pg_atom
*** 415,420 ****
--- 415,433 ----
return pg_atomic_sub_fetch_u32_impl(ptr, sub_);
}
+ /*
+ * pg_atomic_fetch_mask_add_u32 - atomically check that the masked bits in
+ * the variable are clear, and if so add to the variable.
+ *
+ * Returns the value of ptr before the atomic operation.
+ *
+ * Full barrier semantics.
+ */
+ extern uint32
+ pg_atomic_fetch_mask_add_u32(volatile pg_atomic_uint32 *ptr,
+ uint32 mask_, uint32 add_);
+
+
/* ----
* The 64 bit operations have the same semantics as their 32bit counterparts
* if they are available. Check the corresponding 32bit function for
On Mon, 06.02.2017 at 16:45 +0300, Alexander Korotkov wrote:
I tried lwlock-power-2.patch on the multicore Power machine we have in
Postgres Pro.
I realized that using labels in assembly isn't safe. Thus, I removed the
labels and used relative jumps instead (lwlock-power-3.patch).
Unfortunately, I didn't manage to run any reasonable benchmarks. This
machine runs AIX, and there are a lot of problems which prevent PostgreSQL
from showing high TPS. Installing Linux there is not an option either,
because that machine is used for attempts to make Postgres work properly on
AIX. So, benchmarking help is very relevant. I would very much appreciate it.
Okay, so here are some results. The bench runs against
current PostgreSQL master, with 24 GByte shared_buffers configured (128
GByte physical RAM), max_wal_size=8GB and effective_cache_size=100GB.
I've just discovered that max_connections was accidentally set to 601;
normally I'd have set it to something near 110 or so...
<master afcb0c97efc58459bcbbe795f42d8b7be414e076>
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 16910687
latency average = 0.177 ms
tps = 563654.968585 (including connections establishing)
tps = 563991.459659 (excluding connections establishing)
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 16523247
latency average = 0.182 ms
tps = 550744.748084 (including connections establishing)
tps = 552069.267389 (excluding connections establishing)
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 16796056
latency average = 0.179 ms
tps = 559830.986738 (including connections establishing)
tps = 560333.682010 (excluding connections establishing)
<lw-lock-power-1.patch applied>
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 14563500
latency average = 0.206 ms
tps = 485420.764515 (including connections establishing)
tps = 485720.606371 (excluding connections establishing)
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 14618457
latency average = 0.205 ms
tps = 487246.817758 (including connections establishing)
tps = 488117.718816 (excluding connections establishing)
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 14522462
latency average = 0.207 ms
tps = 484052.194063 (including connections establishing)
tps = 485434.771590 (excluding connections establishing)
<lw-lock-power-3.patch applied>
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 17946058
latency average = 0.167 ms
tps = 598164.841490 (including connections establishing)
tps = 598582.503248 (excluding connections establishing)
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 17719648
latency average = 0.169 ms
tps = 590621.671588 (including connections establishing)
tps = 591093.333153 (excluding connections establishing)
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 17722941
latency average = 0.169 ms
tps = 590728.715465 (including connections establishing)
tps = 591619.817043 (excluding connections establishing)
On Mon, Feb 6, 2017 at 8:28 PM, Bernd Helmle <mailings@oopsware.de> wrote:
On Mon, 06.02.2017 at 16:45 +0300, Alexander Korotkov wrote:
I tried lwlock-power-2.patch on the multicore Power machine we have in
Postgres Pro.
I realized that using labels in assembly isn't safe. Thus, I removed the
labels and used relative jumps instead (lwlock-power-3.patch).
Unfortunately, I didn't manage to run any reasonable benchmarks. This
machine runs AIX, and there are a lot of problems which prevent PostgreSQL
from showing high TPS. Installing Linux there is not an option either,
because that machine is used for attempts to make Postgres work properly on
AIX. So, benchmarking help is very relevant. I would very much appreciate it.
Okay, so here are some results. The bench runs against
current PostgreSQL master, with 24 GByte shared_buffers configured (128
GByte physical RAM), max_wal_size=8GB and effective_cache_size=100GB.
Thank you very much for testing!
The results look strange to me. I wonder why there is a difference between
lwlock-power-1.patch and lwlock-power-3.patch? Intuitively, there shouldn't
be one, because there is not much difference between them. Thus, I have the
following questions.
1. Have you warmed up the database? I.e., could you do "SELECT sum(x.x) FROM
(SELECT pg_prewarm(oid) AS x FROM pg_class WHERE relkind IN ('i', 'r')
ORDER BY oid) x;" before each run?
2. Also, could you run each test longer, 3-5 minutes, and run with a
variety of client counts?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On 2017-02-03 20:01:03 +0300, Alexander Korotkov wrote:
Using assembly in lwlock.c looks rough. This is why I refactored it by
introducing a new atomic operation, pg_atomic_fetch_mask_add_u32 (see
lwlock-power-2.patch). It checks that all masked bits are clear and then
adds to the variable. This atomic has a special assembly implementation for
Power, and a generic implementation for other platforms using a loop of
CAS. We will probably have implementations for other architectures in the
future. This level of abstraction is the best I managed to invent.
I think that's a reasonable approach. And I think it might be worth
experimenting with a more efficient implementation on x86 too, using
hardware lock elision (HLE) and/or TSX.
Andres
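For reference, a rough sketch of what an HLE-elided lock acquisition could
look like on x86 with GCC (compile with -mhle); this is purely illustrative,
not a posted patch:

/* HLE sketch: the CPU first tries to run the critical section
 * transactionally, falling back to the real lock on abort. */
static volatile int hle_lock;

static inline void
hle_lock_acquire(void)
{
	/* xacquire-hinted exchange starts hardware elision */
	while (__atomic_exchange_n(&hle_lock, 1,
							   __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
	{
		/* spin without the elision hint until the lock looks free */
		while (hle_lock)
			__builtin_ia32_pause();
	}
}

static inline void
hle_lock_release(void)
{
	/* matching xrelease-hinted store ends the elided region */
	__atomic_store_n(&hle_lock, 0,
					 __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}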
On Mon, 2017-02-06 at 22:44 +0300, Alexander Korotkov wrote:
The results look strange to me. I wonder why there is a difference between
lwlock-power-1.patch and lwlock-power-3.patch? Intuitively, there shouldn't
be one, because there is not much difference between them. Thus, I have the
following questions.
Yeah, I've realized that as well.
1. Have you warmed up the database? I.e., could you do "SELECT sum(x.x)
FROM (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE relkind IN ('i', 'r')
ORDER BY oid) x;" before each run?
2. Also, could you run each test longer, 3-5 minutes, and run with a
variety of client counts?
The results I've posted were the last 3 runs of 9 in total. I hoped that
would be enough to prewarm the system. I'm going to repeat the tests with
the changes you've requested, though.
Am Montag, den 06.02.2017, 22:44 +0300 schrieb Alexander Korotkov:
2. Also could you run each test longer: 3-5 mins, and run them
withvariety of clients count?
So here are some other results. I've changed max_connections to 300.
The bench was prewarmed and run 300s each.
I could run more benches, if necessary.
Attachments:
On Tue, Feb 7, 2017 at 3:16 PM, Bernd Helmle <mailings@oopsware.de> wrote:
On Mon, 06.02.2017 at 22:44 +0300, Alexander Korotkov wrote:
2. Also, could you run each test longer, 3-5 minutes, and run with a
variety of client counts?
So here are some other results. I've changed max_connections to 300.
The benches were prewarmed and run for 300s each.
I could run more benches, if necessary.
Thank you very much for the benchmarks!
There is a clear win for both lwlock-power-1.patch and lwlock-power-3.patch
in comparison to master. The difference between lwlock-power-1.patch and
lwlock-power-3.patch seems to be within the margin of error. But the win
isn't as high as I observed earlier. And I wonder why the absolute numbers
are lower than in our earlier experiments. We used an IBM E880, which is
actually two nodes with an interconnect. However, the interconnect is not
fast enough to make one PostgreSQL instance work across both nodes. Thus,
we used half of the IBM E880, which has 4 sockets and 32 physical cores.
You use an IBM E850, which is really a single node with 4 sockets and 48
physical cores. Thus, it seems that you get lower absolute numbers on more
powerful hardware. That makes me uneasy, and I think we probably aren't
getting the best from this hardware. Just in case, do you use SMT=8?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, 07.02.2017 at 16:48 +0300, Alexander Korotkov wrote:
But the win isn't as high as I observed earlier. And I wonder why the
absolute numbers are lower than in our earlier experiments. We used an
IBM E880, which is actually two
Did you run your tests on bare metal or were they also virtualized?
nodes with an interconnect. However, the interconnect is not fast enough
to make one PostgreSQL instance work across both nodes. Thus, we used half
of the IBM E880, which has 4 sockets and 32 physical cores. You use an IBM
E850, which is really a single node with 4 sockets and 48 physical cores.
Thus, it seems that you get lower absolute numbers on more powerful
hardware. That makes me uneasy, and I think we probably aren't getting the
best from this hardware. Just in case, do you use SMT=8?
Yes, SMT=8 was used.
The machine has 4 sockets with 8 cores each, at 3.7 GHz clock frequency.
The Ubuntu LPAR running on PowerVM isn't using all physical cores;
currently 28 cores are assigned (= 224 SMT threads). The other cores are
dedicated to the PowerVM hypervisor and a (very) small AIX LPAR.
I've done more pgbench runs this morning with SMT-4, too; the fastest
result with master was
SMT-4
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 167306423
latency average = 0.179 ms
latency stddev = 0.072 ms
tps = 557685.144912 (including connections establishing)
tps = 557835.683204 (excluding connections establishing)
compared with SMT-8:
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 173476449
latency average = 0.173 ms
latency stddev = 0.059 ms
tps = 578250.676019 (including connections establishing)
tps = 578412.159601 (excluding connections establishing)
and retried with lwlocks-power-3, SMT-4:
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 185991995
latency average = 0.161 ms
latency stddev = 0.059 ms
tps = 619970.030069 (including connections establishing)
tps = 620112.263770 (excluding connections establishing)
...and SMT-8
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 185878717
latency average = 0.161 ms
latency stddev = 0.047 ms
tps = 619591.476154 (including connections establishing)
tps = 619655.867280 (excluding connections establishing)
Interestingly, the LWLock patch seems to decrease the influence of SMT.
Side note: the system makes around 2 million context switches during the
benchmarks, e.g.
awk '{print $12;}' /tmp/vmstat.out
cs
10
2153533
2134864
2141623
2126845
2128330
2127454
2145325
2126769
2134492
2130246
2130071
2142660
2136077
2126783
2126107
2125823
2136511
2137752
2146307
2141127
I've also tried a more recent kernel this morning (4.4 vs. 4.8), but
this didn't change the picture. Is there anything more I can do?
On Wed, Feb 8, 2017 at 5:00 PM, Bernd Helmle <mailings@oopsware.de> wrote:
On Tue, 07.02.2017 at 16:48 +0300, Alexander Korotkov wrote:
But the win isn't as high as I observed earlier. And I wonder why the
absolute numbers are lower than in our earlier experiments. We used an
IBM E880, which is actually two
Did you run your tests on bare metal or were they also virtualized?
I ran the tests on bare metal.
nodes with an interconnect. However, the interconnect is not fast enough
to make one PostgreSQL instance work across both nodes. Thus, we used half
of the IBM E880, which has 4 sockets and 32 physical cores. You use an IBM
E850, which is really a single node with 4 sockets and 48 physical cores.
Thus, it seems that you get lower absolute numbers on more powerful
hardware. That makes me uneasy, and I think we probably aren't getting the
best from this hardware. Just in case, do you use SMT=8?
Yes, SMT=8 was used.
The machine has 4 sockets with 8 cores each, at 3.7 GHz clock frequency.
The Ubuntu LPAR running on PowerVM isn't using all physical cores;
currently 28 cores are assigned (= 224 SMT threads). The other cores are
dedicated to the PowerVM hypervisor and a (very) small AIX LPAR.
Thank you very much for the explanation.
Thus, I see reasons why the absolute results in your tests are lower than
in my previous tests.
1. You use 28 physical cores while I was using 32 physical cores.
2. You run tests in PowerVM while I was running tests on bare metal.
PowerVM could have some overhead.
3. I guess you run pgbench on the same machine, while in my tests pgbench
was running on another node of the IBM E880.
Therefore, with lower absolute numbers in your tests, the win of the LWLock
optimization is also lower. That is understandable. But the win of the
LWLock optimization is clearly visible and definitely exceeds the variation.
I think it would make sense to run more kinds of tests. Could you try the
set of tests provided by Tomas Vondra?
Even if we wouldn't see a win in some of the tests, it would still be
valuable to see that there is no regression there.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 02/11/2017 01:42 PM, Alexander Korotkov wrote:
I think it would make sense to run more kinds of tests. Could you try
the set of tests provided by Tomas Vondra?
Even if we wouldn't see a win in some of the tests, it would still be
valuable to see that there is no regression there.
FWIW it shouldn't be difficult to tweak my scripts and run them on
another machine. You'd have to customize the parameters (scales, client
counts, ...) and there are a few hard-coded paths, but that's about it.
regards
Tomas
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, 11.02.2017 at 15:42 +0300, Alexander Korotkov wrote:
Thus, I see reasons why the absolute results in your tests are lower than
in my previous tests.
1. You use 28 physical cores while I was using 32 physical cores.
2. You run tests in PowerVM while I was running tests on bare metal.
PowerVM could have some overhead.
3. I guess you run pgbench on the same machine, while in my tests pgbench
was running on another node of the IBM E880.
Yeah, pgbench was running locally. Maybe we can get some resources to
run them remotely.
Short side note: if you run two Postgres instances concurrently with the
same pgbench parameters, each instance gets nearly the same transaction
throughput as when running against a single instance, e.g.
- single
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 112
number of threads: 112
duration: 300 s
number of transactions actually processed: 121523797
latency average = 0.276 ms
latency stddev = 0.096 ms
tps = 405075.282309 (including connections establishing)
tps = 405114.299174 (excluding connections establishing)
instance-1/instance-2 concurrently run:
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 112
number of threads: 56
duration: 300 s
number of transactions actually processed: 120645351
latency average = 0.278 ms
latency stddev = 0.158 ms
tps = 402148.536087 (including connections establishing)
tps = 402199.952824 (excluding connections establishing)
transaction type: <builtin: select only>
scaling factor: 1000
query mode: prepared
number of clients: 112
number of threads: 56
duration: 300 s
number of transactions actually processed: 121959772
latency average = 0.275 ms
latency stddev = 0.110 ms
tps = 406530.139080 (including connections establishing)
tps = 406556.658638 (excluding connections establishing)
So it looks like the machine has plenty of power, but PostgreSQL is
hitting a limit somewhere.
Therefore, with lower absolute numbers in your tests, the win of the LWLock
optimization is also lower. That is understandable. But the win of the
LWLock optimization is clearly visible and definitely exceeds the variation.
I think it would make sense to run more kinds of tests. Could you try the
set of tests provided by Tomas Vondra?
Even if we wouldn't see a win in some of the tests, it would still be
valuable to see that there is no regression there.
Unfortunately, there are some tests for AIX scheduled, which will assign
resources to that LPAR... I've just talked to the people responsible for
the machine, and we can get more time for the Linux tests ;)
On 02/13/2017 03:16 PM, Bernd Helmle wrote:
On Sat, 11.02.2017 at 15:42 +0300, Alexander Korotkov wrote:
Thus, I see reasons why the absolute results in your tests are lower than
in my previous tests.
1. You use 28 physical cores while I was using 32 physical cores.
2. You run tests in PowerVM while I was running tests on bare metal.
PowerVM could have some overhead.
3. I guess you run pgbench on the same machine, while in my tests pgbench
was running on another node of the IBM E880.
Yeah, pgbench was running locally. Maybe we can get some resources to
run them remotely.
Short side note: if you run two Postgres instances concurrently with the
same pgbench parameters, each instance gets nearly the same transaction
throughput as when running against a single instance, e.g.
That strongly suggests you're hitting some kind of lock. It'd be good to
know which one. I see you're doing "pgbench -S", which also updates
branches and other tiny tables - it's possible the sessions are trying
to update the same row in those tiny tables. You're running with scale
1000, but with 100 clients it's still possible thanks to the birthday
paradox.
Otherwise it might be interesting to look at sampling wait events, which
might tell us more about the locks.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 13, 2017 at 10:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
On 02/13/2017 03:16 PM, Bernd Helmle wrote:
On Sat, 11.02.2017 at 15:42 +0300, Alexander Korotkov wrote:
Thus, I see reasons why the absolute results in your tests are lower than
in my previous tests.
1. You use 28 physical cores while I was using 32 physical cores.
2. You run tests in PowerVM while I was running tests on bare metal.
PowerVM could have some overhead.
3. I guess you run pgbench on the same machine, while in my tests pgbench
was running on another node of the IBM E880.
Yeah, pgbench was running locally. Maybe we can get some resources to
run them remotely.
Short side note: if you run two Postgres instances concurrently with the
same pgbench parameters, each instance gets nearly the same transaction
throughput as when running against a single instance, e.g.
That strongly suggests you're hitting some kind of lock. It'd be good to
know which one. I see you're doing "pgbench -S", which also updates
branches and other tiny tables - it's possible the sessions are trying to
update the same row in those tiny tables. You're running with scale 1000,
but with 100 clients it's still possible thanks to the birthday paradox.
Otherwise it might be interesting to look at sampling wait events, which
might tell us more about the locks.
+1
And you could try to use pg_wait_sampling
<https://github.com/postgrespro/pg_wait_sampling> to sampling of wait
events.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, 14.02.2017 at 15:53 +0300, Alexander Korotkov wrote:
+1
And you could try to use pg_wait_sampling
<https://github.com/postgrespro/pg_wait_sampling> to sample wait
events.
Okay, I'm going to try this. Currently Tomas' scripts are still
running; I'll provide updates as soon as they are finished.
On Sat, 11.02.2017 at 15:42 +0300, Alexander Korotkov wrote:
I think it would make sense to run more kinds of tests. Could you try the
set of tests provided by Tomas Vondra?
Even if we wouldn't see a win in some of the tests, it would still be
valuable to see that there is no regression there.
Sorry for the delay.
But here are the results after running Tomas' benchmark scripts.
The pgbench-ro benchmark shows a clear win for the lwlock-power-3 patch,
and you can also see a slight advantage for this patch in the pgbench-rw
graph. At least, I don't see any significant regression anywhere.
The graphs show the average values over all five runs for each client
count.
Attachments:
pgbench-ro-simple.png (image/png)