Inlining comparators as a performance optimisation
Recent discussions on the threads "Double sorting split patch" and
"CUDA sorting" raised the possibility that there could be significant
performance optimisation "low-hanging fruit" picked by having the
executor treat integers and floats as a special case during sorting,
avoiding going to the trouble of calling a comparator using the
built-in SQL function machinery, and taking advantage of inlining of
the comparator, which has been shown to have a considerable
performance advantage (at least compared to a general purpose c stdlib
qsort(), that takes a function pointer as its comparator, much like
tuplesort).
I've hacked together a sloppy POC implementation in a hurry
(basically, some code is shifted around) , which is attached - I felt
that it would be useful to determine if the community feels that this
is a worth-while undertaking in advance of a business trip that I'm
leaving on tomorrow lasting until Friday, during which I will be
mostly unavailable. The patch breaks the Postgres sorting executor
node (at least when it quicksorts) for any type other than int4. I
apologise for how rough the patch is, but the code itself isn't
important right now - the idea is. I anticipate that the value
state->datumType or something similar will be set in a near future
revision, so that tuplesort_performsort will know which type-specific
optimisation it can use for the type, while falling back on the
existing generic qsort_arg + qsort_arg_comparator, and sorting won't
actually be broken.
I've been doing some preliminary testing using the dell store 2 sample
database. I increase work_mem to '50MB', to ensure that a quicksort
will be performed for sorting (otherwise, I'm using the
postgresql.conf that initdb gave me). The query is:
explain analyze select * from orderlines order by prod_id;
Once the cache has been warmed, explain analyze very consistently
reports a runtime of 123ms for this query on master/HEAD, which varies
+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
this patch, that goes down to 107ms +/- 1ms at -O0. I think that
that's a pretty good start. Funnily enough, the difference/advantage
vanishes at -O2 (I'm guessing that the higher optimisation level of
GCC 4.5 hyper-corrects away the inlining, but I don't have time to
check that right now).
I imagine the version that I actually submit for patch review will
have a macro-based infrastructure for inlining the sorting of various
built-in types, initially integers and floats. It will most likely
have some other optimisations - I haven't even used a profiler yet.
This performance patch differs from most in that it's difficult in
principle to imagine a performance regression occurring.
Thoughts?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
inline_comp.patchtext/x-patch; charset=US-ASCII; name=inline_comp.patchDownload
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3505236..c5ac708 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1224,6 +1224,285 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
}
}
+inline int
+compare_int4( Datum first, Datum second)
+{
+ int32 a_is = DatumGetInt32(first);
+ int32 b_is = DatumGetInt32(second);
+
+ if (a_is > b_is)
+ return 1;
+ else if (a_is == b_is)
+ return 0;
+ else
+ return -1;
+}
+
+
+/*
+ * Inline-able copy of FunctionCall2Coll() to save some cycles in sorting.
+ */
+static inline Datum
+myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2)
+{
+ FunctionCallInfoData fcinfo;
+ Datum result;
+
+ InitFunctionCallInfoData(fcinfo, flinfo, 2, collation, NULL, NULL);
+
+ fcinfo.arg[0] = arg1;
+ fcinfo.arg[1] = arg2;
+ fcinfo.argnull[0] = false;
+ fcinfo.argnull[1] = false;
+
+ result = FunctionCallInvoke(&fcinfo);
+
+ /* Check for null result, since caller is clearly not expecting one */
+ if (fcinfo.isnull)
+ elog(ERROR, "function %u returned NULL", fcinfo.flinfo->fn_oid);
+
+ return result;
+}
+/*
+ * Apply a sort function (by now converted to fmgr lookup form)
+ * and return a 3-way comparison result. This takes care of handling
+ * reverse-sort and NULLs-ordering properly. We assume that DESC and
+ * NULLS_FIRST options are encoded in sk_flags the same way btree does it.
+ */
+static inline int32
+inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation,
+ Datum datum1, bool isNull1,
+ Datum datum2, bool isNull2)
+{
+ int32 compare;
+
+ if (isNull1)
+ {
+ if (isNull2)
+ compare = 0; /* NULL "=" NULL */
+ else if (sk_flags & SK_BT_NULLS_FIRST)
+ compare = -1; /* NULL "<" NOT_NULL */
+ else
+ compare = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull2)
+ {
+ if (sk_flags & SK_BT_NULLS_FIRST)
+ compare = 1; /* NOT_NULL ">" NULL */
+ else
+ compare = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ compare = compare_int4(datum1, datum2);
+
+ if (sk_flags & SK_BT_DESC)
+ compare = -compare;
+ }
+
+ return compare;
+}
+
+
+
+inline int
+comparetup_heap_int4(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
+{
+ ScanKey scanKey = state->scanKeys;
+ HeapTupleData ltup;
+ HeapTupleData rtup;
+ TupleDesc tupDesc;
+ int nkey;
+ int32 compare;
+
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+ /* Compare the leading sort key */
+ compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
+ scanKey->sk_collation,
+ a->datum1, a->isnull1,
+ b->datum1, b->isnull1);
+ if (compare != 0)
+ return compare;
+
+
+ /* Compare additional sort keys */
+ ltup.t_len = ((MinimalTuple) a->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
+ ltup.t_data = (HeapTupleHeader) ((char *) a->tuple - MINIMAL_TUPLE_OFFSET);
+ rtup.t_len = ((MinimalTuple) b->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
+ rtup.t_data = (HeapTupleHeader) ((char *) b->tuple - MINIMAL_TUPLE_OFFSET);
+ tupDesc = state->tupDesc;
+ scanKey++;
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ AttrNumber attno = scanKey->sk_attno;
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ datum1 = heap_getattr(<up, attno, tupDesc, &isnull1);
+ datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);
+
+ compare = compare_int4(datum1, datum2);
+ return compare;
+ }
+
+ return 0;
+}
+
+
+inline static char *med3(char *a, char *b, char *c,
+ void *arg);
+inline static void swapfunc(char *, char *, size_t, int);
+
+/*
+ * Qsort routine based on J. L. Bentley and M. D. McIlroy,
+ * "Engineering a sort function",
+ * Software--Practice and Experience 23 (1993) 1249-1265.
+ * We have modified their original by adding a check for already-sorted input,
+ * which seems to be a win per discussions on pgsql-hackers around 2006-03-21.
+ */
+#define swapcode(TYPE, parmi, parmj, n) \
+do { \
+ size_t i = (n) / sizeof (TYPE); \
+ TYPE *pi = (TYPE *)(void *)(parmi); \
+ TYPE *pj = (TYPE *)(void *)(parmj); \
+ do { \
+ TYPE t = *pi; \
+ *pi++ = *pj; \
+ *pj++ = t; \
+ } while (--i > 0); \
+} while (0)
+
+#define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
+ (es) % sizeof(long) ? 2 : (es) == sizeof(long)? 0 : 1;
+
+inline static void
+swapfunc(char *a, char *b, size_t n, int swaptype)
+{
+ if (swaptype <= 1)
+ swapcode(long, a, b, n);
+ else
+ swapcode(char, a, b, n);
+}
+
+#define swap(a, b) \
+ if (swaptype == 0) { \
+ long t = *(long *)(void *)(a); \
+ *(long *)(void *)(a) = *(long *)(void *)(b); \
+ *(long *)(void *)(b) = t; \
+ } else \
+ swapfunc(a, b, es, swaptype)
+
+#define vecswap(a, b, n) if ((n) > 0) swapfunc((a), (b), (size_t)(n), swaptype)
+
+inline static char *
+med3(char *a, char *b, char *c, void *arg)
+{
+ return comparetup_heap_int4(a, b, arg) < 0 ?
+ (comparetup_heap_int4(b, c, arg) < 0 ? b : (comparetup_heap_int4(a, c, arg) < 0 ? c : a))
+ : (comparetup_heap_int4(b, c, arg) > 0 ? b : (comparetup_heap_int4(a, c, arg) < 0 ? a : c));
+}
+
+
+
+void
+qsort_arg_int4(void *a, size_t n, size_t es, void *arg)
+{
+ char *pa,
+ *pb,
+ *pc,
+ *pd,
+ *pl,
+ *pm,
+ *pn;
+ int d,
+ r,
+ swaptype,
+ presorted;
+
+loop:SWAPINIT(a, es);
+ if (n < 7)
+ {
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)
+ for (pl = pm; pl > (char *) a && comparetup_heap_int4(pl - es, pl, arg) > 0;
+ pl -= es)
+ swap(pl, pl - es);
+ return;
+ }
+ presorted = 1;
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)
+ {
+ if (comparetup_heap_int4(pm - es, pm, arg) > 0)
+ {
+ presorted = 0;
+ break;
+ }
+ }
+ if (presorted)
+ return;
+ pm = (char *) a + (n / 2) * es;
+ if (n > 7)
+ {
+ pl = (char *) a;
+ pn = (char *) a + (n - 1) * es;
+ if (n > 40)
+ {
+ d = (n / 8) * es;
+ pl = med3(pl, pl + d, pl + 2 * d, arg);
+ pm = med3(pm - d, pm, pm + d, arg);
+ pn = med3(pn - 2 * d, pn - d, pn, arg);
+ }
+ pm = med3(pl, pm, pn, arg);
+ }
+ swap(a, pm);
+ pa = pb = (char *) a + es;
+ pc = pd = (char *) a + (n - 1) * es;
+ for (;;)
+ {
+ while (pb <= pc && (r = comparetup_heap_int4(pb, a, arg)) <= 0)
+ {
+ if (r == 0)
+ {
+ swap(pa, pb);
+ pa += es;
+ }
+ pb += es;
+ }
+ while (pb <= pc && (r = comparetup_heap_int4(pc, a, arg)) >= 0)
+ {
+ if (r == 0)
+ {
+ swap(pc, pd);
+ pd -= es;
+ }
+ pc -= es;
+
+ }
+ if (pb > pc)
+ break;
+ swap(pb, pc);
+ pb += es;
+ pc -= es;
+ }
+ pn = (char *) a + n * es;
+ r = Min(pa - (char *) a, pb - pa);
+ vecswap(a, pb - r, r);
+ r = Min(pd - pc, pn - pd - es);
+ vecswap(pb, pn - r, r);
+ if ((r = pb - pa) > es)
+ qsort_arg_int4(a, r / es, es, arg);
+ if ((r = pd - pc) > es)
+ {
+ /* Iterate rather than recurse to save stack space */
+ a = pn - r;
+ n = r / es;
+ goto loop;
+ }
+}
+
/*
* All tuples have been provided; finish the sort.
*/
@@ -1247,11 +1526,14 @@ tuplesort_performsort(Tuplesortstate *state)
* amount of memory. Just qsort 'em and we're done.
*/
if (state->memtupcount > 1)
- qsort_arg((void *) state->memtuples,
+ {
+ /* For this POC patch, pretend we only ever sort int4 */
+ qsort_arg_int4((void *) state->memtuples,
state->memtupcount,
sizeof(SortTuple),
- (qsort_arg_comparator) state->comparetup,
(void *) state);
+
+ }
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
@@ -2628,72 +2910,6 @@ SelectSortFunction(Oid sortOperator,
}
/*
- * Inline-able copy of FunctionCall2Coll() to save some cycles in sorting.
- */
-static inline Datum
-myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2)
-{
- FunctionCallInfoData fcinfo;
- Datum result;
-
- InitFunctionCallInfoData(fcinfo, flinfo, 2, collation, NULL, NULL);
-
- fcinfo.arg[0] = arg1;
- fcinfo.arg[1] = arg2;
- fcinfo.argnull[0] = false;
- fcinfo.argnull[1] = false;
-
- result = FunctionCallInvoke(&fcinfo);
-
- /* Check for null result, since caller is clearly not expecting one */
- if (fcinfo.isnull)
- elog(ERROR, "function %u returned NULL", fcinfo.flinfo->fn_oid);
-
- return result;
-}
-
-/*
- * Apply a sort function (by now converted to fmgr lookup form)
- * and return a 3-way comparison result. This takes care of handling
- * reverse-sort and NULLs-ordering properly. We assume that DESC and
- * NULLS_FIRST options are encoded in sk_flags the same way btree does it.
- */
-static inline int32
-inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation,
- Datum datum1, bool isNull1,
- Datum datum2, bool isNull2)
-{
- int32 compare;
-
- if (isNull1)
- {
- if (isNull2)
- compare = 0; /* NULL "=" NULL */
- else if (sk_flags & SK_BT_NULLS_FIRST)
- compare = -1; /* NULL "<" NOT_NULL */
- else
- compare = 1; /* NULL ">" NOT_NULL */
- }
- else if (isNull2)
- {
- if (sk_flags & SK_BT_NULLS_FIRST)
- compare = 1; /* NOT_NULL ">" NULL */
- else
- compare = -1; /* NOT_NULL "<" NULL */
- }
- else
- {
- compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
- datum1, datum2));
-
- if (sk_flags & SK_BT_DESC)
- compare = -compare;
- }
-
- return compare;
-}
-
-/*
* Non-inline ApplySortFunction() --- this is needed only to conform to
* C99's brain-dead notions about how to implement inline functions...
*/
Peter Geoghegan <peter@2ndquadrant.com> writes:
Once the cache has been warmed, explain analyze very consistently
reports a runtime of 123ms for this query on master/HEAD, which varies
+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
this patch, that goes down to 107ms +/- 1ms at -O0. I think that
that's a pretty good start. Funnily enough, the difference/advantage
vanishes at -O2 (I'm guessing that the higher optimisation level of
GCC 4.5 hyper-corrects away the inlining, but I don't have time to
check that right now).
Considering that -O2 is our standard optimization level, that
observation seems to translate to "this patch will be useless in
practice". I think you had better investigate that aspect in some
detail before spending more effort.
This performance patch differs from most in that it's difficult in
principle to imagine a performance regression occurring.
Really? N copies of the same code could lead to performance loss just
due to code bloat (ie, less of a query's inner loops fitting in CPU
cache). Not to mention the clear regression in maintainability. So
I'm disinclined to consider this sort of change without a significantly
bigger win than you're suggesting above (no, I don't even consider the
-O0 number attractive, let alone what you're finding at -O2).
regards, tom lane
Recent discussions on the threads "Double sorting split patch" and
"CUDA sorting" raised the possibility that there could be significant
performance optimisation "low-hanging fruit" picked by having the
executor treat integers and floats as a special case during sorting,
avoiding going to the trouble of calling a comparator using the
built-in SQL function machinery
Why only for integers and floats why not for char/varchar?
But I believe this can make code less maintainable as similar things can be
done at other places to avoid SQL function machinery.
Once the cache has been warmed, explain analyze very consistently
reports a runtime of 123ms for this query on master/HEAD, which varies
+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
this patch, that goes down to 107ms +/- 1ms at -O0.
Time 123ms which is without your change is with which optimization -O2 or
O0?
****************************************************************************
***********
This e-mail and attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed
above. Any use of the information contained herein in any way (including,
but not limited to, total or partial disclosure, reproduction, or
dissemination) by persons other than the intended recipient's) is
prohibited. If you receive this e-mail in error, please notify the sender by
phone or email immediately and delete it!
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Geoghegan
Sent: Tuesday, September 20, 2011 7:26 AM
To: PG Hackers
Subject: [HACKERS] Inlining comparators as a performance optimisation
Recent discussions on the threads "Double sorting split patch" and
"CUDA sorting" raised the possibility that there could be significant
performance optimisation "low-hanging fruit" picked by having the
executor treat integers and floats as a special case during sorting,
avoiding going to the trouble of calling a comparator using the
built-in SQL function machinery, and taking advantage of inlining of
the comparator, which has been shown to have a considerable
performance advantage (at least compared to a general purpose c stdlib
qsort(), that takes a function pointer as its comparator, much like
tuplesort).
I've hacked together a sloppy POC implementation in a hurry
(basically, some code is shifted around) , which is attached - I felt
that it would be useful to determine if the community feels that this
is a worth-while undertaking in advance of a business trip that I'm
leaving on tomorrow lasting until Friday, during which I will be
mostly unavailable. The patch breaks the Postgres sorting executor
node (at least when it quicksorts) for any type other than int4. I
apologise for how rough the patch is, but the code itself isn't
important right now - the idea is. I anticipate that the value
state->datumType or something similar will be set in a near future
revision, so that tuplesort_performsort will know which type-specific
optimisation it can use for the type, while falling back on the
existing generic qsort_arg + qsort_arg_comparator, and sorting won't
actually be broken.
I've been doing some preliminary testing using the dell store 2 sample
database. I increase work_mem to '50MB', to ensure that a quicksort
will be performed for sorting (otherwise, I'm using the
postgresql.conf that initdb gave me). The query is:
explain analyze select * from orderlines order by prod_id;
Once the cache has been warmed, explain analyze very consistently
reports a runtime of 123ms for this query on master/HEAD, which varies
+/- 1 ms, with a few outliers of maybe +/- 2ms. However, when I apply
this patch, that goes down to 107ms +/- 1ms at -O0. I think that
that's a pretty good start. Funnily enough, the difference/advantage
vanishes at -O2 (I'm guessing that the higher optimisation level of
GCC 4.5 hyper-corrects away the inlining, but I don't have time to
check that right now).
I imagine the version that I actually submit for patch review will
have a macro-based infrastructure for inlining the sorting of various
built-in types, initially integers and floats. It will most likely
have some other optimisations - I haven't even used a profiler yet.
This performance patch differs from most in that it's difficult in
principle to imagine a performance regression occurring.
Thoughts?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 20 September 2011 03:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Considering that -O2 is our standard optimization level, that
observation seems to translate to "this patch will be useless in
practice". I think you had better investigate that aspect in some
detail before spending more effort.
I don't think that the fact that that happens is at all significant at
this early stage, and it never even occurred to me that you'd think
that it might be. I was simply disclosing a quirk of this POC patch.
The workaround is probably to use a macro instead. For the benefit of
those that didn't follow the other threads, the macro-based qsort
implementation, which I found to perform significantly better than
regular qsort(), runs like this on my laptop when I built at 02 with
GCC 4.6 just now:
C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 seconds
Does *that* look attractive to you? I've attached source code of the
program that produced these figures, which has been ported to C from
C++.
When I #define LARGE_SIZE 100000000, here's what I see:
[peter@peter inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 23.659411 seconds
Inline quick-sort time elapsed: 18.470611 seconds
Here, sorting with the function pointer/stdlib version takes about
1.28 times as long. In the prior test (with the smaller LARGE_SIZE),
it took about 1.32 times as long. Fairly predictable, linear, and not
to be sniffed at.
The variance I'm seeing across runs is low - a couple of hundredths of
a second at most. This is a Fedora 15 " Intel(R) Core(TM) i5-2540M CPU
@ 2.60GHz" machine. I'm not sure right now why the inline quick-sort
is less of a win than on my old Fedora 14 desktop (where it was 3.24
Vs 2.01), but it's still a significant win. Perhaps others can build
this simple program and tell me what they come up with.
This performance patch differs from most in that it's difficult in
principle to imagine a performance regression occurring.Really? N copies of the same code could lead to performance loss just
due to code bloat (ie, less of a query's inner loops fitting in CPU
cache).
I did consider that. Of course inlining has an overhead, and I'll be
testing that each instance of inlining has a net benefit. I just meant
that many other performance patches have an obvious worst case, and I
think that it is policy to focus on that case, but I can't think of
one here. Does anyone else have any ideas?
Not to mention the clear regression in maintainability. So
I'm disinclined to consider this sort of change without a significantly
bigger win than you're suggesting above
Sure, there'll be some sort of regression in maintainability - I think
that HOT had a clear regression in maintainability too. The important
questions are obviously "how big is the loss of maintainability?", and
"is it worth it?". We'll know more when this work is actually shaped
into a proper patch. Perhaps I should have waited until I had
something along those lines before making an announcement, but I
wanted community input as early as possible. I think that there's
plenty of tweaking that can be done to get additional performance
improvements - all I've done so far is demonstrate that those
improvements are real and worth thinking about, in the fastest
possible way, partly because you expressed skepticism of the benefits
of inlining comparators to Greg Stark in an earlier thread.
Performance and maintainability are often somewhat in tension, but we
cannot ignore performance. If this work can bring us an improvement in
performance approaching the isolated macro Vs qsort() function pointer
benchmark, that's a *big* win. Sorting integers and floats is very
common and important.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
----- Цитат от Peter Geoghegan (peter@2ndquadrant.com), на 21.09.2011 в 02:53 -----
On 20 September 2011 03:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Considering that -O2 is our standard optimization level, that
observation seems to translate to "this patch will be useless in
practice". I think you had better investigate that aspect in some
detail before spending more effort.I don't think that the fact that that happens is at all significant at
this early stage, and it never even occurred to me that you'd think
that it might be. I was simply disclosing a quirk of this POC patch.
The workaround is probably to use a macro instead. For the benefit of
those that didn't follow the other threads, the macro-based qsort
implementation, which I found to perform significantly better than
regular qsort(), runs like this on my laptop when I built at 02 with
GCC 4.6 just now:C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 secondsDoes *that* look attractive to you? I've attached source code of the
program that produced these figures, which has been ported to C from
C++.When I #define LARGE_SIZE 100000000, here's what I see:
[peter@peter inline_compar_test]$ ./a.out
C stdlib quick-sort time elapsed: 23.659411 seconds
Inline quick-sort time elapsed: 18.470611 secondsHere, sorting with the function pointer/stdlib version takes about
1.28 times as long. In the prior test (with the smaller LARGE_SIZE),
it took about 1.32 times as long. Fairly predictable, linear, and not
to be sniffed at.The variance I'm seeing across runs is low - a couple of hundredths of
a second at most. This is a Fedora 15 " Intel(R) Core(TM) i5-2540M CPU
@ 2.60GHz" machine. I'm not sure right now why the inline quick-sort
is less of a win than on my old Fedora 14 desktop (where it was 3.24
Vs 2.01), but it's still a significant win. Perhaps others can build
this simple program and tell me what they come up with.
Run it here.
Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz
gcc version 4.6.1 (Debian 4.6.1-10)
g++ -O2 qsort-inline-benchmark.c
./a.out
C stdlib quick-sort time elapsed: 1.942686 seconds
Inline quick-sort time elapsed: 1.126508 seconds
With #define LARGE_SIZE 100000000
C stdlib quick-sort time elapsed: 22.158207 seconds
Inline quick-sort time elapsed: 12.861018 seconds
with g++ -O0
C stdlib quick-sort time elapsed: 2.736360 seconds
Inline quick-sort time elapsed: 2.045619 seconds
On server hardware:
Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
gcc version 4.4.5 (Debian 4.4.5-8)
/a.out
C stdlib quick-sort time elapsed: 2.610150 seconds
Inline quick-sort time elapsed: 1.494198 seconds
All -O2 version show 42% speedup with inlined qsort.
-O0 showed 25% speedup.
Best regards
--
Luben Karavelov
Import Notes
Resolved by subject fallback
This attempts to be as simple as it gets while reducing function call
depth, and should be viewed as a proof of concept. It is also untested
as of now, but will try to do that and report back.
I'm hoping I followed the rabbit hole correctly and are correctly
comparing the right pointers to each other in order to short circuit the
case where we are using the int4 comparison operator.
Peter, if you want to compare stock vs. your patch vs. this patch, we might
be able to get some sort of read on where the maintainablity vs. performance
curve lies. Note that this version should still allow sorting of anything,
and simply shifts gears for int4 tuples...
---
src/backend/utils/sort/tuplesort.c | 23 +++++++++++++++++++++--
1 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3505236..ddd5ced 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2652,6 +2652,22 @@ myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2)
return result;
}
+static inline
+int int4cmp(Datum first, Datum second)
+{
+ int32 a = DatumGetInt32(first);
+ int32 b = DatumGetInt32(second);
+
+ if (a > b)
+ return 1;
+ else if (a == b)
+ return 0;
+ else
+ return -1;
+}
+
+extern Datum btint4cmp(PG_FUNCTION_ARGS);
+
/*
* Apply a sort function (by now converted to fmgr lookup form)
* and return a 3-way comparison result. This takes care of handling
@@ -2683,8 +2699,11 @@ inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation,
}
else
{
- compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
- datum1, datum2));
+ if (sortFunction->fn_addr == btint4cmp)
+ compare = int4cmp(datum1, datum2);
+ else
+ compare = DatumGetInt32(myFunctionCall2Coll(sortFunction, collation,
+ datum1, datum2));
if (sk_flags & SK_BT_DESC)
compare = -compare;
--
1.7.6.3
On 21.09.2011 02:53, Peter Geoghegan wrote:
C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 secondsDoes *that* look attractive to you?
Not really, to be honest. That's a 25% speedup in pure qsorting speed.
How much of a gain in a real query do you expect to get from that, in
the best case? There's so many other sources of overhead that I'm afraid
this will be lost in the noise. If you find a query that spends, say,
50% of its time in qsort(), you will only get a 12.5% speedup on that
query. And even 50% is really pushing it - I challenge you to find a
query that spends any significant amount of time qsorting integers.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Sep 20, 2011 at 3:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This performance patch differs from most in that it's difficult in
principle to imagine a performance regression occurring.Really? N copies of the same code could lead to performance loss just
due to code bloat (ie, less of a query's inner loops fitting in CPU
cache). Not to mention the clear regression in maintainability. So
I'm disinclined to consider this sort of change without a significantly
bigger win than you're suggesting above (no, I don't even consider the
-O0 number attractive, let alone what you're finding at -O2).
More copies of the code are somewhat annoying, but its only 100 lines
of code in one module and we can easily have specific tests for each.
The extra code size is minor in comparison to the reams of code we add
elsewhere.
It's a surprisingly good win for such a common use case. Well done, Peter.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 21.09.2011 02:53, Peter Geoghegan wrote:
C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 secondsDoes *that* look attractive to you?
Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
much of a gain in a real query do you expect to get from that, in the best
case? There's so many other sources of overhead that I'm afraid this will be
lost in the noise. If you find a query that spends, say, 50% of its time in
qsort(), you will only get a 12.5% speedup on that query. And even 50% is
really pushing it - I challenge you to find a query that spends any
significant amount of time qsorting integers.
How about almost every primary index creation?
Don't really see a reason for the negativity here. If you use that
argument no performance gain is worth it because all workloads are
mixed.
This is a marvellous win, a huge gain from a small, isolated and
easily tested change. By far the smallest amount of additional code to
sorting we will have added and yet one of the best gains.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 21.09.2011 10:01, Simon Riggs wrote:
On Wed, Sep 21, 2011 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:On 21.09.2011 02:53, Peter Geoghegan wrote:
C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 secondsDoes *that* look attractive to you?
Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
much of a gain in a real query do you expect to get from that, in the best
case? There's so many other sources of overhead that I'm afraid this will be
lost in the noise. If you find a query that spends, say, 50% of its time in
qsort(), you will only get a 12.5% speedup on that query. And even 50% is
really pushing it - I challenge you to find a query that spends any
significant amount of time qsorting integers.How about almost every primary index creation?
Nope. Swamped by everything else.
Also note that as soon as the sort grows big enough to not fit in
maintenance_work_mem, you switch to the external sort algorithm, which
doesn't use qsort. Perhaps you could do similar inlining in the heap
sort & merge passes done in the external sort, but it's unlikely to be
as big a win there.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 21 September 2011 01:48, <karavelov@mail.bg> wrote:
All -O2 version show 42% speedup with inlined qsort.
-O0 showed 25% speedup.
Thanks. Looks like the figures I posted last night were fairly
conservative. Does anyone else care to report results?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
How about almost every primary index creation?
Nope. Swamped by everything else.
Really? I think it's pretty common for shops to be able to dedicate
large amounts of RAM to building initial indexes on data loads or
reindex operations. Enough that they can cache the entire table for
the short time they're doing the index builds even if they're quite
large. Witness the recent pleas to allow maintenance_work_mem on the
order of tens of gigabytes. And it's also pretty common that shops can
dedicate very large I/O bandwidth, in many cases enough to saturate
the memory bandwidth, for doing these kinds of batch operations when
they get large enough to need to do an external sort.
There's still overhead of reading the pages, the tuples, finding the
sort keys in the tuple, etc. But I think the actual qsort or heap
operations in tapesort are pretty big portions of the work.
This is pretty easy to measure. Just run oprofile or gprof and see
what percentage of time for a big index build is spent in qsort.
--
greg
On Wed, Sep 21, 2011 at 8:47 AM, Greg Stark <stark@mit.edu> wrote:
On Wed, Sep 21, 2011 at 8:08 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:How about almost every primary index creation?
Nope. Swamped by everything else.
Really? I think it's pretty common for shops to be able to dedicate
large amounts of RAM to building initial indexes on data loads or
reindex operations. Enough that they can cache the entire table for
the short time they're doing the index builds even if they're quite
large. Witness the recent pleas to allow maintenance_work_mem on the
order of tens of gigabytes. And it's also pretty common that shops can
dedicate very large I/O bandwidth, in many cases enough to saturate
the memory bandwidth, for doing these kinds of batch operations when
they get large enough to need to do an external sort.There's still overhead of reading the pages, the tuples, finding the
sort keys in the tuple, etc. But I think the actual qsort or heap
operations in tapesort are pretty big portions of the work.This is pretty easy to measure. Just run oprofile or gprof and see
what percentage of time for a big index build is spent in qsort.
+1 for some actual measurements.
I don't think anyone on this thread is saying that if we can get big
performance gains from doing this we still shouldn't do it. But at
this point it's unclear that we can get a consistent speedup that
isn't heavily dependent on the choice of compiler flags (to say
nothing of compiler and OS), and even if we can, it's not clear that
it will still be noticeable when you measure the run time of an entire
query rather than just the speed of qsort(). Like Tom and Heikki, I'm
a bit skeptical: it wouldn't surprise me to find out that qsort() is
5% of the runtime any realistic test case and the average qsort()
speedup based on tests on a couple different platforms is 10% and so
on average we're looking at a 0.5% improvement, in which case it might
be more trouble than it's worth, especially if it turns out that there
are OS/platform combinations where the inlined version is (for some
crazy reason) slower. I've seen performance differences of up to 3%
from minor code rearrangements that don't seem like they should matter
at all, just because code and data shifts around across cache-line
boundaries and the new arrangement is slightly better or worse than
the old one. So if the performance improvement turns out to be very
small, then validating that it actually IS an improvement in general
is likely to be kind of a pain in the ass.
On the other hand, the performance improvement might turn out to be
large. Maybe there's a test case where, as Heikki suggests, 50% of
the time is spent in qsort(). If we can reliably make that 25%
faster, I wouldn't dismiss that out of hand; I think that would be
pretty good, assuming it didn't require massive amounts of spaghetti
code to make it work. I don't see that that would be any more
marginal than the sorts of things we've optimized in, say, commit
4fc115b2e981f8c63165ca86a23215380a3fda66, or commit
f4d242ef94730c447d87b9840a40b0ec3371fe0f.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 21 September 2011 07:51, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 21.09.2011 02:53, Peter Geoghegan wrote:
C stdlib quick-sort time elapsed: 2.092451 seconds
Inline quick-sort time elapsed: 1.587651 secondsDoes *that* look attractive to you?
Not really, to be honest. That's a 25% speedup in pure qsorting speed. How
much of a gain in a real query do you expect to get from that, in the best
case? There's so many other sources of overhead that I'm afraid this will be
lost in the noise.
I'm surprised that you're dismissive of this. After all, we have in
the past indulged in micro-optimisation of qsort, or so it would seem
from this comment:
* We have modified their original by adding a check for already-sorted input,
* which seems to be a win per discussions on pgsql-hackers around 2006-03-21.
"Makes affected queries radically faster" (In the best case, a speedup
somewhat greater than 12.5%) is an unreasonably high standard for a
performance optimisation of the executor in general (such a high
standard might be sensible if it was due to a particular
maintainability downside, but you didn't mention one). Even still, I
think that the 12.5% figure is pretty pessimistic - I've already sped
up the dell store query by almost that much, and that's with a patch
that was, due to circumstances, cobbled together.
Not only are we benefiting from the effects of inlining, we're also
benefiting from the removal of unnecessary indirection. As Tom said,
"In concrete terms, there would be no reason to have tuplesort.c's
myFunctionCall2Coll, and maybe not inlineApplySortFunction either, if
the datatype-specific comparison functions had APIs that were closer
to what sorting wants rather than following the general
SQL-callable-function API." He was just referring to the benefits of
removing indirection here, so ISTM that this is really two performance
optimisations rolled into one - it's conceivable that the total
performance improvement will even exceed the isolated inlining
comparator benchmark.
As I've said, I believe this patch can be committed without
compromising the maintainability of the tuplesort code to an extent
that is not clearly worth it, through the use of a clean, macro-based
abstraction. Concerns about bloated binaries are probably not well
founded, because what I'm proposing is to a certain extent emulating
C++ templates, while using a very common pattern used with C++
templates. In the C++ world, algorithms are often generalised as
templates, so that they can be used equally well with any datatype
(that supports the interface of the template), while availing of
compiler optimisations per template instantiation (instance of using a
given type with a given template). I actually got the idea for this
patch in part from a book that I read years ago that described the
fact that counter-intuitively, std::sort() consistently outperforms
qsort(), because the comparator is often inlined, and the compiler can
generally avail of optimisations from knowing the comparator at
compile-time.
On 21 September 2011 13:47, Greg Stark <stark@mit.edu> wrote:
This is pretty easy to measure. Just run oprofile or gprof and see
what percentage of time for a big index build is spent in qsort.
I'll do so soon. I intend to get to this on Friday evening, and
perhaps have a proper patch to show next week.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 21.09.2011 17:20, Peter Geoghegan wrote:
Even still, I
think that the 12.5% figure is pretty pessimistic - I've already sped
up the dell store query by almost that much, and that's with a patch
that was, due to circumstances, cobbled together.
I'm not against making things faster, it's just that I haven't seen
solid evidence yet that this will help. Just provide a best-case test
case for this that shows a huge improvement, and I'll shut up. If the
improvement is only modest, then let's discuss how big it is and whether
it's worth the code ugliness this causes.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 21.09.2011 17:20, Peter Geoghegan wrote:
Even still, I
think that the 12.5% figure is pretty pessimistic - I've already sped
up the dell store query by almost that much, and that's with a patch
that was, due to circumstances, cobbled together.
I'm not against making things faster, it's just that I haven't seen
solid evidence yet that this will help. Just provide a best-case test
case for this that shows a huge improvement, and I'll shut up. If the
improvement is only modest, then let's discuss how big it is and whether
it's worth the code ugliness this causes.
The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint. I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.
regards, tom lane
On 09/21/2011 10:50 AM, Tom Lane wrote:
The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint. I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.
Can we get a patch so we can do benchmarks on this?
cheers
andrew
Simon Riggs <simon@2ndQuadrant.com> writes:
This is a marvellous win, a huge gain from a small, isolated and
easily tested change. By far the smallest amount of additional code to
sorting we will have added and yet one of the best gains.
I think you forgot your cheerleader uniform. A patch along these lines
is not going to be small, isolated, easily maintained, nor beneficial
for any but a small number of predetermined datatypes.
regards, tom lane
On 21 September 2011 15:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
I'm not against making things faster, it's just that I haven't seen
solid evidence yet that this will help. Just provide a best-case test
case for this that shows a huge improvement, and I'll shut up. If the
improvement is only modest, then let's discuss how big it is and whether
it's worth the code ugliness this causes.
Fair enough.
The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint.
That's a reasonable question, and I hope to be able to come up with a
good answer.
I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.
I'm fairly surprised that your view on that is mostly or entirely
unchanged, even after I've demonstrated a considerable performance
advantage from a macro-based qsort implementation over my OS vendor's
c std lib qsort(), using an isolated test-case, that does not have
anything to do with that impedance mismatch. I'm not sure why you
doubt that the same thing is happening within tuplesort.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Andrew Dunstan <andrew@dunslane.net> writes:
On 09/21/2011 10:50 AM, Tom Lane wrote:
The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint. I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.
Can we get a patch so we can do benchmarks on this?
Well, we'd have to negotiate what the API ought to be. What I'm
envisioning is that datatypes could provide alternate comparison
functions that are designed to be qsort-callable rather than
SQL-callable. As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.
The idea that I was toying with was to allow the regular SQL-callable
comparison function to somehow return a function pointer to the
alternate comparison function, so that the first comparison in a given
sort run would be done the traditional way but then we'd notice the
provided function pointer and start using that. It would not be too
hard to pass back the pointer using FunctionCallInfoData.context, say.
The downside is adding cycles to unoptimized cases to uselessly check
for a returned function pointer that's not there. Perhaps it could be
hacked so that we only add cycles to the very first call, but I've not
looked closely at the code to see what would be involved.
Has anyone got a better idea for getting hold of the alternate function?
regards, tom lane
On 21.09.2011 18:46, Tom Lane wrote:
Andrew Dunstan<andrew@dunslane.net> writes:
On 09/21/2011 10:50 AM, Tom Lane wrote:
The other question that I'm going to be asking is whether it's not
possible to get most of the same improvement with a much smaller code
footprint. I continue to suspect that getting rid of the SQL function
impedance-match layer (myFunctionCall2Coll etc) would provide most of
whatever gain is to be had here, without nearly as large a cost in code
size and maintainability, and with the extra benefit that the speedup
would also be available to non-core datatypes.Can we get a patch so we can do benchmarks on this?
Well, we'd have to negotiate what the API ought to be. What I'm
envisioning is that datatypes could provide alternate comparison
functions that are designed to be qsort-callable rather than
SQL-callable. As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.The idea that I was toying with was to allow the regular SQL-callable
comparison function to somehow return a function pointer to the
alternate comparison function, so that the first comparison in a given
sort run would be done the traditional way but then we'd notice the
provided function pointer and start using that. It would not be too
hard to pass back the pointer using FunctionCallInfoData.context, say.
The downside is adding cycles to unoptimized cases to uselessly check
for a returned function pointer that's not there. Perhaps it could be
hacked so that we only add cycles to the very first call, but I've not
looked closely at the code to see what would be involved.
You could have a new function with a pg_proc entry, that just returns a
function pointer to the qsort-callback.
Or maybe the interface should be an even more radical replacement of
qsort, not just the comparison function. Instead of calling qsort,
tuplesort.c would call the new datatype-specific sort-function (which
would be in pg_proc). The implementation could use an inlined version of
qsort, like Peter is suggesting, or it could do something completely
different, like a radix sort or a GPU-assisted sort or whatever.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.
Why couldn't they be in pg_proc with a bunch of opaque arguments like
the GIST opclass support functions?
I'm a bit puzzled what the arguments would look like. They would still
need to know the collation, nulls first/last flags, etc.
And calling it would still not be inlinable. So they would have to
check those flags on each invocation instead of having a piece of
straightline code that hard codes the behaviour with the right
behaviour inline. ISTM the hope for a speedup from the inlining
mostly came from the idea that the compiler might be able to hoist
this logic outside the loop (and I suppose implement n specialized
loops depending on the behaviour needed).
--
greg
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 21.09.2011 18:46, Tom Lane wrote:
The idea that I was toying with was to allow the regular SQL-callable
comparison function to somehow return a function pointer to the
alternate comparison function,
You could have a new function with a pg_proc entry, that just returns a
function pointer to the qsort-callback.
Yeah, possibly. That would be a much more invasive change, but cleaner
in some sense. I'm not really prepared to do all the legwork involved
in that just to get to a performance-testable patch though.
Or maybe the interface should be an even more radical replacement of
qsort, not just the comparison function. Instead of calling qsort,
tuplesort.c would call the new datatype-specific sort-function (which
would be in pg_proc). The implementation could use an inlined version of
qsort, like Peter is suggesting, or it could do something completely
different, like a radix sort or a GPU-assisted sort or whatever.
No. In the first place, that only helps for in-memory sorts. In the
second, it would absolutely destroy our ability to change the behavior
of sorting ever again. Considering that we've added ASC/DESC, NULLS
FIRST/LAST, and collation support over the years, are you really
prepared to bet that the sort code will never need any more feature
upgrades? (This concern is in fact the source of my beef with the whole
inlining proposal to begin with, but allowing the inlining to occur into
third-party code that we don't control at all would be a hundred times
worse.)
regards, tom lane
On 21.09.2011 18:46, Tom Lane wrote:
Well, we'd have to negotiate what the API ought to be. What I'm
envisioning is that datatypes could provide alternate comparison
functions that are designed to be qsort-callable rather than
SQL-callable. As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.
Quite aside from this qsort-thing, it would be nice to have versions of
all simple functions that could be called without the FunctionCall
overhead. So instead of:
FunctionCall2(&flinfo_for_int4pl, 1, 2)
you could do simply
int4pl_fastpath(1,2)
I'm not sure how big an effect this would have, but it seems like it
could shave some cycles across the system.
We could have an extended version of the PG_FUNCTION_INFO_V1 macro that
would let you register the fastpath function:
PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath);
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Greg Stark <stark@mit.edu> writes:
On Wed, Sep 21, 2011 at 4:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
�As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.
Why couldn't they be in pg_proc with a bunch of opaque arguments like
the GIST opclass support functions?
That does not mean the same thing at all. Everything in pg_proc is
meant to be called through the V0 or V1 function call info protocols.
I'm a bit puzzled what the arguments would look like. They would still
need to know the collation, nulls first/last flags, etc.
No, I wasn't thinking that we should do that. The datatype comparison
functions should have the exact same semantics they do now, just a
lower-overhead call mechanism. If you try to push stuff like NULLS
FIRST/LAST into the per-datatype code, then you are up against a problem
when you want to add a new flag: you have to touch lots of code not all
of which you even control.
And calling it would still not be inlinable. So they would have to
check those flags on each invocation instead of having a piece of
straightline code that hard codes the behaviour with the right
behaviour inline. ISTM the hope for a speedup from the inlining
mostly came from the idea that the compiler might be able to hoist
this logic outside the loop (and I suppose implement n specialized
loops depending on the behaviour needed).
None of that stuff is inlinable or constant-foldable today, nor would it
be with the patch that Peter was proposing AFAICS, because none of the
flags will ever be compile time constant values.
regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 21.09.2011 18:46, Tom Lane wrote:
Well, we'd have to negotiate what the API ought to be. What I'm
envisioning is that datatypes could provide alternate comparison
functions that are designed to be qsort-callable rather than
SQL-callable. As such, they could not have entries in pg_proc, so
it seems like there's no ready way to represent them in the catalogs.
Quite aside from this qsort-thing, it would be nice to have versions of
all simple functions that could be called without the FunctionCall
overhead.
Hmm, that's an interesting idea. I think probably the important aspects
are (1) known number of arguments and (2) no null argument or result
values are allowed. Not sure what we'd do with collations though.
We could have an extended version of the PG_FUNCTION_INFO_V1 macro that
would let you register the fastpath function:
PG_FUNCTION_INFO_V1(int4pl, int4pl_fastpath);
We don't use PG_FUNCTION_INFO_V1 for built-in functions ...
regards, tom lane
On Wed, Sep 21, 2011 at 5:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
None of that stuff is inlinable or constant-foldable today, nor would it
be with the patch that Peter was proposing AFAICS, because none of the
flags will ever be compile time constant values.
I was referring to transformations like this which I believe compilers
are already capable of doing:
v = ...;
while (...)
if (v) {
if (a < b) ... else ....;
} else {
if (a > b) ... else ...;
}
turning it into code that looks like:
if (v) {
while (....)
if (a<b) ... else ...;
} else {
while (....)
if (a>b) ... else ...;
}
which may not look like much -- especially with branch prediction --
but then it's much more likely to be able to unroll the loop and do
clever instruction scheduling and so on than if there's an extra
branch in the middle of the loop. But if there's a function call to an
external function called through a function pointer in the middle of
the loop then the whole endeavour would be for naught.
--
greg
On Wed, Sep 21, 2011 at 4:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
This is a marvellous win, a huge gain from a small, isolated and
easily tested change. By far the smallest amount of additional code to
sorting we will have added and yet one of the best gains.I think you forgot your cheerleader uniform.
LOL. I'm happy whoever and whenever we get large wins like that.
Go Postgres!
A patch along these lines
is not going to be small, isolated, easily maintained, nor beneficial
for any but a small number of predetermined datatypes.
That was the starting premise.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
I've produced something much neater than my first patch, attached,
although I still consider this to be at the POC stage, not least since
I'm not exactly sure how I should be selecting the right
specialisation in tuplesort_performsort() (a hack is used right now
that does a direct pg_proc OID comparison), and also because I haven't
implemented anything other than qsort for heap tuples yet (a minor
detail, I suppose). I'm pleased to say that a much clearer picture of
what's really going on here has emerged.
Statistics: Total runtime according to explain analyze for query
"select * from orderlines order by prod_id" (dellstore2 sample db), at
GCC 4.5's -02 optimisation level, after warming the cache, on my
desktop:
Without the patch:
~82ms
With the patch, but with the "inline" keyword commented out for all
new functions/meta-functions:
~60ms
with the patch, unmodified:
~52ms
Recent experience suggests that using a macro rather than an inline
function would perhaps have been a mistake, as it would prevent us
from benefiting from the compiler's smarts in surmising just where we
should inline. As I've pointed out, individual call sites are inlined,
not individual functions.
Tom wanted to know if I could prove the benefit in inlining, as
against just resolving the impedance mismatch without bothering with
inlining (and, indeed, further optimisations that the compiler can
avail of as a result of knowing the comparator at compile-time). These
figures support my contention that these optimisations have an
important role to play. However, there's no question that resolving
the impedance mismatch is where the biggest benefit is seen,
particularly relative to maintainability. As Tom anticipated, the
question of whether or not we do the inlining is less than entirely
straightforward, as it's less of a win, and it has a higher price in
ugliness. I myself am very much of the opinion that it's still worth
it. As I've said, sorting integers and floats is very common and very
important, and this approach will likely yield benefits beyond
increasing the speed of executor sort nodes, as described below.
I accept that a more sophisticated benchmark is required here, but
I've been a little busy actually writing the patch. Any ideas anyone?
Independent benchmarks are welcome. If someone could suggest a worst
case, that would be particularly useful, as I cannot think of one - I
believe that each instance of inlining a call site more-or-less either
is or is not a net benefit here, regardless of the number of
comparisons performed, which is why the improvement is so predictable
across the size of the set of integers sorted for by qsort in
isolation (which is the big reason why that ~8ms decrease due to
inlining turns out to be not too shabby - it's a saving per
comparison, and they can add it pretty quickly). So, for example, when
I build the patch on my laptop (GCC 4.6), with an 48MB table (the same
orderlines table query, but I doubled up the data a few times
beforehand), we see an undiminished, proportional (to the prior,
isolated cost of qsorting, I think) decrease in runtime:
Without patch:
~615ms
With patch:
~415ms
One key insight that this work brings is that resolving the impedance
mismatch - making comparisons as inexpensive as possible (including
using inlining) - is the best possible strategy in improving sort
performance today, vastly more efficacious than, for example, tweaking
the sorting algorithm. If anyone can find some more ways of shaving
cycles there, it's one place where it really matters.
Changes are isolated to the extent that if you decide that you don't
like this one day, you can simply remove the calls to qsort_arg
specialisations, and once again switch back to just using what the
patch makes a fallback, qsort_arg itself. I know that we'd never
really get into that situation, but the fact that we could do that
serves to illustrate that these changes are fairly isolated.
Incidentally, if you find writing code that is heavily dependent on
macros to be a chore, I can highly recommend Clang 2.9 .
But what of the maintenance burden of mostly duplicating qsort_arg.c,
to produce a "template" version? Well, take a look at the *entire*
history for that file:
[peter@localhost port]$ git log qsort_arg.c
commit 9f2e211386931f7aee48ffbc2fcaef1632d8329f
Author: Magnus Hagander <magnus@hagander.net>
Date: Mon Sep 20 22:08:53 2010 +0200
Remove cvs keywords from all files.
commit b9954fbb4ef25fb1ea173d26017d4d128dd15be5
Author: Neil Conway <neilc@samurai.com>
Date: Sun Mar 18 05:36:50 2007 +0000
Code cleanup for function prototypes: change two K&R-style prototypes
to ANSI-style, and change "()" -> "(void)". Patch from Stefan Huehner.
commit b38900c7677657a815e75781b776fb1e41054df3
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu Oct 12 15:04:55 2006 +0000
Use Min() instead of min() in qsort, for consistency and to avoid
redefined-macro warnings on some platforms. Per gripe from Hiroshi Saito.
commit f99a569a2ee3763b4ae174e81250c95ca0fdcbb6
Author: Bruce Momjian <bruce@momjian.us>
Date: Wed Oct 4 00:30:14 2006 +0000
pgindent run for 8.2.
commit 6edd2b4a91bda90b7f0290203bf5c88a8a8504db
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Tue Oct 3 22:18:23 2006 +0000
Switch over to using our own qsort() all the time, as has been proposed
repeatedly. ***SNIP MESSAGE***
I think that it is fair to say that the maintenance burden imposed by
this change is well worth it. Only one of these changes after the
initial commit is not mechanical, and that is still pretty trivial.
I've attached gprof flat profile output from unmodified PostgreSQL (at
master) when we create an index on a pgbench table:
pgbench -i -s 1500 index_test
That puts the number of rows in the pgbench_accounts table at 150
million, while the table is 19 GB in size. That’s a reasonable size,
and should usefully demonstrate the likely improvement we’ll see in
the performance of creating an index.
I increased maintenance_work_mem to 756MB, plus used a fairly
straightforward postgresql.conf, plus some other, more generic values
for other GUCs. The query is:
create INDEX on pgbench_accounts(abalance); -- abalance is of type integer
Here is the top of the flat profile, by far the most important part:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
43.56 79.29 79.29 6972025063 0.00 0.00 comparetup_index_btree
22.39 120.04 40.75 150000000 0.00 0.00 tuplesort_heap_siftup
4.36 127.97 7.93 2677057771 0.00 0.00 btint4cmp
2.88 133.21 5.24 314766350 0.00 0.00 LWLockRelease
1.81 136.50 3.29 314612660 0.00 0.00 LWLockAcquire
1.65 139.50 3.00 450905247 0.00 0.00 AllocSetAlloc
1.43 142.10 2.60 300000001 0.00 0.00 LogicalTapeWrite
1.17 144.23 2.13 comparetup_cluster
*** SNIP ***
This looks pretty top-heavy to me, and more time is spent in
comparetup_index_btree than any other function by some margin,
suggesting that it will be worthwhile to pursue similar optimisations
to make index creation faster in a later revision of this patch, or as
another patch. I didn't take care to ensure that I'd be caching the
entire table in memory, which probably would have been more useful
here.
Thoughts?
On the subject of highly ambitious optimisations to sorting, one
possibility I consider much more practicable than GPU-accelerated
sorting is simple threading; quicksort can be parallelised very
effectively, due to its divide-and-conquer nature. If we could agree
on a threading abstraction more sophisticated than forking, it's
something I'd be interested in looking at. To do so would obviously
entail lots of discussion about how that relates to whatever way we
eventually decide on implementing parallel query, and that's obviously
a difficult discussion.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
createindex.out.gzapplication/x-gzip; name=createindex.out.gzDownload
�xN createindex.out ��Is�8������5��27[�]������c.& ��f�i�)Y���=���$;#�+Ke|����~We�uj���b�����,����x���7���������X����v�,��,����3�-�K��;k��7}V���+�/�b�-�YyVU�G���f_������B�����L_�$rm7�C��W��=��7�S���|J��`?�}�2��r���!���o���� l��h��X��}z��S��w�X����H�r�x�F�DQ�P�}_��������<o��?�\��?
C/ ����M��+�X��3Y�.v$�
#<\"��������s�
HH2A�������G�uU5�-��?9��������������%��Cs(a��e'�w[�b2N$8�����Y[�������Y+@����''dGb�AzjB]��i�������� [l��\!��. �n�R�%;��Eq���+�����������U�����"C<O C����&1�@�=�^�����>�� A`�E_YV�����E����m=ja���
�pgK�g?K��cm�a��� ��G����a���J��A���VD����Pi����H�0�q�y� �O��\VEV��
���@��=X)0.�2r��}���!n!��|��!H��
� 6�����
!*�;#��������W�:��B��9qBox*�bg{��=_�+��,��1��$����"�'���'N�|X6>�7�D�6�j�%�y��@�a�.5����?���78�<��>5��)lL�~
A��H�'����sM����9|�����/���6�)���P�!�NV�v&oNOK����,�7r�!�h��/�����vM!�|]!��f�;8��F�c����}����}��oEhp4*�P��KWac��>�X����c����Yu��$��;,�Lx�O�k���w\8C�?N�JW#<�%B�x�YZd}��?bFb�����A7��K��Gvl�����A����H��n�$�t���i)�p������������ !����b1Jq� ���J�M�L�L_&T�L(Cw{��}VWhN�l93or��s�!�rz�xwW���,?�v��r#�p������x����6SW�,�.���eV��d�����Ue�>��b��������]�g�����~{�r1G'�0�
f��R����u�_��i�#gu����$(��G�n\o��������z4_�\)��p����L�"d�7Qo����>����fK32FAp�m�i����!�b��z.�mie~]��� S����J�;� m��3�r�����r��]���m�lI���-���tHpl� �3:�&g1C�G�-g��dE�������h����u5FrkF�?���w�VfL_���c�J�Iv��pG��d�3�I]
q2)�����H7��$(Dz,�
�P;�^2G�Mw�Y����,X{&�*�.����.�Z@v����a{��/�:��z��$PR!��I��we�^�9��]
% ���\���@���)W$����/2�C��B$��'X���
z�Z�B[��X.����L�0�m�����L'������S�T+�({(����M.0&�B��!����M��pc�[�������}��L�k�^�7���
5�������a/W�o��o!y\��a�T3���PI�:0�+���4W�:�bu��mBL�n�x���+;ee���G��S���6��PX�#c��dr��0Uz��I�>��
qS\5��{��@^�}�������`�����������>�j���JZ���9��X��1����m��"s�������n� Z^��;��%L��Mu�7��k���p�B�H���Oe=������s����r?�uu��%K��s�,B� �zf��=@��
�c9j�P4Gs�&�O�q����Q@�v�����d1��D9|�4N�����6��m�������kF0c�k��LQ����hnd��x���j���5��%��IYSR�]���%��3�pF���.��s��9�I��48�6��o�`��h|���
jaUO^��j�$f�o�3��f����q���!7
�M>,�{����xd�~�R�'�|i-����#���^/h0_�
i&p}>��
���v�P�Z����$@��X�������4�``Bf��6;��8�OkBdJ`���)�larAY��P�1�,X���z2S��j��=U2��'� �,u���
�Dsns��������E�8�Q�|0���7����A�E;��i�o6�;Zjx����������6~f�'P89�����+�oh>?��xz0��%���HPj�H1�9����upy_Y���y�d�2rb��C�+p��(��+�?
W������*���O�K���u��F�O�fo �Z�f��5aUjrC��GO���ai?4���47�K����]�3E�����6=������
W�0"cG�c=8Z�E!�Y�fe}��!^��4Go���?C(�c�H������( �����o��4e��j�J�5�#E�
s�4O5fg�p�]U*��2��p�Zu�i%?�����5�[C����Qy���'��u�����+�)���;�p�/�^p���|c�����U�yW�������`Rh��\�-��;+��03�X�������l���W�����tZ)!��O��|R�����m�OcJ�F8��2B!.���p�
�Q���"I]5�� ��-m��5}67e�����\�����Z�<@����5�";�YkFT��A�>L q:f��D��|Y�I���,T�����A���c�-��\�l�7��
�:����t����;b�X���m��* �v*���0K�����u8����&�JL.]��8#�`Q����*�=��s��z�L�^5��*9���������3�����Msi�VRa���q�j�9���sY���!x�G��5�q�������$����)�%F� �Ce�I��M�����V�)RV5�&]�V�DM�%2Jlr�oy����^�}�� ������C~k��M�m��R����|�����X��g���N�z������K�{���F�����t��w�c�3���J��]��|ju,�v����O�����n,����>�y@�O��{L���I����:��'^Gp��f��i�M���V���`��u.^�H���v�ga�uHV�p1�������������CV��c���2�\B��Q�|��3�D@g�$B�`�uH��D����_gvJ�o0������\vP���M���� ";]�?���p�=�t������DS��h�Co�<3�fy�4`Y��<�Y������R/���Mz7c�nfF�P�f���R���]5��nY^D�z��U�>dm�tf&��T���9�Y/���*oj���:n��O��k�K~PE�);�{��d�h`�A��6��������Y�����6�w�����i����>=UC���9XEC }����������N�kw��]�2�}���vM=�Ta���q��c���Q�Y��*�r�]r����XI�.}0#��ME�`���=��M��/��&�#q�B�"�T�NG��V.�U�t����&��x?���f�'~�����H��
F�p-9dl��M��m�������������0]��6Z_��[�}:��c���u&*����/��7$�T�y/�g]��2�j�����4�$oj,��3Z<����]5%�I
/��[�7��FS
B��j���S[k�� r�������?h�u
L"S���b��z i�w�� twx��<~@D�2���;�h3���>���r��\R~�
���0.-1G�o�G��P������C����u]�����B���uP#Kk�&j��
�Rl�q]y��:��o��1NB�����v�@��8a4�-�Y]�y���j�^
p8�)����?��Q.n�
n2vl��4���M���SS?��.>�/dm��B ��&����E��
�Xx���]5����h�e�s^A!��[���������$t�L�3�H��[0�A��e����&
l��������C���Y��Kh�g����Mk�}9ex�����!7��p�5�7�����h$';?.@�����z��'���.�r�������������I�-�2��S����d�#�9e���i7�������G^�x���.+��J/
&�U�s[�.�DyR��Ug�s�m t���G�NU��k�dg�S8�������c(f5L%7��Es�X 9i�������<��C��86Z��iS&B��ll)��M�O�or��D�<�k7����y�rYwm�+):���-��O���Q�.��M/f���8\1�O�
�:�����~;�,������������@<��?O�7�LOd)2;�Pk��*��bq�D?�������G�7(W�h����n�12�O,&������4�094u\���&]0����9� E�X�M.�'�Ba���c�!����-=�����jM�K���L�dM�^���]��d��_�_N���������M�K�f��>�����������g�a�.P��������NH_m� 4'L���2���M4�W�Z�H]���KZ�wn�b&C�e�p��4;�r'��su���SE
/^�Mm^:�U_Z�[r�&^��#�>���8Du���E��Aj�c��,X��w�^��sc*���Yw�����^H�w{v��n���r��������p�����J�%�C#�����Ij������V��t�s+�XqGd����ck<h���V��Y���o�5z��L�C��}���������6�[Rd�[��m
��7����V�*����4�^�����74����)��_�����FE0�0����r�B�����:�8�))^�������;~1�a(8��xQ��b�����~�O�����A���M_��
TE/_����0` �7�kM�p�5<��.��Q:���v<}����<+
Q��z��H[|��u�����-�x,o�����Mg���S�;� �mGFg�6��7������`:�twc��?h��@���4^��D���"�}}���
FL��~��#�o%��/�`�h~������Y��@�m���k%6�c����F���t���k��r0K�R.�bC��4�wy���8�3�����jSx6=C3�[q�,��O��������[��6N ���sS���`)� r��u������?��+R����g�� m���xJ��1eZ+gE����3������ ��_}���'e�����phE�W��X�����"��w{d������6�Sk��a8�0�Z�T��~BZ4;v���cc|DW����ep�m)����E�����0�^mc���$��F�\;1B/o7��^�p�5����/�����DZcy���6����z���7e��O�[�8���Y�_�aT��WH���������[;�}���k+�_��Q���v<x�����*p�m���V�:���
]h0�����W#�DJ0��c�e�7xSp�M������'��8�x�%b���2��ct��|���w����4���C�~��
�-�k �����
���6������W
sa��-�!������<�{�T �3;�5�����������W���i2�
��+&G�V@�Y#�������Y>�l1������V��d0~lS51
,?B��1a:��U�W�����0��A��K��\��5���5t��2�s{N-���
�}S�K����g�==��MSiO�-?T���.f�G(
E�Q�3�%Tf�7�n����&(����L:4��1*
X���Q������������Vp�%N�,�x��ob�� E�b��pd�*����i�1�f�$6����M��Q���!^<���>�1����d���06Xt\$+ �+��-�>����p���$�>����m�d��d)�o�1��oy���"
�,����B>O��f���G'���U���<:mr�/����2����7�� 5����M��%�����=n#[�(��l���/�5S)������vu��r�v��6�qA0%��m��I����~b������w��\����->
�����,�|w(�S��D�#B.�nz��D�T`�Wn!9l�'�G�9�Q#�T�,�\V�� \�dI>
C��9����8�����J�&�������/�6:
=6�^�~5������A�b �6MU�����OJ
��u�Gd���oH��)���.x���3��sW�+��-h.����j�yg��/�A2n $6��j�f��>~z����������5J��q�!+���E9
�2�`G����S� ��6O�����j7���0��l3����"c�fM��=:��4��y4�?�3��%�����.|D�C5Y�����j�W+{�����j�����0o���J1.;����W��������n�*����U3�q��!rq�i�����@j�h�T'��������N.��Lf�]���p�����3�fL��p�V����T!e���T��Q�������A.�$���'��G����y~���5�)~5�;�L�P����B$�W*�V���E/�h�u(�8�T��EV�����>=C����#��v��>���-#R�\�fK�
_2�f�]x��I���s$��7��x�����#` 2����{f�;5e���K����S&��?c�i�a�vL+�n���X�9/"N����\�1�:�]g����2�z>�PJ��$����uq�����8��H���.9����u���t�_.�2� �"�������5�e�By�V���t
��P���RO�]G��`t�'�������� @���U\���)�`�P�?��j�z��H���9���M�����(#���b���I��g2�ik�ZeJ�pB0T��<��(�{�tW����LH����m�&��F��[�I%���5�~��{�����b6���o�. �'�FFpP@��1�=�����1����G��L��%�'�L�\��G��K:����5~b;#�]�DW#����&�E�(�n];g�qo�%���n���~����HI�q��}[���r� �>�3b�2 ����9 �����n@��g���+��;�U��C�,��qK��$������X�>��=)��5S�cQ.pj��$!��u�����T�w�m���t���q�E�����G���T�G]*�������t��k�r!/�;�������11�r�1i��f+l�"��$�
�������$Is�S?�����!;�-!0G�"%��=�78���j��� �[����{�1���j����=��_�eJMct����������Tl��8�cv��c���G���(��7 �q� #�o0c�q��)��A�K�}BI���px93��0�Q���I�s��Np�;fgg������9?���H`ifS��`RF��e������6�l����#e�4�h�$��cB���;��;&2J�w�P:
f�pT>�QF�M�~� �=�����������@1��"�1;��� ,�d2���hc���v"�B��?
f����`���*��lS���3�N{���(�3z���
z�#�����gS������L�k�A���PQ1��Z�?
b�����3.8���`8loG��F���{:=OS����y$����`����1o{���;jK<n6����@#(`�������?�~:}�ixr����1
0V�[g�7&������ �����.��z����t�����6�O��:��n�p� 5n>�w���ufFC�X�2�.��k����r���w���,�:;F����d�Ra�����q5��+%T@�)L����9
�n�]V'���:#����P����+��������#�����
���*w�[��?��\+���x��AI�y��]��UyS��0�����DR���:���c+LR&��$�y��������|lv�f�m$�m�,�sN���dS+!�mm�YGG���#���*�2P��o3���N������AG- ������BL b;�i$���u�"4�C2��N��F�%�?h��f�:�i��m���4v�'��6�����>�����7t����Y�>?��-��1�k7��)�$�gO��c6O~��=S�����
e�k�j��c<��O����!LS��������7���#x�b��.�����j�O�8D�j%�#3
���t�����h�����MM!5A�~�ysVtY������_f$�� ��9v~w�<fLl��U��X��ur������=-�RS�F#����"�X7�Q�'��b�H��E#$���x�JB��s�
��?�
67����1�������w���v�p0����K�+�8��Q�%P�@�9��o9�7/�5x�o���`���$�M??����b��(S�2������{���W(j=�n���5'i���v���t}�cN����,����{H,��CyW��m.�j������f,�����A��F1���I}S�c��@%��J��������R��3.���I��_x�a��h<���H�Cf��\2�-��Q� F ��
�G��V#0�����pEW�F�N�A����C�R���o^�<�$v���7��%�wM6]�0��.�f80�dp�l6����W�����8"���a���X�t�L��;��S
@���;b{�1n�!���������
%�Ng�]�[}eN��X�s����`�������� ���vC��mBa����!��|,Dg%+�d��
�_��f^�t��y��2�fX"�k5
2��1�J�x� l{.SZ��:�N�+s���yS2�F�>�y���,�c�2�B��<>&c_-/�e�������Y��RJ�"�
iA���1���m��D�EY\��U��X'����`��l���2�F0SP
�M������o�M�q�Q�#X������B>�1����
�������mF`L�����(����i�(�-c�\Ppt@V
o��9�uS�a>��8�D���M�-s+g, b�`��%����{j.�VJ����
�U����Pp�b�B����\�3��f� /2��N�%J�8�� S�����M�XV� ���t�7m�v���%ARR�F���DJ�`����*��%����_<0�B���#p����3���ch�T�#�(��0�;m�U���:�u���%l��?Ax������� �]���� �{R�\��OZbr���IYBc(m�m�^���v[v:���>����>�p����3X0O@ab������Kp���Z3�-�2��B+�:���S�P�`��J���2��MHJ��p��N#�w[9�k�?�=�<���s����X^����y���(>�Q�������&��e?>�QA E�����mW����g�$$P�"��0�)�J:f#��n=TV��L,�f4��Y�&�vH�����]i�#���1�Tl��@`�>=�l��0�l����f��q��B�hF��
��w�@^k6���3��WJ&��|�]������G@x�.��f��I1@��$uv�[�!��<c�<�z��.��y~U��lK0\:~m_�x��4W%0;8��t���F8�������������c��������Q�')$&^�)y�h���`N��f�CN�tF����D�K��9�#h�H��6"�TL4�"V���2�����0������]t�����;.Y�m�*K^_iN5sG
~����ty�A��L�������qv�Kp�NC�5������l?`������T^��)W��� ���F@�nT���F �e�tU�%�]��'����0f ���
��F�/�J_���Jt(�l`H�N3�!�z���z��n���57sQ�)�{#8xelO��a����_�����`�������,�J�\�_A�,���\�?����d���C�%����B��C�$�2�����}�-���"o3��!K��e�x���bKMJ���Y(=1�#(J������Y:��hF�vn^ ��������,v���P�Ey>�]^��6����%&7y[n��b)bL;��;VL_�uG�I(���&��M�[�E�+y>����"T3����!�b$@����>����Jx RT�Y~r�p`��z�U8���m[w9E������e�(�J��8V�Nz�8�������d��*YAK�= /a�I�|3��!���&��nWe;R��(Yt"���r3F�g-���T!�^����<���V2�H�����MG��O��s��z��O�c$�+�8���6E;�cm-��S��6��HI�������D�tKuA[H��
v��'�f��B������w���OzL2�>Fzj�1P�{8���f��`O�mWW&���������C�J�g�
=Rz@�A��`N���#8��e l)
5������#��\:[$��Yv��>X$�q�v6����s������J0c�t�X��[I����������4�;��?�:�*[Q�)P�F���{z|&�o_�;����%o��<�GQ-N���P8�����Ol�B�X#��� ���R���x��_��1�E3�_1I�j��>���t��$W���5B��(�-dF�/)��@U�lx�{b�0C~ ��1���LB�����KBT�7j�;��7w#o�B� ��L���/�I�1�a�G�L8$��������J�����4p�����3]�L������P"����t�<!�UB���0j}*��$��]� �Z��z|��7�"���s^e�>V'J���B]�A�I�h(�493 NT�J�C�^��2K}^�|3��b�{C��7?������@�:�2�B������^�&z����a�-�E(�!��(���^���4�u�eN��i �nCY�e����v{a��Q�S����U���q2��9M����@�Q(�3 ��7���)X��("���[hf����'�Y�4���
���c���}�9(O���4�h���4��H���2�z������)�G�mD,�l�X�p�������]s_I��lpK�����5g�]�:��<�6_2Z�����v�Uv����dC�{��_��|z�[�t����B�6�����'�:�V_>���Aa<�` ��
�@B���}�+i�%�4�2
13���!���V�� #`��
�i�h��f����{J=()�����J������#m8F�_{��M���(�X���f�����i�F0 ��5�C���DB���������"�"M���X��
$ ���bTC�oF ��1@`�9pP� ��05�y�N@��TvK���5�8�������^�h
�M�����>�{���3Z�J�Q������U0������@��4�:�����1�y��]����#��Sb�$��M-3Y;v�+j� {K4E%��fT�����=~�41��@"e �B[�k�h4�-�^�oYO�I���+(^;3�=��`y�����n�_�^hG��$E��8x���������"{c�I,aXkn=�]/D5'ozo��h�u���%5Y���S '��EX�I�� �(���c�o��T�Y�T��r1�]�s��cE��f�b5��;Bv�����p��`�������[Y�W5E�N"@��4�8�H�O�����GF��M��^�M8�U�+��q��m��[~���9od�����R�V9��b�)��l0����HXGw6'8!���� ��#�5i��f`3D*!���,���p�������&jg�h��-�<�I�D��!B[����
�����@�GQ���/���9v v<9��S��6L<Z���q���6/���� ��bu6��D�B����LNBj�<�&�SMI�tpG
0tz�m�����G��~*AVP�������TXf���'��h�����j*f�Q*�F0�4i�0`����>�E��@G�+�S��
�����Pp�9 ��6zwx`�
�/>f�,�f
$�>���yd3[VM�[Z}����$�%4���_���X�C�0��Ja�Y"r��#g�6�R�;f,�����`�B��F��fn{�>nh'3�[m��b/G)�5�6���(IF���X���d�p��O�-�l�b{�$Z�/
���$�sq��3������`�*��Z� g�
1wG��1�����+*����CVM��9��S(�i��q$��xK���"�� �L��'�np���cp���+�����6������o��@p��;%���NyB� X�������m:2rK�t��H�?���eE}b�� �g��_��Ff���59����0?������F�O=g&����8hTf�*��Ql� �[H��#2[����mZm���� j�u0@�����I
���U�-/O5�����-��o����>[�`4u�c�9���$$�� ��� ���yH��@s��Je�-�i��(}�?g� �!����{,�|��������i`3���<�`��4o���]��~k �9xL��Gp����M��o!
�t��H��H����>�M��:LuQx8��Ui� �.M�5>�����x����
5�������Ul(.#Hr�>o�:��R'���*���!�-1�|�0Kx�����_��#Z�&hy�=�� z�#��w
k�
*��(�P���;P%��a|����f��ug��?QT#o�e}<���"�T�i�a
o�}�Y��;#�X�"�f���f�=^e�o�����#�� !j���^*%h�F��d��B�myh�o�����#��+��3����|���_�;*\�`��n!�wy�n-fUW+���#y���(-w&P�B`�"��_�U@���a���8D��<H�q���F�\�����?��7h
�������7f\�sNGD�5��i�a�H���QZ�s��(���B=>��� {v�� E ")�{}0���B �:Hh��Y�9����T��-�������(��]lP���AIg�9�V����i
��#�'����H���-��������2<����(����z���%f9>K��a��N�6����������6���h��\
s��5{�#�nq�I����L�^��-a1P�1#ZF���n1�lcSV[M�R"#h�zm�9���Uv�Q��#(8�s��;�E�G,t'��������E����n4o:��m��P;h`8�CJ��4��W@�l34U��:�D����d�B���o�����a���)�;f �����%��F�zGIB��k��� 1��,fL�(�I����K8�X�k{yJ���B��1�yU��6���)����������3���:��M�y�}���s� �p�)��������4 ���e���2uzcm������z�f�H��7�����t�y}�������f�@�?jF�$S�=��ti��f8>"����M��Z�Y��P�@hl������:Z�304B����<��UL�=�X,��"%��G�0�����3���T��1bZx��-�����*��W����Z�������]�=U��� ��6oBm�9�_���$�J�IMi�i���� ;4����G�����U<|I�3 ��`q8�-J����q��#� u���W��a�=�BfF����UU�_.MY�EA�����0��D|�����nYQ����8h�
��[A��#�)7�1�?������.h�x�s���� �����V����8G����rP�l~������������$:{8���|f�
U�/��+p�X�v �e�=;)��l3DwL��S�-��� P��'������y���$/�%�wvEo^abg�i�R *�0�����x*.=>cF�-��=d��\B}4�����>��H= �p�x_��'���Uf��2S���9iuV�PlG�6�U!�����M��jycl�n{�|Sn:#d?�� *��`���W��z�nH,���&K)Wy�km�YNRFIEK'nY>���8�8������q����A�����%������)�i�f��mp =��e�L���1�`�x@��*��n�nh:����@3c���>m�I�fhp@i�3�4lt�K�s��3��oD�����7����MVl���^Q�2Q���f���]7}�@[�<m�mR@M;�5��4��cK����4�@A�tN�18���� ���4A���>fh��i*[KH��1!��N�:����e��e4��6/L~T�P��u0���+F4�&:]G���B �N���%������]7���M��l��h�B��M^��(�g����p�������#������������LPd3$x�n�����
M�2V52� �S���#l��v�o���i`��[0W,q�5��� {����� ��j���G�'���;6%�}�b�-��3�~d���"��} �
R}�1��k���*2E�;v�VHk���D�GU�2�iB#4��{�ntJ���������r�P8���d[!�mIk�9��d�z3�ZOg�����fc=���P��:������zG�?���-�;�/�G��YX;�hy�}M^��T� } ��`�t*��XE�m��%��c����)���H�3��9f�1A>��6>} D�t<Q"�#(
T�&��a� xq��F$f(������sj;+L/f���{QLc3��b SI�a����z�MC E�o%a�V&����#������PW��,�'
E��������/�l3��/���N{"_�d;�7>~W*�����a��9�0�������5�G���H�G�����9l0)�5�d��D$o:�P��� �W��{�lP5x���������M)�G���7��g����<���� �����]��u���/��6��5p�,r�ZR�M��'%�!wf�l��]�,1@��B��lO�:�(�o�ioF��Hj0��*Q�#Fo��������/����������h����N��2�U�����g9��]}�7-��$6>�LB\��x����I�I�$��I��x�l@���>�uZ���C%�7#�5���Dr����+3D-������v4z��Z��W�o����wB����m^��z|y���1�k����b���[�����\<Owm5?��,
�� �0^J��]@D�G(��;5=�a�s�B�!��i����Iihse����AI�3�R�lR����Lk}jF�� ^��f,?3v���%�J����eO�h�%Js ���1��M^1
Qn���Z����n�B(�D�#���f��f��7��oNM�G�9�����l�x����?�ye�`VJ<���-�|�Eh�*F�2���u�y0�deNK���������Q>��8NG�A_ Aa�@�;�3Q���mO^LN���,��%"��X�1���1�!lK��3�q���>�Q�`�c�1& oQ=�����Ej�� �"����(��3?|�$Z�:3 Oi��e���� ��e�T�a����N(�-��m~��K���b�%g�l^���#X�cpxf �H# uv���.4f�����:2?�u��ZQ�K��5�`�>��*k���4%c]�/ �%�H�,�A���j�}����]��ZK�MkT>����s@~}d����Q�{4:�8m
��p���%$�?��EF�+r�BH6C�K��f����{�=pPyt��H�~�� ����{��>�� b�D
���W�XciT��+������s��PjC/����!I�n`�����1 R$DS2�G������H�����5�3�<����6��i���\!�>+�^��fB{g+�����(�fHI
�P(�����p�1��U�.��3#~�y5 ?�:�������N:9S����,����0@b�"�fh'���� =�8��\�����=�uz��(�XL�K!Hj
�V�<B_���(��W�b�����{^r��e���B�%6�!,����!o�VUYq�Q[�
�����}V��l�x�q#��E~�2����#�L��mH�"f�=�m���vD�a��dQ�b�;JW>3�Z�_�J�-��J�9����*�o�=9�c��3$��}!�T3!�
����L&��-��h��$8�����q� '��v>,qF
cK8t��������e���ll�0����$1�m��%���bb.��:�I)��$���0vi�����iCX\5��?�t���DS`�4c$A�f�1{��P���Y0G#��bkJ)O���w����h>b#Y���b��=K����8��cb�U�dbML����b�g)Uf@t�������%c�&j ��~#���}*������~���
���<)�n!���T=a��5��cf���-���&V��\v�vL�%� �� =��=JW�KY��
�����B���O����_(tlf����z�2�Z�p|�LZ =~O��,�n+���yn���=�P��=h�������a�}O����*l�=E�E��� a���S����F��!e�����X*��4��� E$�4#�:�KF�[w
�(l�`%��/�!�9���,mXX��B��o9�� D���@����7����&�;z���B�S�of
4��y*�0m�����@b~f�����B&����S0��W����WJj���_�M^_L�~BOtBc(�3�s��=�����y���GR��(�y�?���y�-=>��������������o�X#x�F
�8�?P:B�q��r�4�o�M�%��F,����L7��6 �y�����f�Y��Ss�=%�������K�7L��g�B�?��Y�-�v���k�����!��2}��NA�
��p|��B5�_
����e�0�����+L��$�2�i��zd+�Y�h+�g7CY`)���P�>��3yu�J��^���*m��3��!����m,=����-)��|G!0 �8���m���B 3D��������I7�� u��3,Rxx�a�!q����0!���#� �W�r�)s��a�c��<w� ?�]C�{�Vo1o�[,����/������I�SX����-o��������
dBCd����S�$��|�����k"��bSLd����%���,7M1c��_���8�����A����1��>��'o0�sdf�D��$f��$�4�0Sd)�����TSmY#�����e�tCR���c��E��5b�����K���a���=e�V��(
���������|��C:���>-Y����f�j�N-���$s!�#J�O�`I����F�4�p��c;;L3����1�W3���%z[��{#8U�nI�U���r��g����6w�3�Gp"�Cd��S`F��9��~��(����e����l���z/������!`�E����tQ���q,E��:��C���$�3@���MY��SvY"�)�����Vu�����P����@b(3?/Z�`�|�=�R�O�@�o����KR�H}�`�jgY�z+��E�M�:���&����p�oI *�)�e��'�VR�V#�QL���v 2�4%�v�F�c�(�\�9WCj�G��CB�q�;L8*���
���%~3��8�`��z����,��%����HI�$0�N�Io��r��p(a��Mn��O%D���Tk�4b3*lZ(�fd�������5J~�������@%5"��V�Av�F5C(~�@i:��O�-� ��m�K)�5���Y���R�M2��X����Dq1�Azo�BNpo��xJ�/YAd�0#��CZlo���Uw���~����&�w����w5������k�ZK+�&z��S�g��7��v�||q���e����^���������q�R.����h������E��=��o�%��P?G�p�E10��5�9��YlS�`sw%dn}���.�h�n�(t�8s�3����?�/"�m����0}����6W�I��B{ %�L����`�%����kv����0���kc����Xo9��MC3j����I��hRF�Vu��^07�����~�N��=����������J����%�(o���C�m���Z-`�����]��`�7:R1lnK�\Z���������?��`�C��X��
d
��� ��I�y���U����T�d��� -�p|sk��{���d�\��
>�����~�������c��ML +�G����IEGE&E�c�H���k�yq4��?�MsW��i�h���#�j��4O�9�BK�`.�-Q���%9�D�����]_:�]��x�l���U��j�O%��y)���s�v����(I���7T�&#&�,����*�n"$��
��7[s��aR��Y�"�s���a��2��a��
rX����<B����@��.���i���@��~U�|�d��AZ��
�>�Mb���!��_R�;?9��t/lt����JJ���i(W�)9l9y�=P���w���s������LK�-�$��q���
dfewJ�C��b����z0-'g|����X+!�Q>Ul���]�}�2>��=S��gM;��^9�D�-� #�����Te�7�g���if� "�%�N�?A���pH�J���N�r�=]g���Y�<���<3��cD�m�'���8�1+*�m"�~f
+�Cz�9z��=r��@LijH,�������`
�C��F[w�>]H�h���cc�E����s���}8�E��wLKNtCd'��� ���i� �5�c�j]�A���\�O�"�r��bj��8&g����n��b?I@��#5��h��z�}�|�V�������(v�|�gkNG��������[�a���f�6kJ�@�l�&���I��"�w�Nu� u���X�&g�8��mj��+� @)Um�d�N�.��{[}�a�-
H���a9���K�eR&���-��^1U��c�����J##�(]��{~����C�5�~��W�{��)?�g�i��g��Wd�<=#����9i1 �L��}�\9jF������2��'�c�3��O�+�F�tLp7Oe�T��~�����)^/�95����eR,��
2K� ���e�lZ�v �D�q
��o�PI~�z)������S0�43sPj�����������(���@ �)������p�"|���%�}�S��2x6���~K���pr���a��vA������X�{�J��������6��P;�Oe q�f��X����^�j�~[��+1���.��^�eP�`{��������%������y�d�H�G����X)(7�m>C��h�t�OG9y��������b�����!��Pl��z2������/�?���h}n���$*�n��Z���u k��Ev�!pk0�P��������y��z�3XU��d�Y ��R�O�[��8&_L�=��)Mb%�9���f������4?����������1�(��Y�s^��h;��.����RinDx�� ���oi��(as3��j;������D����nf�
n��lOk-7����a3�e�~K�d���xT�!�����9�4���d�d%��� ������� ��e������B��ld�����\s;[1@<������(���J���n��}"Vn���f�l!��{�D#xs@���Te�;�����j
3F�,��`S�\�K�����������B!~�y���s�r vq�-O��?�����!���;�?H=��0>f��#���
#�]�nL^���yk�Gh��7�E.T����`�3����$��F&�����P�g`�W�VZ���`o�t
9:RH<�Pp�f�7�e����
F 0��������L�{�f�3�~��O�|�� ��E ��n����I8
�����tHs� �"����'�v�����0AA:���[^��)f�D�j�'�.��AB���M��<�a���B*�L &���������q#|v��3Z��Z7������,z�g(�� ��~T7���f���*����3�V���
z_��g��z�e�8�T�>z�!��}����{C��<���������t��[�Z�=p���Q�@3v'3����}�h��Ff6�.�{sn��?:w��g��j����Wf\9�l_���@�.:������Bo?c>@����*�(��c���O�Q�a.�>�
<\sA~o����0��=`��[j���% �~cfYn-��j��i;�_�u3&bc7[[�������G�?=�� Rc������rk)����3}r�\��"��f���~^��fAu:Z'9�G�q��H>q�N���PP��r?�nh�d ���X{�[��o�#;�����L�A���9������MY����b��>���I|�������:���b�[���Am
.��\���}W�'�`�0���<&��6'�����d
[Q��9Ds/�6�>�jSp��G�H�i����9��#/]|T�C��>Z�������?����?��I�O���fu>���c��zsS��������N���^��=b���� �%
'�3�����%F����|��a�L�����S����n������X�S������s��=�3�3T�E%���T0Ik%R�1�L��
O4�y-6��H,�����������b�{,��{V��5M������ M<t�3���#R�����4��M��Z* @�6���w�M��Ul���+���{}_�%v/����3%������f%�+`w�6`���E��#S�Hl�����>����2[x���s�2�.�s(��3��p}��rm�}9���I}Hg��c����a_�y�-b���O?���@��}�)��"g�<���������������5��^�D��g |����d��Q��`���&E �2g@�G�>������(h�0��9�IKf��n�?�m���2k�0�~�-6�f�l���,^y{�����u�4� o�&��<��e�7A-�O�!�m��VH�l=�=V�c+3��?���!�r>/�/��3{~�0�%�K���
z�6�����������`�U(�rG�S�e�H�m�CRg�>�)��$e�8��+�C^�>�a�t���Z����|��q*��t3'H���[�����sL�d�1���P��pZ1���9D�R�,(�fv:�@�����)G/3���PXd��eL���<k�����������&�=5���v�l�;m�o3��dnN{($lk��Sj#�Af�8Dg�%�8�T�Q��
ks�va�3� ����#���������`����[V��V8�2ZA�E��>�r|)�W�s�9.=6_�8��l��c�
pj��E<���KU����*�O�HE?o���"�����m0J�dY����}������4Qt�`���R�x�aa�1s��,��8�Hn-q��7b
SKC�'WE�2��3����8������?gm�����Gs���YoD��jcY��(�F`�
�p�k�%�3���������>`�C�r�oin+jwY���IG}<C�����`���;jT�l��p����;@���b�~���e�����[��/�������e�=*KRZ;�����O�*@����3���|~�/�yyJ�TC��Y&��j��� ���=��Lo2���)gf��JAK���aS��kfD�KZE�55b�e .��l���[3f!���N�y��V��d�$��^Iy������x������E^���;��ATX�`�vH�<>��� �?����-hG��E]YG�����u�M�1��C�&���In�`����^�HK��6;|�*�-l��������l;c�2{o%�����O��q�}]I D�h8�#=��<c6���n���>vh���y�x+,2�i�au
Tgh{�Tg������Q`y�\&�u� 4�a�)��B@��c�bc��������0����J|a�/���!O�����[V�������MD���Z��'/ei��-���Y2C8K3'qf��W&���A��Y��|�T�1g.�����s��:���� j^�������$�cUn����C���y�P�?c�������j�>����}�,�Q�>��x��v��0���y���.p_<[>�r������pjh]JG��&K�������� �M�f���]�fxFq �0��H}�� 5Pyf�w��6��}��.����������:��,\k���A�z�>g���S��IfOn3{�J�R~�t+[���: �5K�����[U��F;��N�� bT:���"���m��,����Ft86���V����7P��\V�����%�f�HX�QV
wl�x�Sc_�����%�,�<j 9�}���L�����w>h{�k�oW��}��y�����al����g�/M�����T�����_����c���+�K�WO�����V�����mN�c��U�a��C�#����U�o��)�m}�nP�����?]]��2@�bG��v]aGs���S�~'o���g�������o�/J?�g�����O���� ���{
�>��^�{Qx�t��M���������@����m��W����~���M<��d_��l��i����]�/�2c����Tx��Y�����op��a����}y}{��Q��������M�����7����c�W��}88��<�
\E���[�������Xry]�����=��(�����Va?����tv����0w|0�M <�� FR���������U������=;�c�V���o)�0W�]x�)��2nf��������o�Z�yx���������� >�g^^���oc�k��6������t�rU�P����7� V���V�����������a���k�������W�\�[x���m���"��
�k���?�#$�Yy���o����k��e&��@��G�G{;���b�~�HB��\;�z����������2>?���
ja~ ; �a���@��@m���y��M�� 2qU3�8�})4;�>������~��'��b|�>GT,ui6lP��^<4���>��Rs����wuC*�+�,x��:��cnW�com�����N���=����;�J��lL�;���f�f0�6�����^�r��iw�9��n����J�V� �������h�p������H�?�����'�*�^���b��^�A�����@�N�{����.��,��6i�H.M��HK�my`M�����$K�� W���^zf�jrxM��2M����:��B�&�FG�]Ktc����q���qD+*�� Vd�$}Ak���2M
p,������fi�*2�5f�fh�7qP6����� ����bCs�;R���W��O�/�v��>�hs���"��7���$� �'D�p��zG�N�{e�S n�O���a�1u0���S���4��uT�R�l��a�H�D������z��v`�?�O��X�a��@�Z����o ��|v�����^��(��U�
���W�O6�N��V�u���@Q�T@��J����|�B+������^�7�m�~�7>M��@L��D����;���Z�~���%9v�/
�����,���`<��� ��\��������H��
���[D����x��X
&*�r���]k����@>]�jur�M�pY�����0u%I�Kx)��fq� L&��?�m�|�,QXB��O;��_I��3��MY�"�gS��b���w���!��J�x2;��kO����Bn����z�:�>G������T�<�s� ��(e��+$�3��������2N�y�k���`"��K�W������� �)PoH��@�9��K�Jr6 �a��gQ�K�e���;q��n<�ds{�/aO�/i�q���R��b>4�_i0l~��ig8���:��u^�|�X�����uH������"�����w>��v��!���T��vU�ph�)��A4 �D}���lLk�/����������%Z�l�i���]�@.�MB��gf!�x�����!�f��M��x�Q_��#)�0C�j������Pv3��������3HC��,L�OU%�����+��3����g4�l`�%��������
y"���]�^��(��o����y�0��q�
� ���^��D�m^�V�I�tx����7Olog[I�a�o�+��:�`���Ba��"�
�h���ai�K��M������ 1J�m���BrE���&���g
�FzCmh][�����Pb(T���@���W ��j�3 :��ac:t��9�`�[
�f�f��E�� if����p��_*���6wF7�������@�LwNND��l
NM�Q�Z5����rLo�.�H:Nf�>cZu&��D}KI�k�l�?f�,��" ��Q�����1��4 r�[�-��$I~��y����@�onh��`�Q+(�R��������<a�:��TW�!���*Z�I����NU*\b!���T�H2�Z�
U�� ��jn���` ��S]8�!5v+����c��Je��]���`10��-mY��NByb��������8�4��g_8����1��iwR�yl���(S�I.������1x� 3���{���l�����?��8}���������|�(?/���>S��2�����9��cQm}����-s�Vl���l�����(F2������r�%}����q&X1���� ��]����O���Y�:'�f�D�R�$o9�S�$m������p�;�����aq��� +�m�x�K~��42Z��%�����n�{Bv+Y����*0m+B�����69��l��;��B[�3#����d����YO|7MSA�)=��~��;��C���=��3f��//Y�FB����7@ �� �n���l�����|�J_�})� �c�7l�\�����
�!7���Q�*��
��~p�n�h��I���>o����A��r����O�t*�������%�Yh�u��uQ��*�f��z���$!
G���l��&�T�L����\KZ�<1�['��2������T��_@�����;p�8V��!���};�������L��']6��b��A&>�G,��x
4���Mv���
���Uo�����*l+H�s����L�+w�c]&�1:$h��gC�{�~=R`# %VG��L�����DC���r@���\���?1�A�wHL�S�H�M$�K
�)K�}�o�S8������1
�5-��u�������>�>/N���FC���;�� �S�K)A��/����}���gVV�:�P�}��T��>��:��y�����*R��N�o%���u�0;N\�I�OmD�4�e����?c<R�����/���
���w�]H3]����z�i�t
Lr� �g�V��s���
*�z����!L��5]Y���=<Q����H�������Vf+.v����U�4�7�c6vs8AdS� .�[�0��
vq�rKgM����+}�����{��D�����@Y}��T�IE���d�}v3�����c)���I�z,yY�@�5�L���N,2S'������i"\Juq��L*x�C
�Dw�V��ZGX���2f��Kg�_8Jj��c� �j{������ {�}� qvE����/7���i��C'���0*\Tf������y��� 3:���KO�����k�Z����*A���_M�l���U�5�u��q��
0�%�s���`���~�K�1L1|����>'~�5{h�� �����M
��Q� �����QV���1����`FI��s� K�K6�w��2h�T��I��m����d�e������)@��g�����we���+"�,�^�PaV����c��ur���V��V������;83�c�g���I����C������w�\w���)�\�V���za%X��.�� bqPd�l�Dt�Rx c���
��I��2a�C���u���]{�1m-�/��O�jc��no�g��1� �<e��
����c��a �(/�>_��e�G`��:�R��1�=���#�x.��z�$o���z����*^���N�"���@����o�v�����v#%UY6�1�����&�~ ��#�5����-N�q��� ����|I��?p��+�#$��XJ��Ibo��?��q$���/�CV���X:]�%g�������wb{�-� ��v�P8��h��|�����6���jMv�K� gM{��6ju���� [�QdojD������R?���_�9!�tc�&�x��G���q���H&�q��m06����xj�Nk]g�#��
f���3 �Y����O6�29�2�
�������f�8�0�>���8R�C�*��\��"���b�p�yv�i��������MTWp���b�s�^�2#�_�l0^��9;�0=�^~��{����u�5��l�}'Z��k������t8����P���� ��Rq�{����9;Oa7��uG�JVB-�>�)tb;���������w�Sz�����@i9b����1� ������3E��G}�m��?���n�V������c��d%�X���-0.���/�<;$�0�hEL��E+�Q��K�Z�/�0�WH9!���*�&�����\�I8r�������YcHq/�'����9�j����c[��
�M:FW^���~��� �H��L�� �[���'�����h3�O+���p�N���G���/�.��U��p�Qx� ���i��t�8?G��eF�����C�� ����p�"\�5�(���9b�]i0S3��.b\��xE� �Bf�����`�Y^4�
z�����bF.O�9���3O��J�F� ���c�v���*�����/�>���U�l���N����W`Fw,��Br��d�������%��]�����(���b��z�Lxu��L�;%�/����r�2D%-g�pG�8rE
������@��A����r�Cv&��N���#uN�Y��
j��8��3n�>6Hsr ,�r����q�&�`�=�U��$�����kG�����QhE�����f%��
F��q�p�}Y~9���#u������c��FAG�?)���pV��_����}��'P�&C����&��'���;��w��(�j�}8�I^����������9=��W,�V1�C��t��<5����S���N��R�����=�����!�"�;�+U�����:kJ��[�!�a|�^k����F�BwK���C"\�J�|2@�E���t)z?SUPxs�Q���F{sJn�����[q]l28�K���2s�t�5�&4��v
hU6T,\�����3We��4$���R�/��x U�tbo'�D��4vEg��1:|Z����
N���FZ���h{3�M���3�
8?�>��~,�0U' nnU�e3�AC(tm
5����q�:�C�Z4'���my�p2��}8
�4j���rx������ c���<������N����|�&kR>�����Fi���SY�U�3n����`������� #�YO�7��.��!Y��4 �Yc�'���������!e�]�lR6Y
A��8������FuN�Q����������6M�����?5�^'�uH�|��n����i�G��hjJ��fK�����������^3El�Z+S�*�M�D������o����c^�s�P���������o����� S�s�k �nU���1���3LiQv�b����{5��X��u��8���j��E|�Ax�[��y<��_$5!#9�u�s8/���x����C�Q�����E*W+�����0�x�������������I��P������~��T�!b�c��}|��������6�"�����+t��'6�=P��u��-�&�Z=��GN(R�x�������+r;(���d���9U��X�'
�> �; �����y .Or�����Nw�������r�`�,���aa�*,��� ��{J��/�Bt��t�`)����q�4b�����}�{�?W@44t��_��OT@.���"@��5�
�
����l��@�36yZ[���n���5�m�p�HgELoYkhp[n�6���r��
kv'0b2S��^�c������<��`3
�4�+ �#���?����4�M���f�l���ID������6�p#�b�����i5��$�|'0�"�[����A������hN,-r�� M��r�*0�{��_
��������Z�K�������u�Z�s
:��u:�)�L���V��|g|��@����{\��wc�&FHJ���Q;�`��
JN��s(����v����+���e���K���k~>V9�#�%��f � �����@]gW�)�'��*��YmXA��t��)��x�{50q0���F��u���10��-�:�./��-�x�0����Y�I��"�� ���k-�\BY2��He��{7nW�E��
�J����!�9i��)4����n95I���C3_��GC�pv�d�NERc��`/��0�� ��X��8f����#oAE�]��Z1.K����#v�4w1pMnl�0��/��������Vr�V�iz���L8?�c�}]�Qt�]7�����
NC�lYsv��Dfp�5��'�<+�r5D���s�
xD����o�����a��;qG�5� k��<�������NZ�0��M`}�2��]V?��?���9�����
?����A�\K�����J@���-,���M���.���4[��d���!l�w:x8o�*}��"�-� l }CC�C3������j��g��l�f ��].��(�9������gr����y��h*p���}��Q^�l�A�y�u�����DJ��<�
kI,���:�d ����g�<��6M��Y��>�,�� -a��W`�k��������������V��?���c���Z���G?�gZ#,���.�=�^�O�Ud���� �&�������kH���^���B��2�{-�]^��2���@A��/��=����Q���\����h��D:Du�����9�=�2���AOT�Np ��X7�������cvH�/e����_��n3R���t���9������,�RV�#kd�NL�
\2o��tG\�������i��=������S������;��Z��$0=bS���H�����4p���=�_������J����0P���
���4���7d��T�Z�)�Jx��0�_����
�?���T���Nr�������$�����3����:� _?#a�3�a�����T��^e�Z��M�-�����e�6��K�%���X�������|�^����*�����!�4���Dq��� �7��M�T8�q�}]�#2��4�e ��������npP����5�����NEA��b�k�'v��J9�#C��_�v7���%{����>a7k}z����:�f'��TP�Yc�n&Rp�l��nx�G�+�����[P�����{��@{gN�������_<�����T�1���:}���-�K�q�v�s�=:���lU�����w�O=1����0���Tm����KZ%���Pn�?R�X����L@!��{?�I�KC<0���0����5t�Nxx�L�,��]ktH�1���|~�AW`8v'o��<���?��@n[�_���W��H���d�R|b��C���<�C��c�����k�F�2Y�s@�5e�����h�px��Vo)����E��&�������]��5�C��:�������������u$�2�fE4�|
tw��T|a�j����e����P��m%������ghr��kY�WDMW�D.����h:V�t�� ���H���A�����U mp�>y�FwV#��)$h�%6��Z�!���:@L���,xGPx#�Y���j��P�����b����������Rh��� ����]��@L0+Hu%��K;�:�.+p�&)�HI�=�
�2���^RA�]U�����[2v��m�=��-�������8���e������H�&�
o5I�\&�n�������7Y�w+V#�)G�J��
4���Qn��
���� u0�+�&�`��/��`Y��:�x���#X����H�a ����>-���o#��C>��@g�$��C�O�S'����;��
���r/��hp��Qe��&KHV��]��������DG|�L�=�.��5��i~������FQ�zl;��T�%b���P� bS2������)��� ��9�����)e�����ht�:N�����\����r<}�h�����w�>o6mG4�b��c��8I�7�1m��&{���vu�}����� ;�x�i7�{�h<��~"��-�f��bO"n�\+�t����[K0(���8���;�y+�����K3��i#��c�YL��ru�8�c��p�4�z �NnpC�] D��%�:e��#�F�`m"��K��<~���h�&�s����w�1{��D�0�"�z�@ ��/<}
!G���l(�����^���%i�( ��k����)���Z�T��]V�6������n��`C�mMs#������a|��%��@�C���*�*��x1����>����M�������@�r����t@:������8����i�]OX����.[$���D�}������t_z�My�.�.����E0��m�KL�hk}jv+0����.�P\��-���u���I�}O6�# $�xh]�aPO<�pH�|�����[�������&9��qi%���V��0Y��qD�������[�`�Im^O�]Z��>����-��v��"���'�����������
���v������p�Q���4i;������[b�i�`��Z4��\�N���}��[��Y�$���@T�aLp&eo:���%���A�k���w��W�|Biw`�bx�i��!�^����U�����
�i2k���
n�ns�M�^�C���=��Z^ "(K��u�Y������������&E$w�?X���/nqe�.��i���
�%���u��b�z
�t<�>IX��>i��2k�+,f��M��n���-�J�XQ�X)�Q��{M���<���2������E?���^!��D_<pF�r-��7/�������8��nO(%��������?%�n ���������^lRf3u���a*�����s��2�f��������rl�BQ��N�'U��)*���3�WkPLs��8�X8�;��R��<e=�(������CJz�G����5x��F/������Y8 1��\J�!
��hfb���|_'t���F�f��
�*����*�308@�v��U����u�FJ�4��`)Cw�i`� -�� n��������@��C��U��F�$�![���F�z�r*Q[H�F�P=�}S�ao�s�������%[�_VB���}?��]w��A��b�=�3��.;�eC�N �
u�M0��Ph:�Qi���{���j���.��b������DZd���7�������N_Q��b���Tc���C9��Z(y�����`6-��3�� �Uy�Vl�@��QJq��F��3H����"_�z]:�����qZ��A*�����?������l���'����E.~��:���� ����q�K*a$�:&����9�VT%$�^R"! ��|�'3.D�pLr�a�-���-�S�M�x���#�?��C
��*�$<p��H@�]�$"��h�<�M���Wf����Y#�����1ynK�8���fT��R�;r�<���uJ�1�2e` �Wc�����$�
�$@��.�|��U������u����1;{�s'�����m���Wl��3��j���,�� 4��A�������[(%AYJ����C�*i�u!-�8 [�8I��� �� -�J1 G$7���JI�K���1X*��;"�o������xS�����,��;���V�>v|.����U�����@?���t^���i���e@�{Vl�]�#���� �����<Z���p���3�$iL�5�u)�o:E��m16<@���}��%7>�.��fo��o)�Y��ES�����eo�=�KR&�^�������c�9=@��K�]�8O�=G5Q����P4 �F]8*���RQ�8���,3�%#�M^RC&7�v�T�#d,%xd �/������2�W\���GO7tuA������������o��� \��P��%�kby��ws<�_������Qo�����d�Q]���<d �uHnW�-�U~� F"�pBD$pVD<_��iQ������h�dk���UafY��}�H� �7��?'�0U�)3�h�<�&���Q�}�D����l�A�z��'��d�}���������A���#��u�,�:H{ ���W�c���#je�B;X+�C�u�I.��k.���?��L�#9|���gd�Y��c��d���������.�=���*���h�����}M^� (wL9Kx�m�-o^Ab�����U{���r�b��5s��=����*���y��
`���\���|S�+$���N�1z1�� e�=�s����/(BGu�'���Z1�zrE�h��p6��W���:�{��r���]�\�D%KGS���c{����*�s�N�@+}�[���.��P�id&q7/I}�69�J���G8i�v�4zg�5D��M^�.H�V�&����b���^y����]q"����@����-�������nC���K�u{���C������L�+7����B�aZq���g��>�V�<0���6K���X}����t����||�������/�/���qS7(������M�<^#�_N
$n�]J{��������<>���s���:���� G�S�-P�2�`���i�%kx������+>6(����<f�}��i�C�������}m�\�X���O����@�s�����A���7�M����,����u����������=���[9������ ����v��\�m��3��>���0���#`]��,q���i�M���XVoQ���O����pd;@�_���;�S�2 ������,1�Mk��J�]]>`�m�JUpcZ�������������%���K�
;q0����MK�@K� 2� af2:�B"-��_�X_�4m�_��.�
l��_+��S�5j���r�+n6Z�e}�<>&/m
u$g�����,�=����)��c^+�vX� �vp��Ze�=�m����3��}S��l��N�,�~��\��Z >5@���d����_W���R��:�ErS�ij���m^��-N��� H�yO�.���}�|�cA?�g8�*r|h
�����jjsM��/@�!��CxU ��v4`�%D���Qg��i���;�l9�8������`Z�A�`(�#��L��p�Z���t�8���L#,v�v]tD)'e�9+]�A�E�N?$1�:����j��w�xH������<�\3� J�H$����+=�E�yM!pH�����*�t�{8!&M����cI���h ��gI����9@��m�"��r����ro�$iOv
W����>6�&<