Gather performance analysis
Hi,
I have been analyzing the performance of sending tuples from workers to
the Gather node through the tuple queue. In the past there were many
off-list discussions around this area; the main point is that when
"shm_mq" was implemented, it was perhaps one of the best ways to do this
at the time. But now we have other choices, like DSA for allocating
shared memory on demand, and shared temporary files for a non-blocking
tuple queue.
So my motivation for looking into this area is that now we have more
flexible alternatives, so can we use them to make Gather faster, and if
so:
1. Can we actually reduce the tuple transfer cost and enable parallelism
in more cases by reducing parallel_tuple_cost?
2. Can we use the tuple queue in more places, e.g., to implement a
redistribute operator where we need to transfer data between the workers?
IMHO, for #1 it will be good enough if we can make the tuple transfer
faster, but for #2 we will have to provide a) faster tuple transfer,
because then we will have to transfer tuples between the workers as well,
and b) an infinite non-blocking tuple queue (maybe using a shared temp
file), so that there is no deadlock while workers are redistributing
tuples to each other.
So I have done some quick performance tests and analysis using perf, and
some experiments with small prototypes targeting the different sets of
problems.
--Setup
SET parallel_tuple_cost TO 0; -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;
Perf analysis: Gather Node
- 43.57% shm_mq_receive
- 78.94% shm_mq_receive_bytes
- 91.27% pg_atomic_read_u64
- pg_atomic_read_u64_impl
- apic_timer_interrupt
smp_apic_timer_interrupt
Perf analysis: Worker Node
- 99.14% shm_mq_sendv
- 74.10% shm_mq_send_bytes
+ 42.35% shm_mq_inc_bytes_written
- 32.56% pg_atomic_read_u64
- pg_atomic_read_u64_impl
- 86.27% apic_timer_interrupt
+ 17.93% WaitLatch
From the perf results and also from code analysis, I can think of two
main problems here:
1. Synchronization between the worker and the Gather node: just to track
the bytes written and read, they need to do at least 2-3 atomic operations
for each tuple, and I think that carries a huge penalty due to a) frequent
cache line invalidation and b) the sheer number of atomic operations (see
the sketch below).
2. If the tuple queue is full, the worker might need to wait for the
Gather node to consume tuples.
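To make problem #1 concrete, here is a condensed sketch of the sender's
per-message hot path on head (simplified from shm_mq_sendv() /
shm_mq_send_bytes(); not a verbatim excerpt):

/*
 * Condensed sketch of the per-message work on head.  Every message
 * touches the shared counters, so the sender's and receiver's cache
 * lines bounce back and forth continuously.
 */
rb = pg_atomic_read_u64(&mq->mq_bytes_read);     /* shared read #1 */
wb = pg_atomic_read_u64(&mq->mq_bytes_written);  /* shared read #2 */
/* ... copy the tuple bytes into mq->mq_ring ... */
shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow)); /* shared write #3 */
SetLatch(&mq->mq_receiver->procLatch);           /* wake the receiver */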
Experiment #1:
As part of this experiment, I have modified the sender to keep local
copies of "mq_bytes_read" and "mq_bytes_written" in the local mqh handle,
so that we don't need to frequently read/write these cache-contended
shared memory variables. Now we only read/write the shared memory in the
following cases:
1) If the number of available bytes is not enough to send the tuple, read
the updated value of bytes read and also inform the reader about the new
writes.
2) After every 4kB written, update the shared memory variable and inform
the reader (sketched just after this list).
3) On detach, flush any remaining data.
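The threshold flush in condition (2) looks roughly like this (a minimal
sketch of the prototype's logic, using the handle fields introduced by
the POC; the actual patch follows later in this thread):

mqh->mqh_bytes_written += MAXALIGN(sendnow);   /* local bookkeeping only */
if (mqh->mqh_bytes_written - mqh->mqh_last_sent_bytes > 4096)
{
    /* crossed the 4kB batch threshold: publish and wake the reader */
    pg_atomic_write_u64(&mq->mq_bytes_written, mqh->mqh_bytes_written);
    mqh->mqh_last_sent_bytes = mqh->mqh_bytes_written;
    SetLatch(&mq->mq_receiver->procLatch);
}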
Machine information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms
2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms
3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms
Observations:
- As expected, the results show that forcing parallelism (by reducing
parallel_tuple_cost) drastically hurts performance.
- But in the same scenario, with the patch, we see a huge gain of ~40%.
- Even if we compare it with the non-parallel plan, we gain ~25%.
- With this, I think we can conclude that there is huge potential for
improvement if we communicate the tuples in batches. 1) One simple
approach is what I used in my experiment; I think we can do a similar
optimization in the reader as well: instead of reading bytes_written from
shared memory every time, remember the previous value and read the
updated value back from shared memory only once that has been exhausted.
2) Instead of copying the whole tuple into the tuple queue, we could
store the dsa_pointers of a batch of tuples; I think Thomas Munro also
suggested a similar approach to Robert (I got to know this in an off-list
discussion with Robert).
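To illustrate idea (2), here is a minimal sketch of what the sender side
could look like, assuming a dsa_area ("area") has already been created
and attached by both sides (this is just an illustration, not a
worked-out design):

/*
 * Hypothetical sketch: allocate the tuple in a DSA area and pass only
 * its dsa_pointer through the queue.  The receiver would call
 * dsa_get_address() on the pointer and dsa_free() when done.
 */
static shm_mq_result
send_tuple_via_dsa(shm_mq_handle *queue, dsa_area *area, MinimalTuple tuple)
{
    dsa_pointer dp = dsa_allocate(area, tuple->t_len);

    memcpy(dsa_get_address(area, dp), tuple, tuple->t_len);

    /* The queue now carries only sizeof(dsa_pointer) bytes per tuple. */
    return shm_mq_send(queue, sizeof(dp), &dp, false);
}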
Experiment #2: See the behavior of increasing the parallel tuple queue
size on head.
(For this I created a small patch to make the parallel tuple queue size
configurable.)
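(A guc.c entry along the following lines would be one way to do that; the
GUC name, default, and limits here are my assumptions, not the actual
patch:)

/* In the ConfigureNamesInt[] table in src/backend/utils/misc/guc.c: */
{
    {"parallel_tuple_queue_size", PGC_USERSET, RESOURCES_MEM,
        gettext_noop("Sets the size of each worker-to-leader tuple queue."),
        NULL,
        GUC_UNIT_KB
    },
    &parallel_tuple_queue_size,     /* assumed int variable, in kB */
    64, 64, MAX_KILOBYTES,          /* boot value, min, max */
    NULL, NULL, NULL                /* check/assign/show hooks */
},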
-- Results
4 WORKERS (tup_queue size= 64kB) : 38337.046 ms
4 WORKERS (tup_queue size= 1MB) : 36186.883 ms
4 WORKERS (tup_queue size= 4MB) : 36252.740 ms
8 WORKERS (tup_queue size= 64kB) : 42296.731 ms
8 WORKERS (tup_queue size= 1MB) : 37403.872 ms
8 WORKERS (tup_queue size= 4MB) : 39184.319 ms
16 WORKERS (tup_queue size= 64kB) : 42726.139 ms
16 WORKERS (tup_queue size= 1MB) : 36219.975 ms
16 WORKERS (tup_queue size= 4MB) : 39117.109 ms
Observations:
- There are some gains from increasing the tuple queue size, but they top
out around 1MB; I even tried with more data, but the gain is not linear
and performance starts to drop after 4MB.
- If I apply both the Experiment #1 and Experiment #2 patches together,
we can further reduce the execution time to 20963.539 ms (with 4 workers
and a 4MB tuple queue size).
Conclusion:
From the above experiments:
1) I see huge potential in the first idea, so maybe we can do more
experiments based on the prototype implemented there, extend the same
approach to the reader, and also try out the dsa_pointers idea.
2) With the second idea of increasing the tuple queue size, I see some
benefit, but it does not scale, so for now there is not much point in
pursuing this direction. However, if we want to implement a redistribute
operator in the future, an infinite tuple queue (maybe using temp files)
will be a must for avoiding deadlock.
Note: The POC patches are not attached; I will send them after some more
experiments and cleanup. Maybe I will try to optimize the reader part as
well before sending them.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 6, 2021 at 2:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Experiment #1:
As part of this experiment, I have modified the sender to keep local
copies of "mq_bytes_read" and "mq_bytes_written" in the local mqh handle,
so that we don't need to frequently read/write these cache-contended
shared memory variables. Now we only read/write the shared memory in the
following cases:
1) If the number of available bytes is not enough to send the tuple, read
the updated value of bytes read and also inform the reader about the new
writes.
2) After every 4kB written, update the shared memory variable and inform
the reader.
3) On detach, flush any remaining data.
...
Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms
2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms
3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms
Here is the POC patch for the same. Apart from this extreme case, I am
able to see improvement with this patch for normal parallel queries as
well.
Next, I will perform some more tests with different sets of queries to
see the improvements and post the results. I will also try to optimize
the reader along similar lines.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
POC-0001-Optimize-shm_mq_send_bytes.patch (text/x-patch)
From 4d19beb74e91a8259fcfd8da7ac8b1395c5f5c6d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 4 Aug 2021 16:51:01 +0530
Subject: [PATCH] Optimize shm_mq_send_bytes
Instead of frequently updating the bytes written in the shared memory,
only update it when it crosses some limit (4k). This avoids frequent
cache invalidations, atomic operations, and SetLatch() calls.
---
src/backend/access/transam/parallel.c | 4 +-
src/backend/executor/execParallel.c | 2 +-
src/backend/storage/ipc/shm_mq.c | 87 ++++++++++++++++++++++-------------
src/include/storage/shm_mq.h | 2 +-
src/test/modules/test_shm_mq/setup.c | 2 +-
5 files changed, 59 insertions(+), 38 deletions(-)
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 3550ef1..2571807 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -431,7 +431,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
shm_mq *mq;
start = error_queue_space + i * PARALLEL_ERROR_QUEUE_SIZE;
- mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE);
+ mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE, 0);
shm_mq_set_receiver(mq, MyProc);
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
@@ -497,7 +497,7 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
shm_mq *mq;
start = error_queue_space + i * PARALLEL_ERROR_QUEUE_SIZE;
- mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE);
+ mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE, 0);
shm_mq_set_receiver(mq, MyProc);
pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f9dd5fc..373f920 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -567,7 +567,7 @@ ExecParallelSetupTupleQueues(ParallelContext *pcxt, bool reinitialize)
mq = shm_mq_create(tqueuespace +
((Size) i) * PARALLEL_TUPLE_QUEUE_SIZE,
- (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+ (Size) PARALLEL_TUPLE_QUEUE_SIZE, 4096);
shm_mq_set_receiver(mq, MyProc);
responseq[i] = shm_mq_attach(mq, pcxt->seg, NULL);
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 91a7093..eab7bcc 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -53,6 +53,10 @@
* mq_ring_size and mq_ring_offset never change after initialization, and
* can therefore be read without the lock.
*
+ * After every mq_min_send_size bytes are written, the sender updates the
+ * shared memory value mq_bytes_written, to avoid frequent cache
+ * invalidations and atomic operations.
+ *
* Importantly, mq_ring can be safely read and written without a lock.
* At any given time, the difference between mq_bytes_read and
* mq_bytes_written defines the number of bytes within mq_ring that contain
@@ -77,6 +81,7 @@ struct shm_mq
pg_atomic_uint64 mq_bytes_read;
pg_atomic_uint64 mq_bytes_written;
Size mq_ring_size;
+ Size mq_min_send_size;
bool mq_detached;
uint8 mq_ring_offset;
char mq_ring[FLEXIBLE_ARRAY_MEMBER];
@@ -139,6 +144,9 @@ struct shm_mq_handle
Size mqh_consume_pending;
Size mqh_partial_bytes;
Size mqh_expected_bytes;
+ uint64 mqh_last_sent_bytes;
+ uint64 mqh_bytes_read;
+ uint64 mqh_bytes_written;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;
MemoryContext mqh_context;
@@ -155,7 +163,6 @@ static bool shm_mq_counterparty_gone(shm_mq *mq,
static bool shm_mq_wait_internal(shm_mq *mq, PGPROC **ptr,
BackgroundWorkerHandle *handle);
static void shm_mq_inc_bytes_read(shm_mq *mq, Size n);
-static void shm_mq_inc_bytes_written(shm_mq *mq, Size n);
static void shm_mq_detach_callback(dsm_segment *seg, Datum arg);
/* Minimum queue size is enough for header and at least one chunk of data. */
@@ -168,7 +175,7 @@ MAXALIGN(offsetof(shm_mq, mq_ring)) + MAXIMUM_ALIGNOF;
* Initialize a new shared message queue.
*/
shm_mq *
-shm_mq_create(void *address, Size size)
+shm_mq_create(void *address, Size size, Size min_send_size)
{
shm_mq *mq = address;
Size data_offset = MAXALIGN(offsetof(shm_mq, mq_ring));
@@ -188,6 +195,7 @@ shm_mq_create(void *address, Size size)
mq->mq_ring_size = size - data_offset;
mq->mq_detached = false;
mq->mq_ring_offset = data_offset - offsetof(shm_mq, mq_ring);
+ mq->mq_min_send_size = min_send_size;
return mq;
}
@@ -297,6 +305,9 @@ shm_mq_attach(shm_mq *mq, dsm_segment *seg, BackgroundWorkerHandle *handle)
mqh->mqh_length_word_complete = false;
mqh->mqh_counterparty_attached = false;
mqh->mqh_context = CurrentMemoryContext;
+ mqh->mqh_bytes_read = 0;
+ mqh->mqh_bytes_written = 0;
+ mqh->mqh_last_sent_bytes = 0;
if (seg != NULL)
on_dsm_detach(seg, shm_mq_detach_callback, PointerGetDatum(mq));
@@ -518,8 +529,17 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
mqh->mqh_counterparty_attached = true;
}
- /* Notify receiver of the newly-written data, and return. */
- SetLatch(&receiver->procLatch);
+ /*
+ * Notify receiver of the newly-written data, only if we have written
+ * enough data.
+ */
+ if (mqh->mqh_bytes_written - mqh->mqh_last_sent_bytes > mq->mq_min_send_size)
+ {
+ pg_atomic_write_u64(&mq->mq_bytes_written, mqh->mqh_bytes_written);
+ mqh->mqh_last_sent_bytes = mqh->mqh_bytes_written;
+ SetLatch(&receiver->procLatch);
+ }
+
return SHM_MQ_SUCCESS;
}
@@ -816,6 +836,16 @@ shm_mq_wait_for_attach(shm_mq_handle *mqh)
void
shm_mq_detach(shm_mq_handle *mqh)
{
+
+ if (mqh->mqh_queue->mq_min_send_size > 0 &&
+ mqh->mqh_bytes_written > mqh->mqh_last_sent_bytes)
+ {
+ pg_atomic_write_u64(&mqh->mqh_queue->mq_bytes_written,
+ mqh->mqh_bytes_written);
+ mqh->mqh_last_sent_bytes = mqh->mqh_bytes_written;
+ SetLatch(&mqh->mqh_queue->mq_receiver->procLatch);
+ }
+
/* Notify counterparty that we're outta here. */
shm_mq_detach_internal(mqh->mqh_queue);
@@ -886,15 +916,24 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
uint64 used;
Size ringsize = mq->mq_ring_size;
Size available;
-
while (sent < nbytes)
{
uint64 rb;
uint64 wb;
/* Compute number of ring buffer bytes used and available. */
- rb = pg_atomic_read_u64(&mq->mq_bytes_read);
- wb = pg_atomic_read_u64(&mq->mq_bytes_written);
+ wb = mqh->mqh_bytes_written;
+ rb = mqh->mqh_bytes_read;
+
+ /*
+ * If, based on the local value of mqh_bytes_read, we don't have
+ * enough space in the ring buffer, then read the latest value from the
+ * shared memory. Avoiding reading it from shared memory every time
+ * reduces the atomic operations as well as the cache misses.
+ */
+ if ((ringsize - (wb-rb)) < nbytes)
+ mqh->mqh_bytes_read = rb = pg_atomic_read_u64(&mq->mq_bytes_read);
+
Assert(wb >= rb);
used = wb - rb;
Assert(used <= ringsize);
@@ -957,6 +996,9 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
* Therefore, we can read it without acquiring the spinlock.
*/
Assert(mqh->mqh_counterparty_attached);
+ pg_atomic_write_u64(&mq->mq_bytes_written, mqh->mqh_bytes_written);
+ mqh->mqh_last_sent_bytes = mqh->mqh_bytes_written;
+
SetLatch(&mq->mq_receiver->procLatch);
/* Skip manipulation of our latch if nowait = true. */
@@ -1009,13 +1051,14 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
* MAXIMUM_ALIGNOF, and each read is as well.
*/
Assert(sent == nbytes || sendnow == MAXALIGN(sendnow));
- shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
/*
- * For efficiency, we don't set the reader's latch here. We'll do
- * that only when the buffer fills up or after writing an entire
- * message.
+ * For efficiency, we don't update the bytes written in the shared
+ * memory and also don't set the reader's latch here. We'll do
+ * that only when the buffer fills up or after writing more than
+ * 4k of data.
*/
+ mqh->mqh_bytes_written += MAXALIGN(sendnow);
}
}
@@ -1253,28 +1296,6 @@ shm_mq_inc_bytes_read(shm_mq *mq, Size n)
SetLatch(&sender->procLatch);
}
-/*
- * Increment the number of bytes written.
- */
-static void
-shm_mq_inc_bytes_written(shm_mq *mq, Size n)
-{
- /*
- * Separate prior reads of mq_ring from the write of mq_bytes_written
- * which we're about to do. Pairs with the read barrier found in
- * shm_mq_receive_bytes.
- */
- pg_write_barrier();
-
- /*
- * There's no need to use pg_atomic_fetch_add_u64 here, because nobody
- * else can be changing this value. This method avoids taking the bus
- * lock unnecessarily.
- */
- pg_atomic_write_u64(&mq->mq_bytes_written,
- pg_atomic_read_u64(&mq->mq_bytes_written) + n);
-}
-
/* Shim for on_dsm_detach callback. */
static void
shm_mq_detach_callback(dsm_segment *seg, Datum arg)
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index e693f3f..58a34a9 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -47,7 +47,7 @@ typedef enum
* or written, but they need not be set by the same process. Each must be
* set exactly once.
*/
-extern shm_mq *shm_mq_create(void *address, Size size);
+extern shm_mq *shm_mq_create(void *address, Size size, Size min_send_size);
extern void shm_mq_set_receiver(shm_mq *mq, PGPROC *);
extern void shm_mq_set_sender(shm_mq *mq, PGPROC *);
diff --git a/src/test/modules/test_shm_mq/setup.c b/src/test/modules/test_shm_mq/setup.c
index e05e97c..0bbc7cb 100644
--- a/src/test/modules/test_shm_mq/setup.c
+++ b/src/test/modules/test_shm_mq/setup.c
@@ -143,7 +143,7 @@ setup_dynamic_shared_memory(int64 queue_size, int nworkers,
shm_mq *mq;
mq = shm_mq_create(shm_toc_allocate(toc, (Size) queue_size),
- (Size) queue_size);
+ (Size) queue_size, 0);
shm_toc_insert(toc, i + 1, mq);
if (i == 0)
--
1.8.3.1
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms
2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms
3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms
This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.
But what this suggests is that parallelism gained at least 8 seconds
(the patched parallel run at ~23.6s beats the ~31.6s non-parallel run by
that much), and communication overhead lost at least 14 seconds (the
unpatched parallel run at ~37.5s is that much slower than the ~23.6s
patched one). In fact...
- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)
...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms
2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms
3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms
This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.
But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...
Right, good observation.
- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)
...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.
Yes
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.
In one of my experiments [Test1] I have noticed that even on head the
forced parallel plan is significantly faster compared to the non-parallel
plan, but with the patch it is even better. The point is that even now
there might be some cases where forced parallel plans are faster, but we
are not sure whether we can reduce parallel_tuple_cost or not. With the
patch, however, it is certain that the parallel tuple queue is faster
than what we have now, so I agree we should consider reducing
parallel_tuple_cost after this patch.
Additionally, I've done some more experiments with artificial workloads, as
well as workloads where the parallel plan is selected by default, and in
all cases I've seen a significant improvement. The gain is directly
proportional to the load on the tuple queue, as expected.
Test1: (Workers return all tuples, but only a few tuples are returned to the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from
generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;
Target Query: SELECT random() FROM t GROUP BY a;
Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
20% gain compared to forced parallelism, 45% gain compared to the default plan.
Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000)
as i;
set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms
8 to 10% gain compared to the default parallel plan.
I have done some cleanup in the patch, and I will add this to the
September commitfest.
I am planning to do further testing to identify the optimal batch size
for different workloads. With the above workload I am seeing similar
results with batch sizes from 4kB to 16kB (1/4 of the ring size), so in
the attached patch I have kept it at 1/4 of the ring size. We might
change that based on more analysis and testing.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch (text/x-patch)
From 1142c46eaa3dcadc4bb28133ce873476ba3067c4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 4 Aug 2021 16:51:01 +0530
Subject: [PATCH v1] Optimize parallel tuple send (shm_mq_send_bytes)
Do not update shm_mq's mq_bytes_written until we have written
an amount of data greater than 1/4th of the ring size. This
will prevent frequent CPU cache misses, and it will also avoid
frequent SetLatch() calls, which are quite expensive.
---
src/backend/executor/tqueue.c | 2 +-
src/backend/libpq/pqmq.c | 7 +++-
src/backend/storage/ipc/shm_mq.c | 65 +++++++++++++++++++++++++++++------
src/include/storage/shm_mq.h | 8 +++--
src/test/modules/test_shm_mq/test.c | 7 ++--
src/test/modules/test_shm_mq/worker.c | 2 +-
6 files changed, 72 insertions(+), 19 deletions(-)
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 7af9fbe..eb0cbd7 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -60,7 +60,7 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
/* Send the tuple itself. */
tuple = ExecFetchSlotMinimalTuple(slot, &should_free);
- result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);
+ result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false, false);
if (should_free)
pfree(tuple);
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index d1a1f47..846494b 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -154,7 +154,12 @@ mq_putmessage(char msgtype, const char *s, size_t len)
for (;;)
{
- result = shm_mq_sendv(pq_mq_handle, iov, 2, true);
+ /*
+ * Immediately notify the receiver by passing force_flush as true so
+ * that the shared memory value is updated before we send the parallel
+ * message signal right after this.
+ */
+ result = shm_mq_sendv(pq_mq_handle, iov, 2, true, true);
if (pq_mq_parallel_leader_pid != 0)
SendProcSignal(pq_mq_parallel_leader_pid,
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 91a7093..3e1781c 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -120,6 +120,12 @@ struct shm_mq
* message itself, and mqh_expected_bytes - which is used only for reads -
* tracks the expected total size of the payload.
*
+ * mqh_send_pending is the number of bytes written to the queue but not
+ * yet updated in the shared memory. We will not update it until the written
+ * data is 1/4th of the ring size or the tuple queue is full. This will
+ * prevent frequent CPU cache misses, and it will also avoid frequent
+ * SetLatch() calls, which are quite expensive.
+ *
* mqh_counterparty_attached tracks whether we know the counterparty to have
* attached to the queue at some previous point. This lets us avoid some
* mutex acquisitions.
@@ -139,6 +145,7 @@ struct shm_mq_handle
Size mqh_consume_pending;
Size mqh_partial_bytes;
Size mqh_expected_bytes;
+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;
MemoryContext mqh_context;
@@ -294,6 +301,7 @@ shm_mq_attach(shm_mq *mq, dsm_segment *seg, BackgroundWorkerHandle *handle)
mqh->mqh_consume_pending = 0;
mqh->mqh_partial_bytes = 0;
mqh->mqh_expected_bytes = 0;
+ mqh->mqh_send_pending = 0;
mqh->mqh_length_word_complete = false;
mqh->mqh_counterparty_attached = false;
mqh->mqh_context = CurrentMemoryContext;
@@ -317,16 +325,22 @@ shm_mq_set_handle(shm_mq_handle *mqh, BackgroundWorkerHandle *handle)
/*
* Write a message into a shared message queue.
+ *
+ * When force_flush = true, we immediately update the shm_mq's mq_bytes_written
+ * and notify the receiver if it is already attached. Otherwise, we don't
+ * update it until we have written an amount of data greater than 1/4th of the
+ * ring size.
*/
shm_mq_result
-shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
+shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait,
+ bool force_flush)
{
shm_mq_iovec iov;
iov.data = data;
iov.len = nbytes;
- return shm_mq_sendv(mqh, &iov, 1, nowait);
+ return shm_mq_sendv(mqh, &iov, 1, nowait, force_flush);
}
/*
@@ -343,9 +357,12 @@ shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
* arguments, each time the process latch is set. (Once begun, the sending
* of a message cannot be aborted except by detaching from the queue; changing
* the length or payload will corrupt the queue.)
+ *
+ * For force_flush, refer to the comments atop the shm_mq_send() interface.
*/
shm_mq_result
-shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
+shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
+ bool force_flush)
{
shm_mq_result res;
shm_mq *mq = mqh->mqh_queue;
@@ -518,8 +535,19 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
mqh->mqh_counterparty_attached = true;
}
- /* Notify receiver of the newly-written data, and return. */
- SetLatch(&receiver->procLatch);
+ /*
+ * If we have written more than 1/4 of the ring or the caller has
+ * requested force flush, mark it as written in shared memory and notify
+ * the receiver. For more detail, refer to the comments atop the
+ * shm_mq_handle structure.
+ */
+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
+ {
+ shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+ SetLatch(&receiver->procLatch);
+ mqh->mqh_send_pending = 0;
+ }
+
return SHM_MQ_SUCCESS;
}
@@ -816,6 +844,13 @@ shm_mq_wait_for_attach(shm_mq_handle *mqh)
void
shm_mq_detach(shm_mq_handle *mqh)
{
+ /* Before detaching, notify the receiver of any already-written data. */
+ if (mqh->mqh_send_pending > 0)
+ {
+ shm_mq_inc_bytes_written(mqh->mqh_queue, mqh->mqh_send_pending);
+ mqh->mqh_send_pending = 0;
+ }
+
/* Notify counterparty that we're outta here. */
shm_mq_detach_internal(mqh->mqh_queue);
@@ -894,7 +929,7 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
/* Compute number of ring buffer bytes used and available. */
rb = pg_atomic_read_u64(&mq->mq_bytes_read);
- wb = pg_atomic_read_u64(&mq->mq_bytes_written);
+ wb = pg_atomic_read_u64(&mq->mq_bytes_written) + mqh->mqh_send_pending;
Assert(wb >= rb);
used = wb - rb;
Assert(used <= ringsize);
@@ -951,6 +986,9 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
}
else if (available == 0)
{
+ /* Update the pending send bytes in the shared memory. */
+ shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+
/*
* Since mq->mqh_counterparty_attached is known to be true at this
* point, mq_receiver has been set, and it can't change once set.
@@ -959,6 +997,12 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
Assert(mqh->mqh_counterparty_attached);
SetLatch(&mq->mq_receiver->procLatch);
+ /*
+ * We have just updated the mqh_send_pending bytes in the shared
+ * memory so reset it.
+ */
+ mqh->mqh_send_pending = 0;
+
/* Skip manipulation of our latch if nowait = true. */
if (nowait)
{
@@ -1009,13 +1053,14 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
* MAXIMUM_ALIGNOF, and each read is as well.
*/
Assert(sent == nbytes || sendnow == MAXALIGN(sendnow));
- shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
/*
- * For efficiency, we don't set the reader's latch here. We'll do
- * that only when the buffer fills up or after writing an entire
- * message.
+ * For efficiency, we don't update the bytes written in the shared
+ * memory and also don't set the reader's latch here. Refer to
+ * the comments atop the shm_mq_handle structure for more
+ * information.
*/
+ mqh->mqh_send_pending += MAXALIGN(sendnow);
}
}
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index e693f3f..cb1c555 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -70,11 +70,13 @@ extern shm_mq *shm_mq_get_queue(shm_mq_handle *mqh);
/* Send or receive messages. */
extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
- Size nbytes, const void *data, bool nowait);
-extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh,
- shm_mq_iovec *iov, int iovcnt, bool nowait);
+ Size nbytes, const void *data, bool nowait,
+ bool force_flush);
+extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov,
+ int iovcnt, bool nowait, bool force_flush);
extern shm_mq_result shm_mq_receive(shm_mq_handle *mqh,
Size *nbytesp, void **datap, bool nowait);
+extern void shm_mq_flush(shm_mq_handle *mqh);
/* Wait for our counterparty to attach to the queue. */
extern shm_mq_result shm_mq_wait_for_attach(shm_mq_handle *mqh);
diff --git a/src/test/modules/test_shm_mq/test.c b/src/test/modules/test_shm_mq/test.c
index 2d8d695..be074f0 100644
--- a/src/test/modules/test_shm_mq/test.c
+++ b/src/test/modules/test_shm_mq/test.c
@@ -73,7 +73,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
test_shm_mq_setup(queue_size, nworkers, &seg, &outqh, &inqh);
/* Send the initial message. */
- res = shm_mq_send(outqh, message_size, message_contents, false);
+ res = shm_mq_send(outqh, message_size, message_contents, false, true);
if (res != SHM_MQ_SUCCESS)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -97,7 +97,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
break;
/* Send it back out. */
- res = shm_mq_send(outqh, len, data, false);
+ res = shm_mq_send(outqh, len, data, false, true);
if (res != SHM_MQ_SUCCESS)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -177,7 +177,8 @@ test_shm_mq_pipelined(PG_FUNCTION_ARGS)
*/
if (send_count < loop_count)
{
- res = shm_mq_send(outqh, message_size, message_contents, true);
+ res = shm_mq_send(outqh, message_size, message_contents, true,
+ true);
if (res == SHM_MQ_SUCCESS)
{
++send_count;
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 2180776..9b037b9 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -190,7 +190,7 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
break;
/* Send it back out. */
- res = shm_mq_send(outqh, len, data, false);
+ res = shm_mq_send(outqh, len, data, false, true);
if (res != SHM_MQ_SUCCESS)
break;
}
--
1.8.3.1
Hi,
Some minor comments.
For shm_mq.c, existing comment says:
* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;
I wonder if mqh_send_pending should be declared
after mqh_length_word_complete - this way, the order of fields matches the
order of explanation for the fields.
+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
The above can be written as:
+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))
so that when force_flush is true, the other condition is not evaluated.
Cheers
On Sat, Aug 28, 2021 at 4:29 AM Zhihong Yu <zyu@yugabyte.com> wrote:
There was a typo in the suggested code above. It should be:
+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
Cheers
Hi,
The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.
Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.
Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?
As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example, with the batching
(updating the counter only once every 4kB), it pretty much transfers
data in larger chunks with higher latency. So what if the query needs
only a small chunk, like a LIMIT query? Similarly, this might mean the
upper parts of the plan have to wait longer for the data, and thus
can't start some async operation (like sending it to an FDW, or something
like that). I do admit those are theoretical queries; I haven't tried
creating such a query.
FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
the correct solution. Can you share a "combined" patch?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote:
--Setup
SET parallel_tuple_cost TO 0; -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;
Perf analysis: Gather Node
- 43.57% shm_mq_receive
- 78.94% shm_mq_receive_bytes
- 91.27% pg_atomic_read_u64
- pg_atomic_read_u64_impl
- apic_timer_interrupt
smp_apic_timer_interrupt
Perf analysis: Worker Node
- 99.14% shm_mq_sendv
- 74.10% shm_mq_send_bytes
+ 42.35% shm_mq_inc_bytes_written
- 32.56% pg_atomic_read_u64
- pg_atomic_read_u64_impl
- 86.27% apic_timer_interrupt
+ 17.93% WaitLatch
From the perf results and also from the code analysis I can think of
two main problems here
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.
FWIW, I see times like this
postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Parallel Seq Scan on t (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms │
│ Execution Time: 24954.012 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)
Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.
We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).
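Something like this, assuming shm_mq_create() were changed to constrain
the ring size to a power of two ("mq_ring_mask" is an illustrative field
that does not exist today):

/* At create time, with mq_ring_size forced to a power of two: */
mq->mq_ring_mask = mq->mq_ring_size - 1;

/* In shm_mq_receive_bytes(), replacing "rb % (uint64) ringsize": */
offset = rb & (uint64) mq->mq_ring_mask;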
Greetings,
Andres Freund
On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,
The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.
Thanks for showing interest.
Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.
I have attached the output of lscpu. I think batching the data before
updating it in shared memory wins because we avoid frequent cache misses,
and IMHO the benefit will be larger on machines with more CPU sockets.
Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?
Actually, my intention behind the tuple queue size experiment was just to
observe the behavior. Do we really have the problem of workers stalling
on the queue while sending tuples? The perf report showed some load on
WaitLatch on the worker side, so I ran this experiment. I saw some
benefit, but it was not really huge. I am not sure whether we want to
just increase the tuple queue size or make it tunable, but if we want to
support redistribute operators sometime in the future, then maybe we
should make it grow dynamically at runtime, maybe using DSA or DSA +
shared files.
As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example with the batching
(updating the counter only once every 4kB, that pretty much transfers
data in larger chunks with higher latency. So what if the query needs
only a small chunk, like a LIMIT query? Similarly, this might mean the
upper parts of the plan have to wait for the data for longer, and thus
can't start some async operation (like send them to a FDW, or something
like that). I do admit those are theoretical queries, I haven't tried
creating such query.
Yeah, I was thinking about such cases; basically, this design can
increase the startup cost of the Gather node. I will also try to derive
such cases and test them.
FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
the correct solution. Can you share a "combined" patch?
Actually, both of these patches are the same;
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is the
cleaner version of the first patch. For the configurable tuple queue size
I did not send a patch, because I just used that for testing purposes and
never intended to propose anything. Most of my latest performance data
was collected with only
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and the
default tuple queue size.
But I am attaching both patches in case you want to play around.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patchDownload
From 84c2e46808b59f6bf7a782f6b0735dafc4e89e13 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 4 Aug 2021 16:51:01 +0530
Subject: [PATCH v1 1/2] Optimize parallel tuple send (shm_mq_send_bytes)
Do not update shm_mq's mq_bytes_written until we have written
an amount of data greater than 1/4th of the ring size. This
will prevent frequent CPU cache misses, and it will also avoid
frequent SetLatch() calls, which are quite expensive.
---
src/backend/executor/tqueue.c | 2 +-
src/backend/libpq/pqmq.c | 7 +++-
src/backend/storage/ipc/shm_mq.c | 65 +++++++++++++++++++++++++++++------
src/include/storage/shm_mq.h | 8 +++--
src/test/modules/test_shm_mq/test.c | 7 ++--
src/test/modules/test_shm_mq/worker.c | 2 +-
6 files changed, 72 insertions(+), 19 deletions(-)
diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 7af9fbe..eb0cbd7 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -60,7 +60,7 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
/* Send the tuple itself. */
tuple = ExecFetchSlotMinimalTuple(slot, &should_free);
- result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);
+ result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false, false);
if (should_free)
pfree(tuple);
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index d1a1f47..846494b 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -154,7 +154,12 @@ mq_putmessage(char msgtype, const char *s, size_t len)
for (;;)
{
- result = shm_mq_sendv(pq_mq_handle, iov, 2, true);
+ /*
+ * Immediately notify the receiver by passing force_flush as true so
+ * that the shared memory value is updated before we send the parallel
+ * message signal right after this.
+ */
+ result = shm_mq_sendv(pq_mq_handle, iov, 2, true, true);
if (pq_mq_parallel_leader_pid != 0)
SendProcSignal(pq_mq_parallel_leader_pid,
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 91a7093..3e1781c 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -120,6 +120,12 @@ struct shm_mq
* message itself, and mqh_expected_bytes - which is used only for reads -
* tracks the expected total size of the payload.
*
+ * mqh_send_pending, is number of bytes that is written to the queue but not
+ * yet updated in the shared memory. We will not update it until the written
+ * data is 1/4th of the ring size or the tuple queue is full. This will
+ * prevent frequent CPU cache misses, and it will also avoid frequent
+ * SetLatch() calls, which are quite expensive.
+ *
* mqh_counterparty_attached tracks whether we know the counterparty to have
* attached to the queue at some previous point. This lets us avoid some
* mutex acquisitions.
@@ -139,6 +145,7 @@ struct shm_mq_handle
Size mqh_consume_pending;
Size mqh_partial_bytes;
Size mqh_expected_bytes;
+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;
MemoryContext mqh_context;
@@ -294,6 +301,7 @@ shm_mq_attach(shm_mq *mq, dsm_segment *seg, BackgroundWorkerHandle *handle)
mqh->mqh_consume_pending = 0;
mqh->mqh_partial_bytes = 0;
mqh->mqh_expected_bytes = 0;
+ mqh->mqh_send_pending = 0;
mqh->mqh_length_word_complete = false;
mqh->mqh_counterparty_attached = false;
mqh->mqh_context = CurrentMemoryContext;
@@ -317,16 +325,22 @@ shm_mq_set_handle(shm_mq_handle *mqh, BackgroundWorkerHandle *handle)
/*
* Write a message into a shared message queue.
+ *
+ * When force_flush = true, we immediately update the shm_mq's mq_bytes_written
+ * and notify the receiver if it is already attached. Otherwise, we don't
+ * update it until we have written an amount of data greater than 1/4th of the
+ * ring size.
*/
shm_mq_result
-shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
+shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait,
+ bool force_flush)
{
shm_mq_iovec iov;
iov.data = data;
iov.len = nbytes;
- return shm_mq_sendv(mqh, &iov, 1, nowait);
+ return shm_mq_sendv(mqh, &iov, 1, nowait, force_flush);
}
/*
@@ -343,9 +357,12 @@ shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
* arguments, each time the process latch is set. (Once begun, the sending
* of a message cannot be aborted except by detaching from the queue; changing
* the length or payload will corrupt the queue.)
+ *
+ * For force_flush, refer comments atop shm_mq_send interface.
*/
shm_mq_result
-shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
+shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
+ bool force_flush)
{
shm_mq_result res;
shm_mq *mq = mqh->mqh_queue;
@@ -518,8 +535,19 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
mqh->mqh_counterparty_attached = true;
}
- /* Notify receiver of the newly-written data, and return. */
- SetLatch(&receiver->procLatch);
+ /*
+ * If we have written more than 1/4 of the ring or the caller has
+ * requested force flush, mark it as written in shared memory and notify
+ * the receiver. For more detail refer comments atop shm_mq_handle
+ * structure.
+ */
+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
+ {
+ shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+ SetLatch(&receiver->procLatch);
+ mqh->mqh_send_pending = 0;
+ }
+
return SHM_MQ_SUCCESS;
}
@@ -816,6 +844,13 @@ shm_mq_wait_for_attach(shm_mq_handle *mqh)
void
shm_mq_detach(shm_mq_handle *mqh)
{
+ /* Before detaching, notify already written data to the receiver. */
+ if (mqh->mqh_send_pending > 0)
+ {
+ shm_mq_inc_bytes_written(mqh->mqh_queue, mqh->mqh_send_pending);
+ mqh->mqh_send_pending = 0;
+ }
+
/* Notify counterparty that we're outta here. */
shm_mq_detach_internal(mqh->mqh_queue);
@@ -894,7 +929,7 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
/* Compute number of ring buffer bytes used and available. */
rb = pg_atomic_read_u64(&mq->mq_bytes_read);
- wb = pg_atomic_read_u64(&mq->mq_bytes_written);
+ wb = pg_atomic_read_u64(&mq->mq_bytes_written) + mqh->mqh_send_pending;
Assert(wb >= rb);
used = wb - rb;
Assert(used <= ringsize);
@@ -951,6 +986,9 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
}
else if (available == 0)
{
+ /* Update the pending send bytes in the shared memory. */
+ shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+
/*
* Since mq->mqh_counterparty_attached is known to be true at this
* point, mq_receiver has been set, and it can't change once set.
@@ -959,6 +997,12 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
Assert(mqh->mqh_counterparty_attached);
SetLatch(&mq->mq_receiver->procLatch);
+ /*
+ * We have just updated the mqh_send_pending bytes in the shared
+ * memory so reset it.
+ */
+ mqh->mqh_send_pending = 0;
+
/* Skip manipulation of our latch if nowait = true. */
if (nowait)
{
@@ -1009,13 +1053,14 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
* MAXIMUM_ALIGNOF, and each read is as well.
*/
Assert(sent == nbytes || sendnow == MAXALIGN(sendnow));
- shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
/*
- * For efficiency, we don't set the reader's latch here. We'll do
- * that only when the buffer fills up or after writing an entire
- * message.
+ * For efficiency, we don't update the bytes written in the shared
+ * memory and also don't set the reader's latch here. Refer to
+ * the comments atop the shm_mq_handle structure for more
+ * information.
*/
+ mqh->mqh_send_pending += MAXALIGN(sendnow);
}
}
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index e693f3f..cb1c555 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -70,11 +70,13 @@ extern shm_mq *shm_mq_get_queue(shm_mq_handle *mqh);
/* Send or receive messages. */
extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
- Size nbytes, const void *data, bool nowait);
-extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh,
- shm_mq_iovec *iov, int iovcnt, bool nowait);
+ Size nbytes, const void *data, bool nowait,
+ bool force_flush);
+extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov,
+ int iovcnt, bool nowait, bool force_flush);
extern shm_mq_result shm_mq_receive(shm_mq_handle *mqh,
Size *nbytesp, void **datap, bool nowait);
+extern void shm_mq_flush(shm_mq_handle *mqh);
/* Wait for our counterparty to attach to the queue. */
extern shm_mq_result shm_mq_wait_for_attach(shm_mq_handle *mqh);
diff --git a/src/test/modules/test_shm_mq/test.c b/src/test/modules/test_shm_mq/test.c
index 2d8d695..be074f0 100644
--- a/src/test/modules/test_shm_mq/test.c
+++ b/src/test/modules/test_shm_mq/test.c
@@ -73,7 +73,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
test_shm_mq_setup(queue_size, nworkers, &seg, &outqh, &inqh);
/* Send the initial message. */
- res = shm_mq_send(outqh, message_size, message_contents, false);
+ res = shm_mq_send(outqh, message_size, message_contents, false, true);
if (res != SHM_MQ_SUCCESS)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -97,7 +97,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
break;
/* Send it back out. */
- res = shm_mq_send(outqh, len, data, false);
+ res = shm_mq_send(outqh, len, data, false, true);
if (res != SHM_MQ_SUCCESS)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -177,7 +177,8 @@ test_shm_mq_pipelined(PG_FUNCTION_ARGS)
*/
if (send_count < loop_count)
{
- res = shm_mq_send(outqh, message_size, message_contents, true);
+ res = shm_mq_send(outqh, message_size, message_contents, true,
+ true);
if (res == SHM_MQ_SUCCESS)
{
++send_count;
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 2180776..9b037b9 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -190,7 +190,7 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
break;
/* Send it back out. */
- res = shm_mq_send(outqh, len, data, false);
+ res = shm_mq_send(outqh, len, data, false, true);
if (res != SHM_MQ_SUCCESS)
break;
}
--
1.8.3.1
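To make the changed interface concrete, here is a minimal usage sketch of my
own (not part of the patch): a sender pushing a batch of messages can rely on
the 1/4-ring threshold inside shm_mq_sendv() for the intermediate flushes and
only force the final one, so the receiver's latch is not set for every single
message.

static void
send_batch(shm_mq_handle *mqh, shm_mq_iovec *msgs, int nmsgs)
{
	int			i;

	for (i = 0; i < nmsgs; i++)
	{
		/* Only the last message forces the flush and the SetLatch(). */
		bool		force_flush = (i == nmsgs - 1);
		shm_mq_result res;

		res = shm_mq_sendv(mqh, &msgs[i], 1, false, force_flush);
		if (res != SHM_MQ_SUCCESS)
			break;				/* the receiver has detached */
	}
}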
Attachment: v1-0002-poc-test-parallel_tuple_queue_size.patch (text/x-patch)
From 455026b0f70eec8acf3565824c03a9e20588e9cc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 26 Jul 2021 20:18:48 +0530
Subject: [PATCH v1 2/2] poc-test-parallel_tuple_queue_size
---
src/backend/executor/execParallel.c | 3 ++-
src/backend/utils/misc/guc.c | 10 ++++++++++
src/include/storage/pg_shmem.h | 1 +
3 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f8a4a40..f9dd5fc 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -51,6 +51,7 @@
#include "utils/memutils.h"
#include "utils/snapmgr.h"
+int parallel_tuple_queue_size = 64;	/* in kB; multiplied by 1024 below */
/*
* Magic numbers for parallel executor communication. We use constants
* greater than any 32-bit integer here so that values < 2^32 can be used
@@ -67,7 +68,7 @@
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
-#define PARALLEL_TUPLE_QUEUE_SIZE 65536
+#define PARALLEL_TUPLE_QUEUE_SIZE (parallel_tuple_queue_size * 1024L)
/*
* Fixed-size random stuff that we need to pass to parallel workers.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c339acf..4d5dca5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3346,6 +3346,16 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"parallel_tuple_queue_size", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the parallel tuple queue size."),
+ GUC_UNIT_KB
+ },
+ &parallel_tuple_queue_size,
+ 64, 64, MAX_KILOBYTES,
+ NULL, NULL, NULL
+ },
+
+ {
{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
NULL,
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 059df1b..9182a0e 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern int shared_memory_type;
extern int huge_pages;
extern int huge_page_size;
+extern int parallel_tuple_queue_size;
/* Possible values for huge_pages */
typedef enum
--
1.8.3.1
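With this PoC applied, the queue size becomes a per-session knob; since the
GUC is declared with GUC_UNIT_KB, a bare number means kilobytes, so e.g.
SET parallel_tuple_queue_size = '1MB'; (or = 1024) would give each worker a
1MB tuple queue instead of the hard-coded 64kB.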
On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.
Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack; so the profile results are from a
non-optimized build, but the performance results are from an optimized
build.
FWIW, I see times like this
postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3)
 Planning Time: 0.043 ms
 Execution Time: 24954.012 ms
(6 rows)
Is this with or without the patch? I mean, can we see a comparison
showing whether the patch improved anything in your environment?
Looking at a profile I see the biggest bottleneck in the leader (which is
the bottleneck as soon as the worker count is increased) to be reading the
length word of the message. I do see shm_mq_receive_bytes() in the
profile, but the costly part there is the "read % (uint64) ringsize" -
divisions are slow. We could just compute a mask instead of the size.
Yeah, that could be done; I can test with this change as well to see how
much we gain from it.
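For illustration only, the modulo-to-mask change could look roughly like the
sketch below; it assumes the ring size gets constrained to a power of two at
queue-creation time, which nothing in the current code enforces.

/*
 * Sketch, not a patch: if shm_mq_create() rounded the ring size to a
 * power of two, the "read % (uint64) ringsize" in shm_mq_receive_bytes()
 * could become a cheap AND with a mask instead of a 64-bit division.
 */
static inline uint64
mq_ring_pos(uint64 counter, uint64 ring_size)
{
	/* valid only when ring_size is a power of two */
	return counter & (ring_size - 1);
}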
We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from
the read/write data. Or perhaps copy more info into the handle,
particularly the ringsize (or mask).
Good suggestion, I will do some experiments around this.
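For reference, the layout split being suggested might look something like the
sketch below; the exact member set is my assumption, and PG_CACHE_LINE_SIZE
(from pg_config_manual.h) is just one way to keep the hot counters on their
own cache line.

/*
 * Sketch, not a patch: keep the fields that are written on every
 * send/receive away from the read-mostly metadata, so the sender's
 * stores do not keep invalidating the cache line the receiver reads
 * for ring_size and friends.
 */
typedef struct shm_mq_split_sketch
{
	/* read-mostly once both sides are attached */
	PGPROC	   *mq_receiver;
	PGPROC	   *mq_sender;
	Size		mq_ring_size;
	Size		mq_ring_offset;
	bool		mq_detached;

	char		mq_pad[PG_CACHE_LINE_SIZE]; /* separate the hot counters */

	/* written on every send/receive */
	pg_atomic_uint64 mq_bytes_read;
	pg_atomic_uint64 mq_bytes_written;
} shm_mq_split_sketch;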
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:
On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.
Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack; so the profile results are from a
non-optimized build, but the performance results are from an optimized
build.
I'm afraid that makes the profiles just about meaningless :(.
Is this with or without the patch? I mean, can we see a comparison
showing whether the patch improved anything in your environment?
It was without any patches. I'll try the patch in a bit.
Greetings,
Andres Freund
On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:
On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.
Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack; so the profile results are from a
non-optimized build, but the performance results are from an optimized
build.
I'm afraid that makes the profiles just about meaningless :(.
Maybe it can be misleading sometimes, but I feel it is sometimes more
informative than an optimized build, where functions get inlined and it
becomes really hard to distinguish which function really has the problem.
But your point is taken and I will run with an optimized build.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 9/8/21 9:40 AM, Dilip Kumar wrote:
On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:
On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.
Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack; so the profile results are from a
non-optimized build, but the performance results are from an optimized
build.
I'm afraid that makes the profiles just about meaningless :(.
Maybe it can be misleading sometimes, but I feel it is sometimes more
informative than an optimized build, where functions get inlined and it
becomes really hard to distinguish which function really has the problem.
But your point is taken and I will run with an optimized build.
IMHO Andres is right - profiles from non-optimized builds are mostly
useless: the lack of optimization may skew timings for different parts
differently, so something that would have been optimized out may take
much more time.
It may provide valuable insights, but we definitely should not use such
binaries for benchmarking and comparisons of the patches.
As mentioned, I did some benchmarks, and I do see some nice improvements
even with properly optimized builds (-O2).
Attached is a simple script that varies a bunch of parameters (number of
workers, number of rows/columns, ...) and then measures duration of a
simple query, similar to what you did. I haven't varied the queue size,
that might be interesting too.
The PDF shows a comparison of master and the two patches. For 10k rows
there's not much difference, but for 1M and 10M rows there are some nice
improvements in the 20-30% range. Of course, it's just a single query in
a simple benchmark.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
results.tgz (application/x-compressed-tar)