Gather performance analysis

Started by Dilip Kumar over 4 years ago · 41 messages
#1 Dilip Kumar
dilipbalaut@gmail.com

Hi,

I have been working on analyzing the performance of sending tuples
from workers to the Gather node through the tuple queue. In the past there
were many off-list discussions around this area; the main
point is that when "shm_mq" was implemented, it may have been
one of the best ways to do this at the time. But now we have other
choices, like DSA for allocating shared memory on demand, or shared
temporary files for a non-blocking tuple queue.

So my motivation for looking into this area is that we now have
more flexible alternatives, so can we use them to make Gather faster?
And if so:
1. Can we actually reduce the tuple transfer cost and enable
parallelism in more cases by reducing parallel_tuple_cost?
2. Can we use the tuple queue in more places, e.g., to implement a
redistribute operator where we need to transfer data between
workers?

IMHO for #1, it will be good enough if we can make the tuple transfer
faster, but for #2 we will need a) even faster tuple transfer,
because then we will have to transfer the tuples between the workers
as well, and b) an infinite non-blocking tuple queue (maybe using a
shared temp file) so that there is no deadlock while workers are
redistributing tuples to each other.

So I have done some quick performance tests and analysis using perf,
and some experiments with small prototypes targeting different
sets of problems.

--Setup
SET parallel_tuple_cost TO 0; -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;

Perf analysis: Gather Node
      - 43.57% shm_mq_receive
         - 78.94% shm_mq_receive_bytes
            - 91.27% pg_atomic_read_u64
               - pg_atomic_read_u64_impl
                  - apic_timer_interrupt
                       smp_apic_timer_interrupt

Perf analysis: Worker Node
      - 99.14% shm_mq_sendv
         - 74.10% shm_mq_send_bytes
            + 42.35% shm_mq_inc_bytes_written
            - 32.56% pg_atomic_read_u64
               - pg_atomic_read_u64_impl
                  - 86.27% apic_timer_interrupt
            + 17.93% WaitLatch

From the perf results and also from the code analysis, I can think of
two main problems here:
1. Synchronization between the worker and the Gather node: just to
track the bytes written and read, they need to do at least 2-3
atomic operations per tuple, and I think that carries a huge
penalty due to a) frequent cache line invalidation and b) the sheer
volume of atomic operations.

2. If the tuple queue is full, the worker might need to wait for
the Gather node to consume tuples.

Experiment #1:
As part of this experiment, I have modified the sender to keep
local copies of "mq_bytes_read" and "mq_bytes_written" in the local mqh
handle, so that we don't need to frequently read/write these contended
shared-memory variables. Now we only read/write the shared memory
under the following conditions (see the condensed sketch after this
list):

1) If the number of available bytes is not enough to send the tuple,
read the updated value of bytes read, and also inform the reader about
the new writes.
2) After every 4kB written, update the shared-memory variable and
inform the reader.
3) On detach, flush any remaining data.
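
To make that concrete, here is a condensed, hypothetical sketch of the
flush logic described above (the actual POC patch is attached in #2; the
mqh_bytes_written/mqh_last_sent_bytes bookkeeping is taken from it, but
the function name is made up for illustration):

static void
maybe_flush_written_bytes(shm_mq_handle *mqh, bool force)
{
	shm_mq	   *mq = mqh->mqh_queue;

	/* mqh_bytes_written is a sender-local counter, bumped without atomics */
	if (force ||
		mqh->mqh_bytes_written - mqh->mqh_last_sent_bytes > 4096)
	{
		/* one atomic write and one SetLatch per ~4kB instead of per tuple */
		pg_atomic_write_u64(&mq->mq_bytes_written, mqh->mqh_bytes_written);
		mqh->mqh_last_sent_bytes = mqh->mqh_bytes_written;
		SetLatch(&mq->mq_receiver->procLatch);
	}
}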

Machine information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms

2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms

Observation:
- As expected, the results show that forcing parallelism (by
reducing parallel_tuple_cost) drastically hurts performance.
- But in the same scenario with the patch, we see a huge gain of ~40%.
- Even if we compare it with the non-parallel plan, we gain ~25%.
- With this, I think we can conclude that there is huge potential
for improvement if we communicate tuples in batches. 1) One simple
approach is what I used in my experiment; I think we can do a similar
optimization in the reader as well: instead of reading
bytes_written from shared memory every time, remember the previous
value and read the updated value back from shared memory only once
that is exhausted. 2) Instead of copying the whole tuple
into the tuple queue, we could store the dsa_pointers of a batch of
tuples (a rough sketch follows); I think Thomas Munro also suggested
a similar approach to Robert, which I learned in an off-list
discussion with Robert.
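
For reference, a rough sketch of idea (2); TupleBatch and
send_tuple_batched are hypothetical names, and nothing like this is
implemented yet:

#define TUPLE_BATCH_SIZE 64		/* arbitrary, for illustration only */

typedef struct TupleBatch
{
	int			ntuples;
	dsa_pointer tuples[TUPLE_BATCH_SIZE];	/* DSA offsets of MinimalTuples */
} TupleBatch;

static void
send_tuple_batched(shm_mq_handle *mqh, dsa_area *area, TupleBatch *batch,
				   MinimalTuple tuple)
{
	/* copy the tuple into on-demand shared memory instead of the ring */
	dsa_pointer dp = dsa_allocate(area, tuple->t_len);

	memcpy(dsa_get_address(area, dp), tuple, tuple->t_len);
	batch->tuples[batch->ntuples++] = dp;

	/* one small queue message now covers TUPLE_BATCH_SIZE tuples */
	if (batch->ntuples == TUPLE_BATCH_SIZE)
	{
		shm_mq_send(mqh, sizeof(TupleBatch), batch, false);
		batch->ntuples = 0;
	}
}

The Gather node would then resolve each dsa_pointer with
dsa_get_address() and free it once the tuple has been consumed.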

Experiment #2: See the behavior of increasing the parallel tuple queue
size on head.
(For this I created a small patch to make the parallel tuple queue size
configurable; a minimal sketch of the idea follows.)
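
That test patch is not attached; assuming a hypothetical
parallel_tuple_queue_size variable, the core of it is just replacing the
PARALLEL_TUPLE_QUEUE_SIZE constant in execParallel.c:

/* hypothetical GUC, in bytes; head hard-codes this as 65536 (64kB) */
int			parallel_tuple_queue_size = 65536;

/* in ExecParallelSetupTupleQueues() */
mq = shm_mq_create(tqueuespace +
				   ((Size) i) * parallel_tuple_queue_size,
				   (Size) parallel_tuple_queue_size);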

-- Results
4 WORKERS (tup_queue size= 64kB) : 38337.046 ms
4 WORKERS (tup_queue size= 1MB) : 36186.883 ms
4 WORKERS (tup_queue size= 4MB) : 36252.740 ms

8 WORKERS (tup_queue size= 64kB) : 42296.731 ms
8 WORKERS (tup_queue size= 1MB) : 37403.872 ms
8 WORKERS (tup_queue size= 4MB) : 39184.319 ms

16 WORKERS (tup_queue size= 64kB) : 42726.139 ms
16 WORKERS (tup_queue size= 1MB) : 36219.975 ms
16 WORKERS (tup_queue size= 4MB) : 39117.109 ms

Observation:
- There are some gains from increasing the tuple queue size, but they
top out around 1MB; I tried with more data as well, but the gain is not
linear and performance starts to drop after 4MB.
- If I apply both the Experiment #1 and Experiment #2 patches together,
we can further reduce the execution time to 20963.539 ms (with 4
workers and a 4MB tuple queue).

Conclusion:
With the above experiments:
1) I see huge potential in the first idea, so maybe we can do more
experiments based on the prototype implemented for it, expand the same
approach to the reader, and also try out the idea of the dsa_pointers.

2) With the second idea of the tuple queue size, I see some benefit, but
it does not scale, so for now there is not much point in pursuing this
direction. However, if we want to implement the redistribute operator
in the future, an infinite tuple queue (maybe using temp files) will be
a must for avoiding deadlock.

Note: POC patches are not attached; I will send them after some more
experiments and cleanup. Maybe I will try to optimize the reader part
as well before sending them.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#2 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#1)
1 attachment(s)
Re: Gather performance analysis

On Fri, Aug 6, 2021 at 2:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Experiment #1:
As part of this experiment, I have modified the sender to keep
local copies of "mq_bytes_read" and "mq_bytes_written" in the local mqh
handle, so that we don't need to frequently read/write these contended
shared-memory variables. Now we only read/write the shared memory
under the following conditions:

1) If the number of available bytes is not enough to send the tuple,
read the updated value of bytes read, and also inform the reader about
the new writes.
2) After every 4kB written, update the shared-memory variable and
inform the reader.
3) On detach, flush any remaining data.

...

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms

2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms

Here is the POC patch for the same. Apart from this extreme case, I am
able to see improvement with this patch for normal parallel queries as
well.

Next, I will perform some more tests with different sets of queries to
see the improvements and post the results. I will also try to
optimize the reader along similar lines.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

POC-0001-Optimize-shm_mq_send_bytes.patch (text/x-patch)
From 4d19beb74e91a8259fcfd8da7ac8b1395c5f5c6d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 4 Aug 2021 16:51:01 +0530
Subject: [PATCH] Optimize shm_mq_send_bytes

Instead of frequently updating the bytes written in shared memory,
only update it when the written amount crosses some limit (4kB); this
avoids frequent cache invalidation, atomic operations, and SetLatch calls.
---
 src/backend/access/transam/parallel.c |  4 +-
 src/backend/executor/execParallel.c   |  2 +-
 src/backend/storage/ipc/shm_mq.c      | 87 ++++++++++++++++++++++-------------
 src/include/storage/shm_mq.h          |  2 +-
 src/test/modules/test_shm_mq/setup.c  |  2 +-
 5 files changed, 59 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 3550ef1..2571807 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -431,7 +431,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 			shm_mq	   *mq;
 
 			start = error_queue_space + i * PARALLEL_ERROR_QUEUE_SIZE;
-			mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE);
+			mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE, 0);
 			shm_mq_set_receiver(mq, MyProc);
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
@@ -497,7 +497,7 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			shm_mq	   *mq;
 
 			start = error_queue_space + i * PARALLEL_ERROR_QUEUE_SIZE;
-			mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE);
+			mq = shm_mq_create(start, PARALLEL_ERROR_QUEUE_SIZE, 0);
 			shm_mq_set_receiver(mq, MyProc);
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f9dd5fc..373f920 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -567,7 +567,7 @@ ExecParallelSetupTupleQueues(ParallelContext *pcxt, bool reinitialize)
 
 		mq = shm_mq_create(tqueuespace +
 						   ((Size) i) * PARALLEL_TUPLE_QUEUE_SIZE,
-						   (Size) PARALLEL_TUPLE_QUEUE_SIZE);
+						   (Size) PARALLEL_TUPLE_QUEUE_SIZE, 4096);
 
 		shm_mq_set_receiver(mq, MyProc);
 		responseq[i] = shm_mq_attach(mq, pcxt->seg, NULL);
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 91a7093..eab7bcc 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -53,6 +53,10 @@
  * mq_ring_size and mq_ring_offset never change after initialization, and
  * can therefore be read without the lock.
  *
+ * After every mq_min_send_size bytes are written the sender will update the
+ * shared memory value mq_bytes_written to avoid frequent cache invalidations
+ * and atomic operations.
+ *
  * Importantly, mq_ring can be safely read and written without a lock.
  * At any given time, the difference between mq_bytes_read and
  * mq_bytes_written defines the number of bytes within mq_ring that contain
@@ -77,6 +81,7 @@ struct shm_mq
 	pg_atomic_uint64 mq_bytes_read;
 	pg_atomic_uint64 mq_bytes_written;
 	Size		mq_ring_size;
+	Size		mq_min_send_size;
 	bool		mq_detached;
 	uint8		mq_ring_offset;
 	char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
@@ -139,6 +144,9 @@ struct shm_mq_handle
 	Size		mqh_consume_pending;
 	Size		mqh_partial_bytes;
 	Size		mqh_expected_bytes;
+	uint64		mqh_last_sent_bytes;
+	uint64		mqh_bytes_read;
+	uint64		mqh_bytes_written;
 	bool		mqh_length_word_complete;
 	bool		mqh_counterparty_attached;
 	MemoryContext mqh_context;
@@ -155,7 +163,6 @@ static bool shm_mq_counterparty_gone(shm_mq *mq,
 static bool shm_mq_wait_internal(shm_mq *mq, PGPROC **ptr,
 								 BackgroundWorkerHandle *handle);
 static void shm_mq_inc_bytes_read(shm_mq *mq, Size n);
-static void shm_mq_inc_bytes_written(shm_mq *mq, Size n);
 static void shm_mq_detach_callback(dsm_segment *seg, Datum arg);
 
 /* Minimum queue size is enough for header and at least one chunk of data. */
@@ -168,7 +175,7 @@ MAXALIGN(offsetof(shm_mq, mq_ring)) + MAXIMUM_ALIGNOF;
  * Initialize a new shared message queue.
  */
 shm_mq *
-shm_mq_create(void *address, Size size)
+shm_mq_create(void *address, Size size, Size min_send_size)
 {
 	shm_mq	   *mq = address;
 	Size		data_offset = MAXALIGN(offsetof(shm_mq, mq_ring));
@@ -188,6 +195,7 @@ shm_mq_create(void *address, Size size)
 	mq->mq_ring_size = size - data_offset;
 	mq->mq_detached = false;
 	mq->mq_ring_offset = data_offset - offsetof(shm_mq, mq_ring);
+	mq->mq_min_send_size = min_send_size;
 
 	return mq;
 }
@@ -297,6 +305,9 @@ shm_mq_attach(shm_mq *mq, dsm_segment *seg, BackgroundWorkerHandle *handle)
 	mqh->mqh_length_word_complete = false;
 	mqh->mqh_counterparty_attached = false;
 	mqh->mqh_context = CurrentMemoryContext;
+	mqh->mqh_bytes_read = 0;
+	mqh->mqh_bytes_written = 0;
+	mqh->mqh_last_sent_bytes = 0;
 
 	if (seg != NULL)
 		on_dsm_detach(seg, shm_mq_detach_callback, PointerGetDatum(mq));
@@ -518,8 +529,17 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
 		mqh->mqh_counterparty_attached = true;
 	}
 
-	/* Notify receiver of the newly-written data, and return. */
-	SetLatch(&receiver->procLatch);
+	/*
+	 * Notify receiver of the newly-written data, only if we have written
+	 * enough data.
+	 */
+	if (mqh->mqh_bytes_written - mqh->mqh_last_sent_bytes > mq->mq_min_send_size)
+	{
+		pg_atomic_write_u64(&mq->mq_bytes_written, mqh->mqh_bytes_written);
+		mqh->mqh_last_sent_bytes = mqh->mqh_bytes_written;
+		SetLatch(&receiver->procLatch);
+	}
+
 	return SHM_MQ_SUCCESS;
 }
 
@@ -816,6 +836,16 @@ shm_mq_wait_for_attach(shm_mq_handle *mqh)
 void
 shm_mq_detach(shm_mq_handle *mqh)
 {
+
+	if (mqh->mqh_queue->mq_min_send_size > 0 &&
+		mqh->mqh_bytes_written > mqh->mqh_last_sent_bytes)
+	{
+		pg_atomic_write_u64(&mqh->mqh_queue->mq_bytes_written,
+							mqh->mqh_bytes_written);
+		mqh->mqh_last_sent_bytes = mqh->mqh_bytes_written;
+		SetLatch(&mqh->mqh_queue->mq_receiver->procLatch);
+	}
+
 	/* Notify counterparty that we're outta here. */
 	shm_mq_detach_internal(mqh->mqh_queue);
 
@@ -886,15 +916,24 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 	uint64		used;
 	Size		ringsize = mq->mq_ring_size;
 	Size		available;
-
 	while (sent < nbytes)
 	{
 		uint64		rb;
 		uint64		wb;
 
 		/* Compute number of ring buffer bytes used and available. */
-		rb = pg_atomic_read_u64(&mq->mq_bytes_read);
-		wb = pg_atomic_read_u64(&mq->mq_bytes_written);
+		wb = mqh->mqh_bytes_written;
+		rb = mqh->mqh_bytes_read;
+
+		/*
+		 * If, based on the local value of mqh_bytes_read, we don't have
+		 * enough space in the ring buffer, then read the latest value from
+		 * shared memory.  Avoiding a shared-memory read on every call
+		 * reduces the atomic operations as well as the cache misses.
+		 */
+		if ((ringsize - (wb-rb)) < nbytes)
+			mqh->mqh_bytes_read = rb = pg_atomic_read_u64(&mq->mq_bytes_read);
+
 		Assert(wb >= rb);
 		used = wb - rb;
 		Assert(used <= ringsize);
@@ -957,6 +996,9 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 			 * Therefore, we can read it without acquiring the spinlock.
 			 */
 			Assert(mqh->mqh_counterparty_attached);
+			pg_atomic_write_u64(&mq->mq_bytes_written, mqh->mqh_bytes_written);
+			mqh->mqh_last_sent_bytes = mqh->mqh_bytes_written;
+
 			SetLatch(&mq->mq_receiver->procLatch);
 
 			/* Skip manipulation of our latch if nowait = true. */
@@ -1009,13 +1051,14 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 			 * MAXIMUM_ALIGNOF, and each read is as well.
 			 */
 			Assert(sent == nbytes || sendnow == MAXALIGN(sendnow));
-			shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
 
 			/*
-			 * For efficiency, we don't set the reader's latch here.  We'll do
-			 * that only when the buffer fills up or after writing an entire
-			 * message.
+			 * For efficiency, we don't update the bytes written in the shared
+			 * memory and also don't set the reader's latch here.  We'll do
+			 * that only when the buffer fills up or after writing more than
+			 * 4kB of data.
 			 */
+			mqh->mqh_bytes_written += MAXALIGN(sendnow);
 		}
 	}
 
@@ -1253,28 +1296,6 @@ shm_mq_inc_bytes_read(shm_mq *mq, Size n)
 	SetLatch(&sender->procLatch);
 }
 
-/*
- * Increment the number of bytes written.
- */
-static void
-shm_mq_inc_bytes_written(shm_mq *mq, Size n)
-{
-	/*
-	 * Separate prior reads of mq_ring from the write of mq_bytes_written
-	 * which we're about to do.  Pairs with the read barrier found in
-	 * shm_mq_receive_bytes.
-	 */
-	pg_write_barrier();
-
-	/*
-	 * There's no need to use pg_atomic_fetch_add_u64 here, because nobody
-	 * else can be changing this value.  This method avoids taking the bus
-	 * lock unnecessarily.
-	 */
-	pg_atomic_write_u64(&mq->mq_bytes_written,
-						pg_atomic_read_u64(&mq->mq_bytes_written) + n);
-}
-
 /* Shim for on_dsm_detach callback. */
 static void
 shm_mq_detach_callback(dsm_segment *seg, Datum arg)
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index e693f3f..58a34a9 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -47,7 +47,7 @@ typedef enum
  * or written, but they need not be set by the same process.  Each must be
  * set exactly once.
  */
-extern shm_mq *shm_mq_create(void *address, Size size);
+extern shm_mq *shm_mq_create(void *address, Size size, Size min_send_size);
 extern void shm_mq_set_receiver(shm_mq *mq, PGPROC *);
 extern void shm_mq_set_sender(shm_mq *mq, PGPROC *);
 
diff --git a/src/test/modules/test_shm_mq/setup.c b/src/test/modules/test_shm_mq/setup.c
index e05e97c..0bbc7cb 100644
--- a/src/test/modules/test_shm_mq/setup.c
+++ b/src/test/modules/test_shm_mq/setup.c
@@ -143,7 +143,7 @@ setup_dynamic_shared_memory(int64 queue_size, int nworkers,
 		shm_mq	   *mq;
 
 		mq = shm_mq_create(shm_toc_allocate(toc, (Size) queue_size),
-						   (Size) queue_size);
+						   (Size) queue_size, 0);
 		shm_toc_insert(toc, i + 1, mq);
 
 		if (i == 0)
-- 
1.8.3.1

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#1)
Re: Gather performance analysis

On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms

2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#3)
1 attachment(s)
Re: Gather performance analysis

On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms

2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

Right, good observation.

- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.

Yes

If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

In one of my experiments [Test1] I have noticed that even on head the
forced parallel plan is significantly faster than the non-parallel
plan, but with the patch it is even better. The point is that even now
there might be some cases where forced parallel plans are faster, but we
are not sure whether we can reduce parallel_tuple_cost or not. With the
patch, however, it is definitely the case that the parallel tuple queue
is faster than what we have now, so I agree we should consider reducing
parallel_tuple_cost after this patch.

Additionally, I've done some more experiments with artificial workloads, as
well as workloads where the parallel plan is selected by default, and in
all cases I've seen a significant improvement. The gain is directly
proportional to the load on the tuple queue, as expected.

Test1: (Workers return all tuples, but only a few tuples are returned to the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from
generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared to forced parallel, 45% gain compared to the default plan.

Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000)
as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10% gain compared to the default parallel plan.

I have cleaned up the patch, and I will add this to the September
commitfest.

I am planning to do further testing to identify the optimal batch size
in different workloads. With the above workload I am seeing similar
results with batch sizes from 4kB to 16kB (1/4 of the ring size), so in
the attached patch I have kept it at 1/4 of the ring size. We might
change that based on more analysis and testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch (text/x-patch)
From 1142c46eaa3dcadc4bb28133ce873476ba3067c4 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 4 Aug 2021 16:51:01 +0530
Subject: [PATCH v1] Optimize parallel tuple send (shm_mq_send_bytes)

Do not update shm_mq's mq_bytes_written until we have written
an amount of data greater than 1/4th of the ring size.  This
will prevent frequent CPU cache misses, and it will also avoid
frequent SetLatch() calls, which are quite expensive.
---
 src/backend/executor/tqueue.c         |  2 +-
 src/backend/libpq/pqmq.c              |  7 +++-
 src/backend/storage/ipc/shm_mq.c      | 65 +++++++++++++++++++++++++++++------
 src/include/storage/shm_mq.h          |  8 +++--
 src/test/modules/test_shm_mq/test.c   |  7 ++--
 src/test/modules/test_shm_mq/worker.c |  2 +-
 6 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 7af9fbe..eb0cbd7 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -60,7 +60,7 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Send the tuple itself. */
 	tuple = ExecFetchSlotMinimalTuple(slot, &should_free);
-	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);
+	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false, false);
 
 	if (should_free)
 		pfree(tuple);
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index d1a1f47..846494b 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -154,7 +154,12 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 
 	for (;;)
 	{
-		result = shm_mq_sendv(pq_mq_handle, iov, 2, true);
+		/*
+		 * Immediately notify the receiver by passing force_flush as true so
+		 * that the shared memory value is updated before we send the parallel
+		 * message signal right after this.
+		 */
+		result = shm_mq_sendv(pq_mq_handle, iov, 2, true, true);
 
 		if (pq_mq_parallel_leader_pid != 0)
 			SendProcSignal(pq_mq_parallel_leader_pid,
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 91a7093..3e1781c 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -120,6 +120,12 @@ struct shm_mq
  * message itself, and mqh_expected_bytes - which is used only for reads -
  * tracks the expected total size of the payload.
  *
+ * mqh_send_pending is the number of bytes written to the queue but not
+ * yet updated in the shared memory.  We will not update it until the
+ * written data is 1/4th of the ring size or the tuple queue is full.  This will
+ * prevent frequent CPU cache misses, and it will also avoid frequent
+ * SetLatch() calls, which are quite expensive.
+ *
  * mqh_counterparty_attached tracks whether we know the counterparty to have
  * attached to the queue at some previous point.  This lets us avoid some
  * mutex acquisitions.
@@ -139,6 +145,7 @@ struct shm_mq_handle
 	Size		mqh_consume_pending;
 	Size		mqh_partial_bytes;
 	Size		mqh_expected_bytes;
+	Size		mqh_send_pending;
 	bool		mqh_length_word_complete;
 	bool		mqh_counterparty_attached;
 	MemoryContext mqh_context;
@@ -294,6 +301,7 @@ shm_mq_attach(shm_mq *mq, dsm_segment *seg, BackgroundWorkerHandle *handle)
 	mqh->mqh_consume_pending = 0;
 	mqh->mqh_partial_bytes = 0;
 	mqh->mqh_expected_bytes = 0;
+	mqh->mqh_send_pending = 0;
 	mqh->mqh_length_word_complete = false;
 	mqh->mqh_counterparty_attached = false;
 	mqh->mqh_context = CurrentMemoryContext;
@@ -317,16 +325,22 @@ shm_mq_set_handle(shm_mq_handle *mqh, BackgroundWorkerHandle *handle)
 
 /*
  * Write a message into a shared message queue.
+ *
+ * When force_flush = true, we immediately update the shm_mq's mq_bytes_written
+ * and notify the receiver if it is already attached.  Otherwise, we don't
+ * update it until we have written an amount of data greater than 1/4th of the
+ * ring size.
  */
 shm_mq_result
-shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
+shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait,
+			bool force_flush)
 {
 	shm_mq_iovec iov;
 
 	iov.data = data;
 	iov.len = nbytes;
 
-	return shm_mq_sendv(mqh, &iov, 1, nowait);
+	return shm_mq_sendv(mqh, &iov, 1, nowait, force_flush);
 }
 
 /*
@@ -343,9 +357,12 @@ shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
  * arguments, each time the process latch is set.  (Once begun, the sending
  * of a message cannot be aborted except by detaching from the queue; changing
  * the length or payload will corrupt the queue.)
+ *
+ * For force_flush, refer comments atop shm_mq_send interface.
  */
 shm_mq_result
-shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
+shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
+			 bool force_flush)
 {
 	shm_mq_result res;
 	shm_mq	   *mq = mqh->mqh_queue;
@@ -518,8 +535,19 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
 		mqh->mqh_counterparty_attached = true;
 	}
 
-	/* Notify receiver of the newly-written data, and return. */
-	SetLatch(&receiver->procLatch);
+	/*
+	 * If we have written more than 1/4 of the ring or the caller has
+	 * requested force flush, mark it as written in shared memory and notify
+	 * the receiver.  For more detail refer comments atop shm_mq_handle
+	 * structure.
+	 */
+	if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
+	{
+		shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+		SetLatch(&receiver->procLatch);
+		mqh->mqh_send_pending = 0;
+	}
+
 	return SHM_MQ_SUCCESS;
 }
 
@@ -816,6 +844,13 @@ shm_mq_wait_for_attach(shm_mq_handle *mqh)
 void
 shm_mq_detach(shm_mq_handle *mqh)
 {
+	/* Before detaching, notify the receiver of any already-written data. */
+	if (mqh->mqh_send_pending > 0)
+	{
+		shm_mq_inc_bytes_written(mqh->mqh_queue, mqh->mqh_send_pending);
+		mqh->mqh_send_pending = 0;
+	}
+
 	/* Notify counterparty that we're outta here. */
 	shm_mq_detach_internal(mqh->mqh_queue);
 
@@ -894,7 +929,7 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 
 		/* Compute number of ring buffer bytes used and available. */
 		rb = pg_atomic_read_u64(&mq->mq_bytes_read);
-		wb = pg_atomic_read_u64(&mq->mq_bytes_written);
+		wb = pg_atomic_read_u64(&mq->mq_bytes_written) + mqh->mqh_send_pending;
 		Assert(wb >= rb);
 		used = wb - rb;
 		Assert(used <= ringsize);
@@ -951,6 +986,9 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 		}
 		else if (available == 0)
 		{
+			/* Update the pending send bytes in the shared memory. */
+			shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+
 			/*
 			 * Since mq->mqh_counterparty_attached is known to be true at this
 			 * point, mq_receiver has been set, and it can't change once set.
@@ -959,6 +997,12 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 			Assert(mqh->mqh_counterparty_attached);
 			SetLatch(&mq->mq_receiver->procLatch);
 
+			/*
+			 * We have just updated the mqh_send_pending bytes in the shared
+			 * memory so reset it.
+			 */
+			mqh->mqh_send_pending = 0;
+
 			/* Skip manipulation of our latch if nowait = true. */
 			if (nowait)
 			{
@@ -1009,13 +1053,14 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 			 * MAXIMUM_ALIGNOF, and each read is as well.
 			 */
 			Assert(sent == nbytes || sendnow == MAXALIGN(sendnow));
-			shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
 
 			/*
-			 * For efficiency, we don't set the reader's latch here.  We'll do
-			 * that only when the buffer fills up or after writing an entire
-			 * message.
+			 * For efficiency, we don't update the bytes written in the shared
+			 * memory and also don't set the reader's latch here.  Refer to
+			 * the comments atop the shm_mq_handle structure for more
+			 * information.
 			 */
+			mqh->mqh_send_pending += MAXALIGN(sendnow);
 		}
 	}
 
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index e693f3f..cb1c555 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -70,11 +70,13 @@ extern shm_mq *shm_mq_get_queue(shm_mq_handle *mqh);
 
 /* Send or receive messages. */
 extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
-								 Size nbytes, const void *data, bool nowait);
-extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh,
-								  shm_mq_iovec *iov, int iovcnt, bool nowait);
+								 Size nbytes, const void *data, bool nowait,
+								 bool force_flush);
+extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov,
+								  int iovcnt, bool nowait, bool force_flush);
 extern shm_mq_result shm_mq_receive(shm_mq_handle *mqh,
 									Size *nbytesp, void **datap, bool nowait);
+extern void shm_mq_flush(shm_mq_handle *mqh);
 
 /* Wait for our counterparty to attach to the queue. */
 extern shm_mq_result shm_mq_wait_for_attach(shm_mq_handle *mqh);
diff --git a/src/test/modules/test_shm_mq/test.c b/src/test/modules/test_shm_mq/test.c
index 2d8d695..be074f0 100644
--- a/src/test/modules/test_shm_mq/test.c
+++ b/src/test/modules/test_shm_mq/test.c
@@ -73,7 +73,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
 	test_shm_mq_setup(queue_size, nworkers, &seg, &outqh, &inqh);
 
 	/* Send the initial message. */
-	res = shm_mq_send(outqh, message_size, message_contents, false);
+	res = shm_mq_send(outqh, message_size, message_contents, false, true);
 	if (res != SHM_MQ_SUCCESS)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -97,7 +97,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
 			break;
 
 		/* Send it back out. */
-		res = shm_mq_send(outqh, len, data, false);
+		res = shm_mq_send(outqh, len, data, false, true);
 		if (res != SHM_MQ_SUCCESS)
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -177,7 +177,8 @@ test_shm_mq_pipelined(PG_FUNCTION_ARGS)
 		 */
 		if (send_count < loop_count)
 		{
-			res = shm_mq_send(outqh, message_size, message_contents, true);
+			res = shm_mq_send(outqh, message_size, message_contents, true,
+							  true);
 			if (res == SHM_MQ_SUCCESS)
 			{
 				++send_count;
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 2180776..9b037b9 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -190,7 +190,7 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
 			break;
 
 		/* Send it back out. */
-		res = shm_mq_send(outqh, len, data, false);
+		res = shm_mq_send(outqh, len, data, false, true);
 		if (res != SHM_MQ_SUCCESS)
 			break;
 	}
-- 
1.8.3.1

#5 Zhihong Yu
zyu@yugabyte.com
In reply to: Dilip Kumar (#4)
Re: Gather performance analysis

On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms

2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

Right, good observation.

- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.

Yes

If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

In one of my experiments [Test1] I have noticed that even on head the
forced parallel plan is significantly faster than the non-parallel
plan, but with the patch it is even better. The point is that even now
there might be some cases where forced parallel plans are faster, but we
are not sure whether we can reduce parallel_tuple_cost or not. With the
patch, however, it is definitely the case that the parallel tuple queue
is faster than what we have now, so I agree we should consider reducing
parallel_tuple_cost after this patch.

Additionally, I've done some more experiments with artificial workloads,
as well as workloads where the parallel plan is selected by default, and in
all cases I've seen a significant improvement. The gain is directly
proportional to the load on the tuple queue, as expected.

Test1: (Workers return all tuples, but only a few tuples are returned to
the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from
generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared to forced parallel, 45% gain compared to the default plan.

Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000)
as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10% gain compared to the default parallel plan.

I have cleaned up the patch, and I will add this to the September
commitfest.

I am planning to do further testing to identify the optimal batch size
in different workloads. With the above workload I am seeing similar
results with batch sizes from 4kB to 16kB (1/4 of the ring size), so in
the attached patch I have kept it at 1/4 of the ring size. We might
change that based on more analysis and testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hi,
Some minor comments.
For shm_mq.c, the existing comment says:

* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete

+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;

I wonder if mqh_send_pending should be declared
after mqh_length_word_complete - this way, the order of fields matches the
order of explanation for the fields.

+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)

The above can be written as:

+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))

so that when force_flush is true, the other condition is not evaluated.

Cheers

#6 Zhihong Yu
zyu@yugabyte.com
In reply to: Zhihong Yu (#5)
Re: Gather performance analysis

On Sat, Aug 28, 2021 at 4:29 AM Zhihong Yu <zyu@yugabyte.com> wrote:

On Sat, Aug 28, 2021 at 12:11 AM Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com>
wrote:

On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com>
wrote:

Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms

2) Parallel with 4 workers (forced by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms

3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms

This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.

But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...

Right, good observation.

- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)

...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.

Yes

If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.

In one of my experiments [Test1] I have noticed that even on head the
forced parallel plan is significantly faster than the non-parallel
plan, but with the patch it is even better. The point is that even now
there might be some cases where forced parallel plans are faster, but we
are not sure whether we can reduce parallel_tuple_cost or not. With the
patch, however, it is definitely the case that the parallel tuple queue
is faster than what we have now, so I agree we should consider reducing
parallel_tuple_cost after this patch.

Additionally, I've done some more experiments with artificial workloads,
as well as workloads where the parallel plan is selected by default, and in
all cases I've seen a significant improvement. The gain is directly
proportional to the load on the tuple queue, as expected.

Test1: (Workers return all tuples, but only a few tuples are returned to
the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from
generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;

Target Query: SELECT random() FROM t GROUP BY a;

Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms

20% gain compared to forced parallel, 45% gain compared to the default plan.

Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from
generate_series(1,200000000) as i;

set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms

8 to 10% gain compared to the default parallel plan.

I have cleaned up the patch, and I will add this to the September
commitfest.

I am planning to do further testing to identify the optimal batch size
in different workloads. With the above workload I am seeing similar
results with batch sizes from 4kB to 16kB (1/4 of the ring size), so in
the attached patch I have kept it at 1/4 of the ring size. We might
change that based on more analysis and testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hi,
Some minor comments.
For shm_mq.c, the existing comment says:

* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete

+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;

I wonder if mqh_send_pending should be declared
after mqh_length_word_complete - this way, the order of fields matches the
order of explanation for the fields.

+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)

The above can be written as:

+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))

so that when force_flush is true, the other condition is not evaluated.

Cheers

There was a typo in the suggested code above. It should be:

+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

Cheers

#7 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Dilip Kumar (#4)
Re: Gather performance analysis

Hi,

The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.

Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, the number of cores might explain why 4 workers perform better
than 8 or 16 workers.

Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?

As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example, with the batching
(updating the counter only once every 4kB), the data is transferred in
larger chunks with higher latency. So what if the query needs only a
small chunk, like a LIMIT query? Similarly, this might mean the upper
parts of the plan have to wait longer for the data, and thus can't start
some async operation (like sending them to an FDW, or something like
that). I do admit those are theoretical queries; I haven't tried
creating such a query.
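
An illustrative query of the kind I have in mind, under a forced
parallel plan:

SELECT * FROM t LIMIT 10;

Here each worker could sit on up to one batch of tuples before the
leader sees the first row, even though the query needs only a handful.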

FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what the
correct solution is. Can you share a "combined" patch?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#1)
Re: Gather performance analysis

Hi,

On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote:

--Setup
SET parallel_tuple_cost TO 0; -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;

Perf analysis: Gather Node
      - 43.57% shm_mq_receive
         - 78.94% shm_mq_receive_bytes
            - 91.27% pg_atomic_read_u64
               - pg_atomic_read_u64_impl
                  - apic_timer_interrupt
                       smp_apic_timer_interrupt

Perf analysis: Worker Node
      - 99.14% shm_mq_sendv
         - 74.10% shm_mq_send_bytes
            + 42.35% shm_mq_inc_bytes_written
            - 32.56% pg_atomic_read_u64
               - pg_atomic_read_u64_impl
                  - 86.27% apic_timer_interrupt
            + 17.93% WaitLatch

From the perf results and also from the code analysis I can think of
two main problems here

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.

FWIW, I see times like this

postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Parallel Seq Scan on t (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms │
│ Execution Time: 24954.012 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.
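
Something along these lines, assuming the ring size is constrained to a
power of two and a hypothetical mq_ring_mask field is set up at
shm_mq_create() time:

static inline uint64
shm_mq_ring_pos(shm_mq *mq, uint64 bytes)
{
	/* replaces "bytes % (uint64) ringsize"; AND is far cheaper than DIV */
	return bytes & mq->mq_ring_mask;	/* mq_ring_mask == mq_ring_size - 1 */
}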

We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).
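
A hypothetical re-layout along those lines (the exact padding and field
placement would need benchmarking):

struct shm_mq
{
	/* read-mostly after initialization */
	slock_t		mq_mutex;
	PGPROC	   *mq_receiver;
	PGPROC	   *mq_sender;
	Size		mq_ring_size;
	uint8		mq_ring_offset;
	char		pad1[PG_CACHE_LINE_SIZE];	/* keep the hot counters off this line */

	pg_atomic_uint64 mq_bytes_read;			/* written by the receiver */
	char		pad2[PG_CACHE_LINE_SIZE];	/* avoid false sharing between the two */

	pg_atomic_uint64 mq_bytes_written;		/* written by the sender */
	bool		mq_detached;

	char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
};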

Greetings,

Andres Freund

#9 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#7)
3 attachment(s)
Re: Gather performance analysis

On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

Hi,

The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.

Thanks for showing interest.

Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, the number of cores might explain why 4 workers perform better
than 8 or 16 workers.

I have attached the output of lscpu. I think batching the data before
updating it in shared memory wins because we avoid frequent cache
misses, and IMHO the benefit will be greater on machines with more CPU
sockets.

Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?

Actually, my intention behind the tuple queue size experiment was just to
observe the behavior: do we really have a problem of workers stalling on
the queue while sending tuples? The perf report showed some load on
WaitLatch on the worker side, so I did this experiment. I saw some
benefit, but it was not really huge. I am not sure whether we want to
just increase the tuple queue size or make it tunable, but if we want to
support redistribute operators sometime in the future then maybe we
should make it grow dynamically at runtime, maybe using DSA or DSA +
shared files.

As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example, with the batching
(updating the counter only once every 4kB), the data is transferred in
larger chunks with higher latency. So what if the query needs only a
small chunk, like a LIMIT query? Similarly, this might mean the upper
parts of the plan have to wait longer for the data, and thus can't start
some async operation (like sending them to an FDW, or something like
that). I do admit those are theoretical queries; I haven't tried
creating such a query.

Yeah, I was thinking about such cases. Basically, this design can
increase the startup cost of the Gather node; I will also try to derive
such cases and test them.

FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what the
correct solution is. Can you share a "combined" patch?

Actually, both patches are the same;
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is the
cleaner version of the first patch. For the configurable tuple queue
size I did not send a patch, because I used that just for testing
purposes and never intended to propose anything. Most of my latest
performance data was generated with only
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and with
the default tuple queue size.

But I am attaching both patches in case you want to play around.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

cpuinfo (application/octet-stream)
v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch (text/x-patch)
From 84c2e46808b59f6bf7a782f6b0735dafc4e89e13 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 4 Aug 2021 16:51:01 +0530
Subject: [PATCH v1 1/2] Optimize parallel tuple send (shm_mq_send_bytes)

Do not update shm_mq's mq_bytes_written until we have written
an amount of data greater than 1/4th of the ring size.  This
will prevent frequent CPU cache misses, and it will also avoid
frequent SetLatch() calls, which are quite expensive.
---
 src/backend/executor/tqueue.c         |  2 +-
 src/backend/libpq/pqmq.c              |  7 +++-
 src/backend/storage/ipc/shm_mq.c      | 65 +++++++++++++++++++++++++++++------
 src/include/storage/shm_mq.h          |  8 +++--
 src/test/modules/test_shm_mq/test.c   |  7 ++--
 src/test/modules/test_shm_mq/worker.c |  2 +-
 6 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 7af9fbe..eb0cbd7 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -60,7 +60,7 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Send the tuple itself. */
 	tuple = ExecFetchSlotMinimalTuple(slot, &should_free);
-	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);
+	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false, false);
 
 	if (should_free)
 		pfree(tuple);
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index d1a1f47..846494b 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -154,7 +154,12 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 
 	for (;;)
 	{
-		result = shm_mq_sendv(pq_mq_handle, iov, 2, true);
+		/*
+		 * Immediately notify the receiver by passing force_flush as true so
+		 * that the shared memory value is updated before we send the parallel
+		 * message signal right after this.
+		 */
+		result = shm_mq_sendv(pq_mq_handle, iov, 2, true, true);
 
 		if (pq_mq_parallel_leader_pid != 0)
 			SendProcSignal(pq_mq_parallel_leader_pid,
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 91a7093..3e1781c 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -120,6 +120,12 @@ struct shm_mq
  * message itself, and mqh_expected_bytes - which is used only for reads -
  * tracks the expected total size of the payload.
  *
+ * mqh_send_pending is the number of bytes written to the queue but not
+ * yet updated in the shared memory.  We will not update it until the
+ * written data is 1/4th of the ring size or the tuple queue is full.  This will
+ * prevent frequent CPU cache misses, and it will also avoid frequent
+ * SetLatch() calls, which are quite expensive.
+ *
  * mqh_counterparty_attached tracks whether we know the counterparty to have
  * attached to the queue at some previous point.  This lets us avoid some
  * mutex acquisitions.
@@ -139,6 +145,7 @@ struct shm_mq_handle
 	Size		mqh_consume_pending;
 	Size		mqh_partial_bytes;
 	Size		mqh_expected_bytes;
+	Size		mqh_send_pending;
 	bool		mqh_length_word_complete;
 	bool		mqh_counterparty_attached;
 	MemoryContext mqh_context;
@@ -294,6 +301,7 @@ shm_mq_attach(shm_mq *mq, dsm_segment *seg, BackgroundWorkerHandle *handle)
 	mqh->mqh_consume_pending = 0;
 	mqh->mqh_partial_bytes = 0;
 	mqh->mqh_expected_bytes = 0;
+	mqh->mqh_send_pending = 0;
 	mqh->mqh_length_word_complete = false;
 	mqh->mqh_counterparty_attached = false;
 	mqh->mqh_context = CurrentMemoryContext;
@@ -317,16 +325,22 @@ shm_mq_set_handle(shm_mq_handle *mqh, BackgroundWorkerHandle *handle)
 
 /*
  * Write a message into a shared message queue.
+ *
+ * When force_flush = true, we immediately update the shm_mq's mq_bytes_written
+ * and notify the receiver if it is already attached.  Otherwise, we don't
+ * update it until we have written an amount of data greater than 1/4th of the
+ * ring size.
  */
 shm_mq_result
-shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
+shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait,
+			bool force_flush)
 {
 	shm_mq_iovec iov;
 
 	iov.data = data;
 	iov.len = nbytes;
 
-	return shm_mq_sendv(mqh, &iov, 1, nowait);
+	return shm_mq_sendv(mqh, &iov, 1, nowait, force_flush);
 }
 
 /*
@@ -343,9 +357,12 @@ shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
  * arguments, each time the process latch is set.  (Once begun, the sending
  * of a message cannot be aborted except by detaching from the queue; changing
  * the length or payload will corrupt the queue.)
+ *
+ * For force_flush, refer comments atop shm_mq_send interface.
  */
 shm_mq_result
-shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
+shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
+			 bool force_flush)
 {
 	shm_mq_result res;
 	shm_mq	   *mq = mqh->mqh_queue;
@@ -518,8 +535,19 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
 		mqh->mqh_counterparty_attached = true;
 	}
 
-	/* Notify receiver of the newly-written data, and return. */
-	SetLatch(&receiver->procLatch);
+	/*
+	 * If we have written more than 1/4 of the ring, or the caller has
+	 * requested a force flush, record the writes in shared memory and notify
+	 * the receiver.  For more details, refer to the comments atop the
+	 * shm_mq_handle structure.
+	 */
+	if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
+	{
+		shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+		SetLatch(&receiver->procLatch);
+		mqh->mqh_send_pending = 0;
+	}
+
 	return SHM_MQ_SUCCESS;
 }
 
@@ -816,6 +844,13 @@ shm_mq_wait_for_attach(shm_mq_handle *mqh)
 void
 shm_mq_detach(shm_mq_handle *mqh)
 {
+	/* Before detaching, notify the receiver of any already-written data. */
+	if (mqh->mqh_send_pending > 0)
+	{
+		shm_mq_inc_bytes_written(mqh->mqh_queue, mqh->mqh_send_pending);
+		mqh->mqh_send_pending = 0;
+	}
+
 	/* Notify counterparty that we're outta here. */
 	shm_mq_detach_internal(mqh->mqh_queue);
 
@@ -894,7 +929,7 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 
 		/* Compute number of ring buffer bytes used and available. */
 		rb = pg_atomic_read_u64(&mq->mq_bytes_read);
-		wb = pg_atomic_read_u64(&mq->mq_bytes_written);
+		wb = pg_atomic_read_u64(&mq->mq_bytes_written) + mqh->mqh_send_pending;
 		Assert(wb >= rb);
 		used = wb - rb;
 		Assert(used <= ringsize);
@@ -951,6 +986,9 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 		}
 		else if (available == 0)
 		{
+			/* Update the pending send bytes in the shared memory. */
+			shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+
 			/*
 			 * Since mq->mqh_counterparty_attached is known to be true at this
 			 * point, mq_receiver has been set, and it can't change once set.
@@ -959,6 +997,12 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 			Assert(mqh->mqh_counterparty_attached);
 			SetLatch(&mq->mq_receiver->procLatch);
 
+			/*
+			 * We have just flushed the mqh_send_pending bytes to shared
+			 * memory, so reset the local counter.
+			 */
+			mqh->mqh_send_pending = 0;
+
 			/* Skip manipulation of our latch if nowait = true. */
 			if (nowait)
 			{
@@ -1009,13 +1053,14 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 			 * MAXIMUM_ALIGNOF, and each read is as well.
 			 */
 			Assert(sent == nbytes || sendnow == MAXALIGN(sendnow));
-			shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
 
 			/*
-			 * For efficiency, we don't set the reader's latch here.  We'll do
-			 * that only when the buffer fills up or after writing an entire
-			 * message.
+			 * For efficiency, we don't update the bytes written in the shared
+			 * memory and also don't set the reader's latch here.  Refer to
+			 * the comments atop the shm_mq_handle structure for more
+			 * information.
 			 */
+			mqh->mqh_send_pending += MAXALIGN(sendnow);
 		}
 	}
 
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index e693f3f..cb1c555 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -70,11 +70,13 @@ extern shm_mq *shm_mq_get_queue(shm_mq_handle *mqh);
 
 /* Send or receive messages. */
 extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
-								 Size nbytes, const void *data, bool nowait);
-extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh,
-								  shm_mq_iovec *iov, int iovcnt, bool nowait);
+								 Size nbytes, const void *data, bool nowait,
+								 bool force_flush);
+extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov,
+								  int iovcnt, bool nowait, bool force_flush);
 extern shm_mq_result shm_mq_receive(shm_mq_handle *mqh,
 									Size *nbytesp, void **datap, bool nowait);
+extern void shm_mq_flush(shm_mq_handle *mqh);
 
 /* Wait for our counterparty to attach to the queue. */
 extern shm_mq_result shm_mq_wait_for_attach(shm_mq_handle *mqh);
diff --git a/src/test/modules/test_shm_mq/test.c b/src/test/modules/test_shm_mq/test.c
index 2d8d695..be074f0 100644
--- a/src/test/modules/test_shm_mq/test.c
+++ b/src/test/modules/test_shm_mq/test.c
@@ -73,7 +73,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
 	test_shm_mq_setup(queue_size, nworkers, &seg, &outqh, &inqh);
 
 	/* Send the initial message. */
-	res = shm_mq_send(outqh, message_size, message_contents, false);
+	res = shm_mq_send(outqh, message_size, message_contents, false, true);
 	if (res != SHM_MQ_SUCCESS)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -97,7 +97,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
 			break;
 
 		/* Send it back out. */
-		res = shm_mq_send(outqh, len, data, false);
+		res = shm_mq_send(outqh, len, data, false, true);
 		if (res != SHM_MQ_SUCCESS)
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -177,7 +177,8 @@ test_shm_mq_pipelined(PG_FUNCTION_ARGS)
 		 */
 		if (send_count < loop_count)
 		{
-			res = shm_mq_send(outqh, message_size, message_contents, true);
+			res = shm_mq_send(outqh, message_size, message_contents, true,
+							  true);
 			if (res == SHM_MQ_SUCCESS)
 			{
 				++send_count;
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 2180776..9b037b9 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -190,7 +190,7 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
 			break;
 
 		/* Send it back out. */
-		res = shm_mq_send(outqh, len, data, false);
+		res = shm_mq_send(outqh, len, data, false, true);
 		if (res != SHM_MQ_SUCCESS)
 			break;
 	}
-- 
1.8.3.1
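
To restate the core idea of the 0001 patch outside the diff: the sender
accumulates written bytes in a backend-local counter and publishes them to
the shared counter only at a threshold. A minimal sketch of that pattern,
using PostgreSQL's atomics API but a hypothetical handle structure rather
than the actual shm_mq code:

	/* Batched-flush sketch; hypothetical types and names. */
	typedef struct sketch_handle
	{
		pg_atomic_uint64 *shared_bytes_written;	/* polled by the receiver */
		uint64		pending;		/* written locally, not yet published */
		uint64		ring_size;
	} sketch_handle;

	static void
	sketch_note_write(sketch_handle *h, uint64 nbytes, bool force_flush)
	{
		h->pending += nbytes;

		/* Publish only after buffering 1/4 of the ring, or when forced. */
		if (force_flush || h->pending > h->ring_size / 4)
		{
			(void) pg_atomic_fetch_add_u64(h->shared_bytes_written,
										   h->pending);
			h->pending = 0;
			/* the real code would also SetLatch() the receiver here */
		}
	}

Since the receiver's view of mq_bytes_written can now lag, the sender adds
mqh_send_pending back in when computing its own free-space estimate, as the
patch does in shm_mq_send_bytes().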

v1-0002-poc-test-parallel_tuple_queue_size.patch (text/x-patch)
From 455026b0f70eec8acf3565824c03a9e20588e9cc Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 26 Jul 2021 20:18:48 +0530
Subject: [PATCH v1 2/2] poc-test-parallel_tuple_queue_size

---
 src/backend/executor/execParallel.c |  3 ++-
 src/backend/utils/misc/guc.c        | 10 ++++++++++
 src/include/storage/pg_shmem.h      |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f8a4a40..f9dd5fc 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -51,6 +51,7 @@
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
+int			parallel_tuple_queue_size = 64;	/* kB; matches the GUC boot value */
 /*
  * Magic numbers for parallel executor communication.  We use constants
  * greater than any 32-bit integer here so that values < 2^32 can be used
@@ -67,7 +68,7 @@
 #define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xE00000000000000A)
 
-#define PARALLEL_TUPLE_QUEUE_SIZE		65536
+#define PARALLEL_TUPLE_QUEUE_SIZE	(parallel_tuple_queue_size * 1024L)
 
 /*
  * Fixed-size random stuff that we need to pass to parallel workers.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c339acf..4d5dca5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3346,6 +3346,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"parallel_tuple_queue_size", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("Sets the parallel tuple queue size."),
+			GUC_UNIT_KB
+		},
+		&parallel_tuple_queue_size,
+		64, 64, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
 			gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
 			NULL,
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 059df1b..9182a0e 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 extern int	shared_memory_type;
 extern int	huge_pages;
 extern int	huge_page_size;
+extern int	parallel_tuple_queue_size;
 
 /* Possible values for huge_pages */
 typedef enum
-- 
1.8.3.1
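
A usage note inferred from the 0002 POC (not something spelled out in the
patch itself): since the GUC is declared with GUC_UNIT_KB, the queue size
should become adjustable per session with something like
SET parallel_tuple_queue_size = '1MB'; which would make it easy to retest
the same query with different queue sizes.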

#10Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#8)
Re: Gather performance analysis

On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.

Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack. So the profile results are from a build
without optimizations, but the performance results are from an optimized
build.

FWIW, I see times like this

postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3)
 Planning Time: 0.043 ms
 Execution Time: 24954.012 ms
(6 rows)

Is this with or without the patch? I mean, can we see a comparison showing
whether the patch improved anything in your environment?

Looking at a profile I see the biggest bottleneck in the leader (which is
the bottleneck as soon as the worker count is increased) to be reading the
length word of the message. I do see shm_mq_receive_bytes() in the profile,
but the costly part there is the "read % (uint64) ringsize" - divisions are
slow. We could just compute a mask instead of the size.

Yeah, that could be done. I can test with this change as well to see how
much we gain from it.
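
For illustration, the division-free variant could look roughly like the
following, assuming the ring size is forced to a power of two (a sketch
with hypothetical variable names, not actual shm_mq code):

	/*
	 * Ring offsets are computed as "counter % ringsize".  If ringsize is a
	 * power of two, the modulo can be replaced by a bit mask, since
	 * x % 2^k == x & (2^k - 1) for unsigned x.
	 */
	uint64		ringmask = ringsize - 1;	/* precomputed once, e.g. in the handle */
	uint64		offset = rb & ringmask;		/* instead of: rb % (uint64) ringsize */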

We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly
the ringsize (or mask).

Good suggestion, I will do some experiments around this.
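
For concreteness, the layout change being suggested might look something
like this (a sketch only; the field set follows the current struct, and the
explicit-padding approach is an assumption):

	/*
	 * Sketch: keep the fields that are effectively constant after setup on
	 * a different cache line from the counters that bounce between sender
	 * and receiver.
	 */
	struct shm_mq
	{
		/* read-mostly after initialization */
		slock_t		mq_mutex;
		PGPROC	   *mq_receiver;
		PGPROC	   *mq_sender;
		Size		mq_ring_size;
		uint8		mq_ring_offset;
		bool		mq_detached;

		/* pad so the hot counters start on a fresh cache line */
		char		mq_pad[PG_CACHE_LINE_SIZE];

		/* frequently written by one side, polled by the other */
		pg_atomic_uint64 mq_bytes_read;
		pg_atomic_uint64 mq_bytes_written;

		char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
	};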

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#11Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#10)
Re: Gather performance analysis

Hi,

On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:

On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.

Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack. So the profile results are from a build
without optimizations, but the performance results are from an optimized
build.

I'm afraid that makes the profiles just about meaningless :(.

Is this with or without the patch? I mean, can we see a comparison showing
whether the patch improved anything in your environment?

It was without any patches. I'll try the patch in a bit.

Greetings,

Andres Freund

#12Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#11)
Re: Gather performance analysis

On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:

On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.

Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack. So the profile results are from a build
without optimizations, but the performance results are from an optimized
build.

I'm afraid that makes the profiles just about meaningless :(.

Maybe it can be misleading sometimes, but I feel it is sometimes more
informative than an optimized build, where some functions get inlined and
it becomes really hard to distinguish which function really has the
problem. But your point is taken, and I will run with an optimized build.
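
(For what it's worth, an -O2 build with debug symbols need not lose all
function-level detail: as noted above, perf with --call-graph=dwarf can
usually still attribute samples to inlined frames.)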

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#13Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Dilip Kumar (#12)
3 attachment(s)
Re: Gather performance analysis

On 9/8/21 9:40 AM, Dilip Kumar wrote:

On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:

On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:

Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.

Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack. So the profile results are from a build
without optimizations, but the performance results are from an optimized
build.

I'm afraid that makes the profiles just about meaningless :(.

Maybe it can be misleading sometimes, but I feel it is sometimes more
informative than an optimized build, where some functions get inlined and
it becomes really hard to distinguish which function really has the
problem. But your point is taken, and I will run with an optimized build.

IMHO Andres is right: profiling a build without optimizations may make the
profiles mostly useless in most cases - it may skew timings for different
parts differently, so something that would be optimized out may appear to
take much more time.

It may provide valuable insights, but we definitely should not use such
binaries for benchmarking and comparing the patches.

As mentioned, I did some benchmarks, and I do see some nice improvements
even with properly optimized (-O2) builds.

Attached is a simple script that varies a bunch of parameters (number of
workers, number of rows/columns, ...) and then measures the duration of a
simple query, similar to what you did. I haven't varied the queue size;
that might be interesting too.

The PDF shows a comparison of master and the two patches. For 10k rows
there's not much difference, but for 1M and 10M rows there are some nice
improvements in the 20-30% range. Of course, it's just a single query in
a simple benchmark.
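
(For anyone wanting to reproduce this kind of comparison by hand, timing
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t on each build while varying
max_parallel_workers_per_gather should approximate what the script does,
though the exact parameter grid used here is an assumption.)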

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

queue bench - comparison.pdf (application/pdf)
results.tgz (application/x-compressed-tar)
�{ce�}��PN&�{o�)W�T�gp��L�Nt\�������L���^�3;��X��U�Te��^�"/CQ���TVk��p�BhZe�L��&�n^F��pf��He�m
�R�����T�]Iex^�7Se�DR�������DeXYL�mu^ex�F������L��*Nu����dt,�2���,QV��2�����'�B\\e���~�{*�0���[��`"i�|���w���o���TU�L!���hZe���^pU���T�kL����mg���������v<���
T�m�:��m.�'Zo�����	�L�qjSy�-LF�|��u�H^.x-8�*S�M�B��ST8H����)����`,#A��2t[eh�5D��E}�=r���e�/�H��D�f\�� u*�7��!3�����x�Qy��0m�!��2F�*������c�������[Y���/�e$�aD1+�S���h�G��Py����!6��������@G�+y;&��4��	�M����&��0�6;���l;&9�!l=G,����f��������A�Jx@D}4�1D%�O)V!dk�5F���6��I�_�H9��)O���'vLT$y�7����j����������gB<We�������2�c�����.������C�$>u��=��Ie�fxZe�!��^�������\�q�R|���OAe��1��y��2����k��q�*������O�k|Fe`���ez	�\��J���|��`�P���+L�Zsa���}���f�JT�a��ig�Sf��2�?��*��������?����)���d^���2��Rq�zq^\e��2��$%^F.���������?�2��r�P4��W������D���q�"�����k��s�B�����\�NWUtm��K���Y��[�T��o�)����AU�`Zc����GpI��j�e
��|��qCf����`A����];XL�/����������,��B��g���Y�,�zX0�`Q7_jY2<�����8X� _l,�����3��Y�7�c8����[jY��ZL���|�����,X�!�[X�������~	��xK'����R�Yt!X�pv�����`I_��?��.��lL���>$`].���$$Du��?������}�qd��:N�ad�Y�p	���"n�9K���3��3�������!������D���Lo�������.+{\��������O����	|9s�5���l�D1QF+=*g(,93-}�l��3�3%H��IL�'��X*g���Y@�
�yd����Lm��}���i-��r�3��c9��X2���*�g���'�
��Oj�\K�(�@�E�p��S��!�;�{7k����E����Z����Y�c��t�n,z8�VU��<a�j0]7��F�����b`QU�e��G�&�"XD���*��"LMK�)�h3���&�c&X|��E8���+���`�	,F�z�r,���U�]���e��>iYr{��e���]D�����v��Z,�����E���,��.��Y+
�"s)$J,�+.m���Ge�_���� X�%�h�����$t�X����������%�n��H\��b[���V��@@�'�B4������OXG0l7liV��=L<�J!Q���Sl&Z
�Y`Z�/���s��`1��eQ�H��p��f���X7v`e`�f�"f�[%���E����YK*���|�b579��`q�,�5f�-`YL���GY��`��`���}�twq�(\�7������`���������xY��2y��,��`12�p�,:�Y�
��'�R
��	����`�����i
u��`{Jha��SE���m����/-���"���8�+)s-����h8L
3n������#� Z��*�X����{K��z[���%�-��l��$����&��
DKPi�E0y x��n���h��h�f��G�o����>/V�n�eA�D�����n��I{�h)��TJ
dM�K��3mK{_������%��1$��-2���
��	�M�%d9�
i��=19��d�;B���f�M��l�A����E�x���RB�eY�����r|"�I��.po�t��}�����*�p��[�R*�����z�^�x����6{W\����E���&�W��#�A��p��k�.�,G�(�2d�L�5��Ny�GV_����Q.�����u#�+�,-ON_Y�lVS�3���rk"=��Y�5}d��V�r~�����R8�V@���d�����RM�r��$������������}d��k�C?cd�T,�����Y��Or��`�%����2������i;�8�ln�H�L=���F��~O�G�k�
	���gD";��Q�j�M&\1���e���"V�dRL�0�e��Nf������$����S�N���K�/N���u�|2�������Yv��|dR��'������4I;��,^o���e{�?�TIO�S�x2I	z�39�a��Y}F=��;tge�nzr�Lb68�Z��4\=��'�U�6=y����������sT��~OG����3���O������m�������A����xaZRtz'I��t�~8;G����MZ���Y�]Z��������T��</���,��]�$�����
�)C����!��8�|i��O�^�����F�H���!��d�nY��S����$g�������2+��<kdQ�>�Y�F���
�e]�7��7r5��Y��Li���N����Y#K��h{�����h�	D=���R��M������j�/�����7�}d�
"d����"cS�O><����Oe���U�������n�/#�*3��|�3�����z&d%xp����s�zc6�c�_Y6^�2�d5y���&6G0���Y	d������a{c&��H�~����A���]��cryd�&�pY��Y���lQ����uZl����6K���W���Y^_����`�E�7���.�,-X�jQ���B���F�u����)� +�7�����9���i��#T{vFd��J�c><��Vo���r�"�W��J�F	&�� 8hK7�l���?��V(]h���5�&�5�l��iY��^��-h��+�0��'Xtg�<X�j�a�eKQ����%�V��+ni�]�����
7����>�����j�i���[�s�T�)�jyV�`b����+pK��HV���	,h�1��(^zcvZ.�gY�+yk-��o)����TS$
A��-���1J�)��RoV�7z�2!�Q������Z2A���Z�OoA��
TU�2���Z���}�-&��{nb���<�K�c�/����?���b1���A�
=�F��v��5��>k8�Dff����(�{���p�[��28�����t2���q��m��m�pIo����b
������Q%�����	���7�[o�������DW�>��.[�H��c��:��J]xpL	���hJ		]�:w����H.9�c���ma�����F�"�,Ce<1�yp,wVs�����UW�9�>�z������6Q���S�����j��;XT5��L`���Db)����n�l"�:�������%�m���(��d�I����Y�r.�:�H�xW.�a�gn_�[$�n�Nm�H�#	r�����,3l�����|:��Y�M5Q��i�3�I�[h����9S�;P��c��i��z�����#An�9:��'	�K��*���L���k1�QHV����b����-����KHug�c/>U������(��#+����S����������>������}m�TJL��L�Lug���r�rDi����1�i�>07
9pTi���i��pT���5�F��$\g���h�y����[Mo�����m�37
YpLYm���x�X�P�|V����B���m78��L8�D�<�����i�-ZGYhUn�p)���#��}{�1���lg����]o���-����K���T
�����l��d�����I�1����{:��%��4��c�n��!��l��>�V#ii�t��e��Q�)��'��b����^)�XpTm8n��7���R�Sh����.��5�C�:�HYmZ��Q�X���b�3��i[��f����L8Z�����
p���=�����:�BB[-���X�����<�z��i6F���Gg� �@�ce���)8&����O�1�/��,�g���,�Gpl|�Y���G_;���u$
Qoo�Y���-������|N�����:��n$��17��ShU�����Ln��-�N�xlL�dZNv�C_u,'A���v^��������S"�z;�J%��K3G���X�G�I��D�k�y����1���[�po����g-Xk6�!m=�L�	�x�*W��[���d����.[Q�g��:o���(g���n�_U���e���-����������>��\��*�#>!�\�[��+��������(���aG�a��F�Q�B���G���h���f������x|�,P����,�p������=�����~���5�UV��,G�I���&I���6I���FI���VI�U�Cy��!5��HGj����m4��1S
(E��#����R��K����mz���k�$9���F�I��2K
��W��������j6:Y��<5@���K�O��|v��N�����@��U�Z13l��������E���z��euxyS�����s�^��ab�{�O@
5�~{�Z�:#��yo�]�R�������2;���������Vzs�>m5 �C�-5p������B�C)���2	w}0x	��x���v�@�{�	����d.�������1vZ��GQBE-�_t�6��"M�|#-UGQ)�o�����
E�(��q�#�a�uI���#[���*K�Hd����{tp�!l(DQdzt-�c�6��d%�c}��i(*�E�v�����P��]�i9n�\[���.���������H��h��9{t=�gS)�e�F?��`�.��t��L����5�?}So�T�d0��� �2������ �d��M[0e�[��a2����5��e�P5�7������ 5�jj!e����@�r��L
X2j��5X���,�Gj@�����c���l5����d�7���W�T�7`�p"L�[�e�	�Y������t�-�7�z�l�c& 1���t�njQ���>�6�
�\���^�a��z4��-��V�7�z����g�
��@4T���	HW��Io����l���Y\�xL�p�=�'�l*N����5����;\2<�e:*�7�4\��[p#y��Q]Sp��z�X650�Y�SpD[��F[��������!�8��������}�_�hv7��@.��-60��-�v?�Kwa��L5�c���E@�Q2!l�E�����y���'����*aX
���1M�����j (*UV�E��S�^s7�A
�Bo@�x��>��8Z��S-zx�������Jj�E�Z �:z��(0
����u�Wa#.�tqF{���T"l;��1w�PP���'���*0i��s).�U�w`:(�c(�����P�c���7�y��1;�����u��eP�7�@��aa�@��&�o{�[����}~�M�6zc�0:���A�H����|b
�3�d��Y�?0�H8�[�X�2	��i�sS���t_^�c^
�������]V
�M��Py��P�rX�o��������z@�������t*3���@q}���2Ck�8�������*����	"�^^���h\P)�Q��eX_v���=4D`�v1�X�j���}�A�.�N\�S�N���X���w�H�(����{��W���V���AD��A�����i���H�����!�Q&lX�K`��g�k������r
{!A�-b
A���1�X��u�8�`�gc��
k�������Nc�Zk�j�����Y>��MX�Z�,��a�J�&G�V��;j�V�X;�����|����|�<���'`�W��j7�5��Z5�-�Z��r�5��0�������m��|\������BL���[k �� ����[����5�M
�~�]s�9a1"xY����{����,M����c�G�3���y������i���\�uf���HHb�"Yi[y����oG	��rU�Iq��-v����W�c�*�Fl���c�� �Q�*#�����E�\��B�P���!X�Z���V�	X���"+l	hx�x�w�M���MG�qAA�����N-�~�*�f�kD�>
�k�����q����kS���k3��	"�e�����Mb��kc�?*=�6�6�(0C�Z Y��+�g4������)�V����f����\�nh���I�X�J���f5��N���{�F�f��n�y���a����o�����<K�
����/��o����U{m��R��c�^[v�����p�,��c��*�k��%��7\h��b��i�b%���I�'Q�ZU���!���57:�Z��}��� ��i�J@�l{��N��<@�v�05wM�!�r�.v�������@��h������m
4p�0t��Z���@L���������O�]M�!����A��m����H���B�4
�+5��}��h���hN��k���0��c���h(
���@s[0��g4
h1b	D���f��G���1��4�vH����fW�����i�@s��	�����r_��4�y��L�|�w�y�7ZH2jH5z���?h%����u��TD�}M�6�Jx}'0�
__�=?�Lx�^�]M�\�(�L�sL3�(w��{�u�V�\Z�0�o�1�6���|���Z�\� "��E�Y�Gn.�
�m����H�4�Zf!��'vX[�lk$�a��(��v��`��s���
{LXC��������-� y���kj+
���D��f~��[���b\��90X���b`�^O���X��%i���-,����{PQXi�5O����l~vX+b�d�|-
M?hk6������!�l��$y;��X[n�����m�WXi�5?��|���V>�����EQ�p�hW�9�H|d�Y���
�B�����UTx��z����Vjc�z�>6C�%s�f�����j��L�b�_u6�}������4�^�j$�i�T�^"�}���
��h��-�_���A�V�4��1�Y~��r�Q��{�&A�e�Y�8k��d���L�,������f�-�L3��ul�?e���D�PV�"�����
5��(�
"4�����C�������,��Eb���k���D���sf��
PV�����7 1
�4E��������P��Y�V-��q�B3/��(�st����Q�M����<�QW�Y�Dt��2	��(�� 8�����C�B���g�b1�xk��Bx���c���@������`y$1�Ax�	��@�_N1@=��q��
Y�hxJ`sB�����,���2��1m�f�����M�V�FMsUm8y��d�C?����t��R�9H���g�G������B��F����"<��]s����>u�
[
���LH�7�p��Z�&���Y|��
O[;j��96���@���lA������#�����o��;����6���VClf����r�i�0���|���1����mc�'&}�w� ��|��t�1�����L�x�����VClf����i�����g�&
x �	��nd]�!,�A�k�#6]vX	��^_��*vx����i%�b��eZ�W����as���o���4�t
}�B,� 6]g���7��e�
�Y�oZ�t}8�V`�66K�#��6#��p�����]^B�����������]y����s�R.���F��+rp����j���0��lb5���p5����8�e���g#��������j5�~�g�~T���(U�wn{�p�Y������i3����}�O��}���@�������X�T�u��}�XP�v�i�X�
W�����

��~5�8�(=(m�V�[�=�D����h�[/d��Vs�D�-��mG���[�?l1od-���[��k�8�k��8s�i��lW�4B�C�yH1#�������pd��y��S6��/����r�{�
���g9�o7��(US�<��2Ng�����c��; 4w�g�y�;`�0H��[;����_���
\
��i�M���Y��8/0�����B\��28qk�����9���D4��{k������!����8%��3S�m+�������������5�W��N'�
���#�>Bt��.(��N��@'�<rl�1C�Mj<��8�L}l����xp�����l�^,��_a����:H�~���#R!r�H h��
� �<�y7��������9�]*`�/����&����	����+7B-Ry� ��������N7�B'
�[3��F��D��Gc�.-E
�c��Q�R��v@S�m���j�t�2w���q��UK�Z���q�	>����*��j�c��Y���.��U��qq�����Hl�f�N�V�3���#��M`�������4�������8�i�[G�96a�u��� ��+��<����Q�j�����sM�[���t��o��f��������6����^�U�)<�5#�o�oJl5�ff
�=na/_<;l.��d�}~���L/h���L�����wK>FlF�&�%���fT�7i�}��LC
[
����D�iG;�Y����C]�7-3��mb�6-�U`���� �wclr!��A}��z�C<���aSa�66�QTq���n;8��eh��#c��@�s�k��!;�n�`�@�TM
�-��]�H�a�����:�]��xV9�>�~i�
	]�2��Bm�Q����A
[M��Y�l�
w@�{6��84���I�b�mt"���k��un��tF��:�V�n\�9�1���e�|O�$��{*��A������*?
R����3�~9�F�e�rh>�)�P-%�a�U��<���UA�)��C$?8m]��f	����1n��_p�l�~6�f`#T8���[�M����"�[kC�<�mH@�)6s�!��M+���\?8m��c'I��^��8NRx�r��p���`��gU Kr���o�4	�^�}�;�����\Rl�2�P-c�IE
L�	��,b�u�2VC�bX��C�[3u�\
O��T��S�|�-p���`�t�R���{G�9��g��Wi��</�0����C���W�E�W���(���|+D��������M�o������:�7�WP�e3�v�����������v�Q���Wf�:A���K������������O�i����u����s|84����2�Cq����?������aX?��_�\�C�/k?�"5���;�r�������������c1���Au��
���.���g
K���7�
��v��W���{���y|+D�������N�E���n)�����1S�`5S��Z��}�5�Q���o>B�i��*��!w�q��!�����������}������D��|���D����J�OK	8�*ci���rm���Wh����]v��+$YP������!�Da�KXSX-�! 4��0B$� �g#� ���_��m����(�&����p�h{d�[
��5
/l�!\�C����	��p����9~�Mx@x�CX^��h	�R���@�ck�9�X�:�X�`��!<�5���l�-�
��G���n����O�N������������B-a�C���[eo*-�w�(<���p��=�`��	��x���:����!����DTV�C�7mbS��������,-X�l�T?W�!�!�P8�Rq3�Y��F3Hm1��������T�����n����*0��~`�8����F\F��Z���&���Rhe�����]��Z�&�D��!����kt��9�=>a�F��������
�n\i�u����@'�,�<m/�"u������#"m1vw���l�n�C�b�M�n���-�#��y��Go���^�"�5�7�.+4�8���V�\q�c�y����sB��l�6	���>�
[f�.ZG�����R 7���..�����2>ms� ��h	����lNQ��[�]���}pq�~E~���
Q�]����<��m�<��
��!���F�d}D�DGQdF����ZO�qWx�e?W�����n�/3���k"�TM
D#m��Q���t��R�9`W�R������{v��4��vZ��)�33Y�#��N���l�.!,��=R��������O~���Zb�l#����f�$�t��/��X�V��
�M���z{[n�'B4���6K�n�k�]DC
���y���w��}�~�� ��W#~��`���:�����;]���������l���qH?�5B��p@c��C�j����f1��>�����~A��v�k=tL��_����mf�n���W1	��y��f���L���h�����&��Z��2����r�l���h��
�
H/�p@��f�s��	��(���+\��ft��U�����.Ab��	�VV�+���t�Pp����?����yc�;�C8�(@��&�R���n;��s�m���"��7��I�
���W;4�V�A|��}���n�h�~Xq�:mj�v+���9vn&�@�}'��A4B��"���C�}���� ����_}�Z=[�hx�����kE�������j�o������d���m�� �������q`�n�G��!@��q�@�l-�����P��G��owR������,��M/�i�5�M@:&7����t����k0�v��AX��i������IW�xc�k�{UB�V���h�+�:w�
�B��-��v��������e��,<nf�n��������\������P��p�q'�t��RD�2���[��'���~�2iH^�oy������?��+�?��;�?��K�?��[g}�p�/�+�ht�[q��]��~�$y���]��l����	�8���3=��9~
������N�[C���t�F�r�5�,Z��JPu�y�����OW%��
�4�+y�-,���}�gt��"b��+>�7��]	x��4 {�j?��X8(8�Z���W�g�!�6v�~�")�FW���t%�U��F�;�^��]	|��X�a�f��,���b5������t�Dt�r\G��V�!��pz�tEQ�k!n������H���=�����a(#�W��1	]����?���6����*WK^���pr�
���OW%��J`����T��w,�q�����c�%Y� 4�1�A����W��wH����V���������V����!��;
.r������i!���Gc��F���j�F���8l#����OB;�pL��r�d�h�n�n�����X�Fj��J@�����}���h�]������l
���E1���<��@�c�~������6������?:W=_@�W��;4����1����W�Oo[h�)�h\~��Zm\��Q��>���x����;4�������_k���F������T+�d�������y�������N��\�3}o��k��:�������w�q_\?�Wd�a#��r55�������'w
/�MGg,w@��WY�=Z�N�_#��])C8;/���N�h�Nn}�p���������z6AW"��sH9���f^������e����q~�J�b���6]�*v�y��6.#~�tEQ��&��._���v����.A�����Jh��E"B�7C����5�����o���_����+���@^	+����m���tEQ=�G}/����q�=��ODW��2=�I@'q����5�
N��n���GIWH�rI��k_��UWcc������SH�h���w�/��`_���i�Z�F� '0���'dc�zP��\��8�Q$���_0�t���k#����+�����z�S>5M��;�J��LW|+�i�������E�<=����7�/�+�jV���N�G����8-�����p�P���!�3v\3$Y����]�!����v��"(�z�J�&�R�>��T��]�}3v����L����k���O!��5��~<]!���B�W����B^�!o����OW%����!]���~�m�]��l��8�2('!qa����]��D��F�u��~8]�!�D���MW|���4y���.�+�ht����n��~��U��]	bW�p
���}�k�+t��<�	��2�5�t��	����]q�Cn��S��� A	t=����S��|�wt���1=(��X����[��MzPH����Y�'���
.9u���1����+K��%�����m���"Z���;�l])�\9��W.��j�Dfd���c9���J�������a��u��q��f��AX�]��i���������!HZ���������<����,�+��t����W�w������;��4tw��5��_��L
jLW�!�nD���(�
�<8-�8�MW|���)�y������Er����O��o�!2���F`	~vt�d^qC6C�]D�l���\^��%�D��5�Gm]��:!-ye��AWB��n�c�����+���J}��==���~GWV>��,�$�����ey��z6�2^�-n��%�J~�^y��|��qvQt����@;��M�����t�mw�����rp���3�.��|��L%]a=(�H����+n�m"�g��&���B�������3��z�G�sA����j�u`��������_z�q�D\�l����'��|���am�)��<�u	��:��IZ�o��G��X�����X�������Q��7]~�|W�7�$��v��!��N��9�d�4��N��UG�1|~1������Z&������|�^w�������������<y������&F�����}�����:������_���<M�_�����oM��h�_�~x���Q_�0���0�{���z��tG{g��������������/M>Lf���o������������Gm��\�I�I���=�����c*�88:�L)�����>���{1L������^�������G��������^o�����L���}��s������{�����,E���)�'���l<��$)%N���,�u���5��OzC� �u/����,�IF3z�L������.~�����a�"�Z����Mwv�������,�W���p���.�!���x~u����nJ�$ruh%�R��M����hL��zT�p�E-��m��|��&�{I�bl{���yI����>E��fS��{%�Q��j<)��b?8r�:g����'��N��C9<<:��G�$�'}����������98�fv�|;���W���s��TM�7��������M������d��3M���`� ����B"��i��7���������ug�qj�~?����T�'-�~<��-���;�c����X�7������>��#+"���m�p]RRc�������l����m�~x��00P3�zqmB����;���h���]���aB��o\��������tL�h�����A_f{K[Y�Z4���A����W����4G���vB��I�~rh0}�u*��W��y4�n����df<y>��<7S��[�>�(��
�U����{�M#��n�o��H��3�?�62�3c�,���J�e�/WB�.���1�^/��hjH�g���K�_x��z��*�w���0;J
b���GT���^N��i����3Rw�;�G}�8���K�j~?yN���v
��� A����dv�BQ��gP�������������;���?�'�b]~l��u�.�+����v����0c?�g��]Z^M[]���x�������j��������u��%oE��W�V�l�-���e���%����^��L�W�V�<�C��.-���.��M��g���]ZVM;]�i�����giy5mu���s�K��]ZVM[]rc���K�WwiY5mu��L��2�*oy-m�q���N���K��i�K!��o��-E+���������'u 	{U��U�N�����M��W�V�R�����U�ay5mu��M��Z���j��R�B�}����K��i�K�e��=��J�����������w��"��i�K>��W�X�.���.9|'����J���%���S[�Z+�i�KN���[�yUT�V��
z_����UQM[]�}��ZV�*�i�K�70]�e�������{�}o�u�rQQM;]r_��'�k�_������%����.�U2^E5mu�G ��e��j�����F ��.�U�������Z��=�������%���������v��cE�[�[������%W��j�W�V�|���k�tV����i�K�(x��x�*}iy5mu�������+giy5mu��/�i�/-���.����.��2��j������}�.����j���� ��[Wi���i�K�n�a�v��j��R���M���9e�������Xf���K���������o��+j��CE5mu��M��
��j��R�����uy5mu�&���w����W�V�����VwiY5mu�x�{o�hu��U��.�h7��G���&�|8K�^�u�u�}���|�-�������C<28�a���f�G{z��D���������>�P_H���SH�?.BjF�D�p]���B��9���qN+.$���
��H�(��B�c���D9�8.$zF`�Q`�G��9��8�(6�*$:���D�a)��D7*���"�*�p\���T��CU�R]��UJ����b��T+�G��pf)$:F�~)�5"����bE�&!����o��D4�G�H�����a!�E�qd��(�
�>O�+��pB�.��Z!��R]��x�T>R9����sV7�p���NPH����b��D��
����U��w}-�W�����b�c��L+��Fh#�^1�f��L+�������|�X�0b�=����C����.��V)/c���T��m�+����<�V1�j�p�K1�E����T�K�Jy}.�.��%�
�b*��%+%�"����4u�X^1����]h��k�����EI�8,%S=��YL�D��xb�Pf�/d��Zh5��>�N)�V'B
Z�d������5���%c��r���`f��/��bH4��E�d�E9b����z2�5�_��dG&��d���^��kG�$�^)�����N�%L^\��4����x������������m���q�-��Bx�x������q��.'�NXL�8�&����d�0;b�x������J�9�q)�j�<Z*n)=u��+��P+m�U)����HKq�R��"EQ%)�P�i)DDi��,��1=���,������"P������u�)N�'�6�\j)h��Z
�9��B�m��R|�Dy�1�.F�R�$>yZ���L3}Z
F#��<<���b=�[mz7���$k������)�����4*j!�	\Dgk)<$�T1BZ_Mc6��F���`=�d1��Z�Rn(���F���6�������V��F���������������@O�4�a���4*r,�J��q���F���Vg�T�5�/$����J9����6K�c&�)�_X���I;_K��J'k����+����������z�3�
�<����|��L�P�h��S���\h�o9z���
Ih������.���N*�*��R3����)���Rs�4n����]H�����JBS���H����S�Q�i�v�$��9����=��������'a~C;��0�a\���H��y~c�C����D����dS.����$j?�'KO�Q��
��_�F1����I�� ��$�_% +K�\��<����*����>G<��Bz`��4t���7�P��p��T�����TU��%�L�s��\*��������i�v�����_��zt=�����a)�/��e��@�x���d��t7�`23	��[J�8�-��W���P%��6(&��B���b2,��N1��d�-�&���I�.tG����t���'c�>i{�S.��Fz������.e*n��h�I���$,%��b�kG�d�TL���I��q�����r������
�����1�;���qr\���MaV�? ���b�C*�O�\)�����b:�K�����
�[�VQHj���������d�'I�~q����0�%�L�$���
�yE��S�,�/)��%-�� �=�N)��f���t�vL��0�:��.J/�6������J/�0/���b_�����ae% ����fY���E�"�[��(����/b���UF'�(���OSm��`[���t��l�WJ�U�:u��f�
G�cZ�*�_�{e�JD��HS�b+��E�:$3���7�*�q�(H�n�����0.�@��� *��a#R��	e�p�W��J��&������f�-�B��#���K/��a������
�o���x�~�I�QX.�S_�_���UTx!��%��J/��m�B�04VqPz�6
b�:���u�vq��N����[�0�i�n���]�]z�e���^�������%��+�q����o������\\Q~�gqu��&Ei��Ll�n�Ou���r������4��������x���#�������������{�q~=H97
H�,5��"!��	 ���l��Mb�4�v������S�'/�p��	���_�o��Y�N�������1O��	o�D�.��l:��{I��6�]SyO�������#��m����c�f2Lf	�C�r�4�����p8����,nej ���1����I���7G�x�L9(\R���x>3�d��z��Hp\��i2�����(��\v������3>%G�0x^tUY����t~y9�
��1�5�����P��T��xn v��=w��O��5K�*c�a$E)p>���QiS�^2��q4�N����#���0��qr�U'���(��34����"��!g��~M�BQ>@Y�ZP���;��������H./������y2���fC��'�}�py��������A�D����_�P;���D���R/�k^n7���;�O�R1p�Pz7���L�����I�X��'F���V[|Xa�a:�2����0��u*����
p�Y��ye�%O	3�z�p|����/��%�R�Zb�	1n��O�M\���Cn�h���f<�5���,}a���[B�~�b�gY�)��z���Eb�Px��FY�������F3$W���$W��b����,�-J��U�
a
��ZR�9J�q���kn��x>]�}����������������C
���������'��m�I&��N��4�����t�?eT\N�7�*���AeIx�?�}���������GMV����V<��k�|oD���!3u�����q��>������8'Hxw4�k=��{�CLM�J��T��_���/��|z�������k�B|,�TQ����W�2��7���WYM��2���s���wZ�D�;�A��8
�����W���������M�Z�M�Y��X�����L4O���5Z�(^x��a�z��x�������'%m���S�r�L�x%��vdX\�9
Gz����������G9�Y��">$���D�Q��=J>���rJ�2��j���q��0s&��"���/9x����(�8}E������$�R��@��e�T�{�����wD`��"�0#L����e�1zEW�1	W����u����~Ya�������r��)�f�!�X��������EQ���K�����9H3����k�&��/�#u�S}a���>W�y-rtd����K�G,0&pG&��7y�K�����!q�}J�|Ih��Y�.2��P
�����c"5�����RP�P���C��Tc"���b���O�m����=��=�Kf��_�?oP��T.I����S���������R]��W��~o�D�^�=������'-����h�Hy�����yA��@I��������G����xUd� �E��
�M�
�����HB�t���x��$���V�$B�H�>J��G�P���Gn�>Ib�Y�d����V�=;{������{2�HH@���2��t�\�*�������4�`�De��s����e��8��Hp�R�1�+:2��N��i:���I�s�+S�Y�T{��z;.|@I�-�ZT.��T���Z���w��^�;~s�������9>{����?���J�v�O�����F�O^4~������������&�f��-�����
���C%�b*�O�����r�Y�K��X�J�c�u2���$t��z���)p�Tj�j�Z^>�{I��g���po�df����/�H.RK���B�~�bN:0O��D�~c�B�����VGz�_%4�W��g�[�?b�X����U1|D%�x��d�H��)�D�������k�b5AP�a 9v�v39��*'�&�6�$��|ZG�7^w�f���H���(���Q���x���GTu*MDx�5!�dz�W��*��K �wnh�^�^v?����nz����i	h�mr��#��M��^���1���1������t�*�S��w�1Y����=���e��rd��-+�d���B;��>Q��J�^b�iJ���	W������X	7�_��� J0A>�`7��`0�L9��5I2���4&�;�mh/TN������_���
F���"j��8���Y���8|o��\t���zZ@��vP�)]!f��_��2�A�;C�KT�0*��L
��z��H�
���NZ��K�H�Z��Q#��dh|JR�{�������+��U�FV����k�~c��_7��d�e��/���?ie#UQL�-��3+�b�?e�{��dg���)l�����i��z�I��i��2��.��(P��G�72�1�}�	b�cqq���i6T�M��a�*o�L���S���t�B�0[�T�������<�O�|4�����1�>/�QUp��{`E�Za��9g�h���|�4��s��v^aujm�tz;�]O���<5~K��_c9J4v�4Je[���Fr�<��*����}����_�o�g�k� E�X���>����*H�L��*��*���`^�����y#@�1�}P*����tQ��e�Y/����>��\`�\K;=��f���,���8y9����gh?�o�t�(	WJ�p��������Qw�^�g��F?���5C�"u��+���L����
��;�A���}:9?>z����78���\�����o�z	���%Ln��C��lC
*5��%P��!�2F�M{���vO!�]F.R�<:���]{y�����e���GSb_��1����x:���\N2Xb2<fM[��!V�=S�Y�g�#:~3�$@������������S���g�c�0�wKB�"��:�:��K�d<SR�Nq���e�$����E��Tw��)/m,����/������5U��J�A)��!�]aVO�*_c���������>�CT���)G=�+��O������
f=��%I����SCK���#���b�\
��%��W�t����L6���$��1J�A�e9L��F���
|>d�P��kwR��x��<�"��>�O�*/h�5Gf�\�'�=O��"����`�_�
&m�:FeI��r���cIS���+�JC&`$Ih�>HIS[2�����o�����]+=�������mv��U�vK3��2�M_�l��tt��PVq?�#�|�
��-��,�����=( ZRd��C����%�i��")x�OVr�tU�_��c�0��aCx��S��&�l'(W�q���	O���P`��e��7&CRg��h��0�b���2~C��=E����%h32�~�x4�����`�g�3�.��7���r��,�D��T���Fo��+��=.�����zK:g��3�}�������&e_����}�W(Q����@-��L���L�f�@c���:3#sboTuv�9���5G�;F�j��kw4�;W��t��?J��"�,#
�#����V)9�@w��b��:��'��0����NN����&�YI�Yab�d�2r$�� �[�a��38������[����<�xu�Fh��E������i�oB"���lII�9��OY�������N������S��	I����_�,7X'o��p���
!@���o�&��>|b-��I�\bw��+�����������U�6(�_[�r�>���}s�\bA���������
��
�
g�����Ta�W�O�4�JN76���p=��!���P�S������}����z����x"]gL��E2���}�.�����������Y�!B�K6�O[�8��F�PC�Of��y&�����zVQ�Ve'�g�2���[��V��Fc;]�b�l�@X`���O%w���ZP�N3bH����� �����
�n��H]��S�`
C�����i�N|�zac�|�K@,]n��*���j4�
U����@\���[����c�ai���:�'l�<�K���Y�z<N�{Y���<Vx!?�b��n��k*�2���,�Y��������������J���*|��|����V�}��+"��r$
�\5�
�������.��[��,��dp@1�1���s^�E��2��'x�s&�*��

i���������b��'�;�nA&��� ��E��?�5���W�Tyx@�C��kl���9^�JU*�� !)���`wj�%����wl&����f6
�l��@	�D"*(���%�B��L{����Z��Vx�����1���������X�bpi,��lWK�A�.�u^p����9[�j;3.�L�I��]zr�����%�Q�t��d.������3�G��]`i���R������M�	��CL
�}4G�;�|�����M���WM�EM�N�q���f�s�j}pN�pU�*�W�����e� 	����R�.�VT��#�������P�:�Q���m���7\��k��kA����3�|3�[�)�a�%le��lR�t���d�����"#����O7�?�?y�����!���;�p_|��GX�:l��:IX���U���n��3���M2�J�/�����ZL����T���cV$E	����~�}�bN�������t���S�����
�����h	
��I15�[6@�K�m)��
xM�siL\�����j<��f�A��:KJ����C�8��A?�VXp���DN����{�YW^=���u	B;-�)�-������M�SO~�B\�ll� ������;��d���L��;;���G��Gb�g<��p�?������ �G��oJM�+ka���+1m4C�	>O�<��z���7�{����[����T�Q���Q�0����Im0�����W�(h�"��[�cJ��ZA|S�|��������~�~�
��Ho����|$=����AF�+k^��~74K+G~����AWU^%Hfi�m��d��=�?���EB�J�1��������D�Wy����"�O��a��Q9m���l�J%�����-��_�\ i�n�{������cZY��/��o7���jx�>�UK���"�l����
zin��y�M(l��~M�[�I��!��|FB�\�)�����������43/I���a����b�"���6�Y0/��2�_�5o	��+2��2;u�����p��tOX���A�'Q/���&)�@?�^���A���F.����y:��wm��>|:'!��W�~x��5��o|������p�6x���`�\V�Y?����}uN�+����
-%��O�^��
<���D�����He!�i�AcMC~���Rl��+h���3f3�,x��:�M���%�
���v��n�
���c���iNfB��Y����B�?cT v�M��he?kSA%�OS��;�����%��z8z!gN?�I	�XQ�*Bd�����]6���{����>�L;)�����:�L%y����?zzs���y���wO�Ln�~V���L�{M���I�������_�����U�)�"���J9�*��~@�Y�5N�qS��4g�}�����3���A�2�lV]���m2t�M�s��������%��U���w��'u}2���&Y��������'�L���)�	� �<u*>�:&�g�+���_��&�"�����;4�
�����Z�F�7��o������>���UW��2)De6�x�������
6$��&"U�V�C"crc�1�;J�"~	=�WR�S�0������p���8t~jd��!e
g���/9!�$,��! �������4���B�Qd������V��&�y�w�%�?�m�6�r�1�Q��@h�(=I0���������O�X�9F���\�����^�o���S������,���� wZ7T;�b9�7�M�ew�{&D�F���}���7V�i8���Y����K�E�,����Hu�.��i�!����f�dw��d0����C�_�7�#�����f�)�vX���@���������������F���l�/��&���|������Hx�2�Qu��v��$�c�;i"���<AX���p^��X��
iK�5�j��xI^�p��c��goO��l���^k����cbcg��MU�X��}/�!>�����%s(�W�#z���$�;�I,C�3�B�#=.�b��i"��e/Y*�'���x-�
�O;r�O��-8I�E�#~��"P��dBB��kZ}����_0&G�ZHO.��.���"KD�&�b�
�/�x� q��P��.r&�������BxV2��p�c�\������N���:����+��:�v�
������zN�q����< `)��^%|�9��2x�kY��Y�7�F��1p�N{��{���t�/>^��f|NC��3�/�~s�����*�<��B�����s�8����P2��5���A�
;u���o4�*�V�����������q��X�]��DK���T���p��`Kc'�#�r��^��Z�����`:�w�%�3-�g�+-�������x"B�8|��<?[_��S���J>�������O��jH��{_:b'3��������@i��Y���T����F-�$�Q;����+
W���#]�'�I���#�B�(�:�v3��2�����<���2�k�1�t��@�i�q"��^�4h�L��kH���.�������X\�6�_De�����$2�p��ojJ`zv�U;�)F���x�K��h��*{�4�_�Q��
�Y��|1;O��q�h��1N����x�<�8�^���V��?}x}|vf�����`��Uz��c�t��R�Q��`�q�$�f2��
#��M��!u�F����������VB��
��ly��J��3;!�f?:)w
�9��|�a�������T�ys)#e�X�r��|Bi��|7�%����P�@N�
����xa��d���h�T�0�����
itAOA:�Yh�.I1�/���������������z�����&�g�h�8��J�E�f�Q�b��RX8	#�|r�X]c��WD`�?�[��Z��E�e��H|��tI��������������-����v�M��^I����eEF)�
65���b�"o�k��$
*l������=�����?���$X�'���MS��X�S��y)1�)t��ZY��|���x��kYuj@*�W6~�P\N����W��Qpd4u���>e�I2/��rx����hms�x�.V�-�8���K���X*����^���N�G]�������eB�<U�F���5Z����*�WZ�Vz;7OO���k�7�7�y����%y�,��-d���^����Z_06��T��	|�;|���'���C���cG��ur������vh5���`aG�^��n�%�����u����B.�N��w
-����T?��1����)��g{�
����@��+H5Q:;P�?���`eV��s�������s�2k�fT��a�Y�a����Y�?X�2u������#�x*����7�|	I=�����K��P��P�{�y�n^�#��$VV�2�v�e�,i��5�-�&�j�~�/�D9{Es��%�����r��tw��L�\������*���
q���b���C�_����fG�J��;�:V�4�8wJ�W�2>+@(c,�:�~:��p1��t�mDDY����uH`84n�	����w�>�j��z��^\L����I�@�g�s�g�8!�q�����(`��q=�0�7!�%"��LD�j��M1D�a����l8���8K+bd�ah=RU]����sGE���E�]�! �d��HT���4N<������p�E��+at�T��!k��|������v��R������j�}���_I�����k�?��n��N��Y�Y�Qf���(��:	@�������?�����'��t]��7���� u>����et �Q��h��'��w��!�\��L�����l������0�in��R�(~�NE�+~�y���r���iQ���e��:���/�����������p�B�*�q�tp����"6�2&p����Wg9���}�s���":���wK�����9��p����X���P�W�9���o���]A�Of�$����`*�����?n��)*g�"��~x����W�_�=~G�n�m��*��S6�/����������V
-H���U����xbK��G��'�oh(Y�w7'#����;j��b��e"u�b�������OgP~������_>|zGJ�����O~;9=9�{K����O����"r*��L��D��C%�b�2"��?�c�`�D��?������L�tY�#v�Y����w�N]y��q!�-�]�,��>o=u����C��;���c�!��_Ls>g�\qO�kf���-�R������G����O�>|2~��p��{t��"�,�!(�_ ��=P)�3���^C��:"��
�6Mr&/=7�7g�7y�
0���Z��ov���X�ps��]��2����6-9d��`q����|��25���m��������[q���������u�N���J��LSE]l6������^��:.%���$��=p/\�3WP�a�+������
���%L����\�#&����b��'F=��$�U�y�4�B*��V���'�| .���eB�'.��b�����.;�����k���%�p�����-kg}>;�����;~c|����M���|�\��j3x�����8.��=�g�����{v���������N
run.shapplication/x-shellscript; name=run.shDownload
#14Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#13)
Re: Gather performance analysis

On Wed, Sep 8, 2021 at 3:28 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

On 9/8/21 9:40 AM, Dilip Kumar wrote:

Maybe it can be misleading sometimes, but I feel it is sometimes more
informative than an optimized build, where some functions get inlined
and it then becomes really hard to tell which function actually has the
problem. But your point is taken and I will run with an optimized
build.

IMHO Andres is right that optimization may make profiles mostly useless
in many cases - it may skew timings for different parts differently, so
something that would be optimized out may appear to take much more time.

It may provide valuable insights, but we definitely should not use such
binaries for benchmarking and comparisons of the patches.

Yeah, I completely agree that those binaries should not be used for
benchmarking and patch comparison, and I never used them for that
purpose. I was just making the point that with debug binaries we
sometimes get valuable insight during profiling.

As mentioned, I did some benchmarks, and I do see some nice improvements
even with properly optimized builds -O2.

Attached is a simple script that varies a bunch of parameters (number of
workers, number of rows/columns, ...) and then measures the duration of a
simple query, similar to what you did. I haven't varied the queue size;
that might be interesting too.

The PDF shows a comparison of master and the two patches. For 10k rows
there's not much difference, but for 1M and 10M rows there are some nice
improvements in the 20-30% range. Of course, it's just a single query in
a simple benchmark.

Thanks for the benchmarking.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#15Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#14)
Re: Gather performance analysis

On Wed, Sep 8, 2021 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on various suggestions, I have done some more experiments with the patch.

1) I have measured the cache-miss count and I see a ~20% reduction
in cache misses with the patch (updating the shared memory counter only
after we have written a certain amount of data).
command: perf stat -e cycles,instructions,cache-references,cache-misses -p <receiver-pid>

Head:
13,918,480,258 cycles
21,082,968,730 instructions        # 1.51 insn per cycle
    13,206,426 cache-references
    12,432,402 cache-misses        # 94.139 % of all cache refs

Patch:
14,119,691,844 cycles
29,497,239,984 instructions        # 2.09 insn per cycle
     4,245,819 cache-references
     3,085,047 cache-misses        # 72.661 % of all cache refs

I have taken multiple samples with different execution times, and the
cache-miss ratio with the patch is 72-74%, whereas without the patch it
is 92-94%. So, as expected, these results clearly show that we are
saving a lot by avoiding cache misses.

2) As pointed out by Tomas, I have tried different test cases where this
patch could regress the performance

CREATE TABLE t1 (a int, b varchar);
INSERT INTO t1 SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
set enable_gathermerge=off;
Query: select * from t1 where a < 100000 order by a;

Plan:
 Sort  (cost=1714422.10..1714645.24 rows=89258 width=15)
   ->  Gather  (cost=1000.00..1707082.55 rows=89258 width=15)
         ->  Parallel Seq Scan on t1  (cost=0.00..1706082.55 rows=22314 width=15)
               Filter: (a < 100000)

So the idea is that without the patch the tuple reaches the Sort node
immediately, whereas with the patch there is some delay before the
tuple is sent to the Gather node because we are batching. Even so, I
did not notice any consistent regression with the patch; however, with
EXPLAIN ANALYZE I noticed a 2-3% drop with the patch.

3) I tried some other optimizations suggested by Andres:
a) Separating read-only and read-write data in shm_mq, and also moving
some fields out of shm_mq

struct shm_mq (after the change)
{
	/* mostly read-only fields */
	PGPROC	   *mq_receiver;
	PGPROC	   *mq_sender;
	bool		mq_detached;
	slock_t		mq_mutex;

	/* read-write fields */
	pg_atomic_uint64 mq_bytes_read;
	pg_atomic_uint64 mq_bytes_written;
	char		mq_ring[FLEXIBLE_ARRAY_MEMBER];
};

Note: mq_ring_size and mq_ring_offset moved to shm_mq_handle.

I did not see any extra improvement with this idea.
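
To make the intent of that layout concrete, here is a minimal standalone
sketch (plain C11, not PostgreSQL code; the struct name, field names,
and the 64-byte cache line size are assumptions for illustration) of the
underlying technique: keep the frequently updated counters on cache
lines separate from the read-mostly metadata, so counter updates do not
invalidate the line holding the metadata:

#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE_SIZE 64		/* assumed; typical for x86_64 */

typedef struct demo_mq
{
	/* set up once at attach time, effectively read-only afterwards */
	int			receiver_pid;
	int			sender_pid;
	_Bool		detached;

	/* hot counters, each on its own cache line to limit false sharing */
	alignas(CACHE_LINE_SIZE) atomic_uint_fast64_t bytes_read;
	alignas(CACHE_LINE_SIZE) atomic_uint_fast64_t bytes_written;
} demo_mq;

Even with such a split, the two counters are still written by one side
and read by the other, so the cache lines holding them keep bouncing
between the backends anyway; that may be why the reordering alone did
not buy anything here, while batching the counter updates did.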

4) Another thought: change the "mq_ring_size" modulo into a mask.
I think this could improve something, but currently "mq_ring_size" is
not guaranteed to be a power of two, so we cannot convert the modulo
into a mask directly.
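
For reference, a minimal standalone demonstration of the mask trick
(plain C, not PostgreSQL code; the function names are made up for the
example). The requested queue size (64kB by default) is a power of two,
but the ring is presumably what remains after the shm_mq header, which
is why mq_ring_size generally isn't one:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Offset into the ring: modulo always works; the mask form is only
 * correct when ring_size is a power of two. */
static uint64_t
ring_offset_mod(uint64_t bytes, uint64_t ring_size)
{
	return bytes % ring_size;
}

static uint64_t
ring_offset_mask(uint64_t bytes, uint64_t ring_size)
{
	assert((ring_size & (ring_size - 1)) == 0);	/* power of two */
	return bytes & (ring_size - 1);
}

int
main(void)
{
	uint64_t	ring_size = 65536;	/* hypothetical power-of-two ring */
	uint64_t	b;

	for (b = 0; b < 10 * ring_size; b += 12345)
		assert(ring_offset_mod(b, ring_size) ==
			   ring_offset_mask(b, ring_size));
	printf("mod and mask agree for power-of-two ring sizes\n");
	return 0;
}

Making this work in shm_mq would mean rounding the ring down (or the
allocation up) to a power of two, trading some queue space for cheaper
offset arithmetic on every send and receive.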

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#16Dilip Kumar
dilipbalaut@gmail.com
In reply to: Zhihong Yu (#6)
1 attachment(s)
Re: Gather performance analysis

On Sat, Aug 28, 2021 at 5:04 PM Zhihong Yu <zyu@yugabyte.com> wrote:

* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete

+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;

I wonder if mqh_send_pending should be declared after mqh_length_word_complete - this way, the order of fields matches the order of explanation for the fields.

Moved it after mqh_consume_pending, and also moved the comment so the
explanation order matches the field order.

There was a typo in the suggested code above. It should be:

+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patchDownload
From b111756f7136f3e0065a089a8616ad77b9963935 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 4 Aug 2021 16:51:01 +0530
Subject: [PATCH v2] Optimize parallel tuple send (shm_mq_send_bytes)

Do not update shm_mq's mq_bytes_written until we have written
an amount of data greater than 1/4th of the ring size.  This
will prevent frequent CPU cache misses, and it will also avoid
frequent SetLatch() calls, which are quite expensive.
---
 src/backend/executor/tqueue.c         |  2 +-
 src/backend/libpq/pqmq.c              |  7 +++-
 src/backend/storage/ipc/shm_mq.c      | 64 +++++++++++++++++++++++++++++------
 src/include/storage/shm_mq.h          |  8 +++--
 src/test/modules/test_shm_mq/test.c   |  7 ++--
 src/test/modules/test_shm_mq/worker.c |  2 +-
 6 files changed, 71 insertions(+), 19 deletions(-)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index 7af9fbe..eb0cbd7 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@ -60,7 +60,7 @@ tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
 
 	/* Send the tuple itself. */
 	tuple = ExecFetchSlotMinimalTuple(slot, &should_free);
-	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);
+	result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false, false);
 
 	if (should_free)
 		pfree(tuple);
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index d1a1f47..846494b 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -154,7 +154,12 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 
 	for (;;)
 	{
-		result = shm_mq_sendv(pq_mq_handle, iov, 2, true);
+		/*
+		 * Immediately notify the receiver by passing force_flush as true so
+		 * that the shared memory value is updated before we send the parallel
+		 * message signal right after this.
+		 */
+		result = shm_mq_sendv(pq_mq_handle, iov, 2, true, true);
 
 		if (pq_mq_parallel_leader_pid != 0)
 			SendProcSignal(pq_mq_parallel_leader_pid,
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 91a7093..4493fc1 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -109,6 +109,12 @@ struct shm_mq
  * locally by copying the chunks into a backend-local buffer.  mqh_buffer is
  * the buffer, and mqh_buflen is the number of bytes allocated for it.
  *
+ * mqh_send_pending is the number of bytes written to the queue but not yet
+ * updated in shared memory.  We will not update it until the written data
+ * is more than 1/4th of the ring size or the tuple queue is full.  This
+ * prevents frequent CPU cache misses, and it also avoids frequent
+ * SetLatch() calls, which are quite expensive.
+ *
  * mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
  * are used to track the state of non-blocking operations.  When the caller
  * attempts a non-blocking operation that returns SHM_MQ_WOULD_BLOCK, they
@@ -137,6 +143,7 @@ struct shm_mq_handle
 	char	   *mqh_buffer;
 	Size		mqh_buflen;
 	Size		mqh_consume_pending;
+	Size		mqh_send_pending;
 	Size		mqh_partial_bytes;
 	Size		mqh_expected_bytes;
 	bool		mqh_length_word_complete;
@@ -292,6 +299,7 @@ shm_mq_attach(shm_mq *mq, dsm_segment *seg, BackgroundWorkerHandle *handle)
 	mqh->mqh_buffer = NULL;
 	mqh->mqh_buflen = 0;
 	mqh->mqh_consume_pending = 0;
+	mqh->mqh_send_pending = 0;
 	mqh->mqh_partial_bytes = 0;
 	mqh->mqh_expected_bytes = 0;
 	mqh->mqh_length_word_complete = false;
@@ -317,16 +325,22 @@ shm_mq_set_handle(shm_mq_handle *mqh, BackgroundWorkerHandle *handle)
 
 /*
  * Write a message into a shared message queue.
+ *
+ * When force_flush = true, we immediately update the shm_mq's mq_bytes_written
+ * and notify the receiver if it is already attached.  Otherwise, we don't
+ * update it until we have written an amount of data greater than 1/4th of the
+ * ring size.
  */
 shm_mq_result
-shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
+shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait,
+			bool force_flush)
 {
 	shm_mq_iovec iov;
 
 	iov.data = data;
 	iov.len = nbytes;
 
-	return shm_mq_sendv(mqh, &iov, 1, nowait);
+	return shm_mq_sendv(mqh, &iov, 1, nowait, force_flush);
 }
 
 /*
@@ -343,9 +357,12 @@ shm_mq_send(shm_mq_handle *mqh, Size nbytes, const void *data, bool nowait)
  * arguments, each time the process latch is set.  (Once begun, the sending
  * of a message cannot be aborted except by detaching from the queue; changing
  * the length or payload will corrupt the queue.)
+ *
+ * For force_flush, see the comments atop the shm_mq_send interface.
  */
 shm_mq_result
-shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
+shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait,
+			 bool force_flush)
 {
 	shm_mq_result res;
 	shm_mq	   *mq = mqh->mqh_queue;
@@ -518,8 +535,18 @@ shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov, int iovcnt, bool nowait)
 		mqh->mqh_counterparty_attached = true;
 	}
 
-	/* Notify receiver of the newly-written data, and return. */
-	SetLatch(&receiver->procLatch);
+	/*
+	 * If the caller has requested a force flush or we have written more than
+	 * 1/4 of the ring size, mark it as written in shared memory and notify the
+	 * receiver.  For more details, see comments atop the shm_mq_handle struct.
+	 */
+	if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
+	{
+		shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+		SetLatch(&receiver->procLatch);
+		mqh->mqh_send_pending = 0;
+	}
+
 	return SHM_MQ_SUCCESS;
 }
 
@@ -816,6 +843,13 @@ shm_mq_wait_for_attach(shm_mq_handle *mqh)
 void
 shm_mq_detach(shm_mq_handle *mqh)
 {
+	/* Before detaching, notify the receiver of any data already written. */
+	if (mqh->mqh_send_pending > 0)
+	{
+		shm_mq_inc_bytes_written(mqh->mqh_queue, mqh->mqh_send_pending);
+		mqh->mqh_send_pending = 0;
+	}
+
 	/* Notify counterparty that we're outta here. */
 	shm_mq_detach_internal(mqh->mqh_queue);
 
@@ -894,7 +928,7 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 
 		/* Compute number of ring buffer bytes used and available. */
 		rb = pg_atomic_read_u64(&mq->mq_bytes_read);
-		wb = pg_atomic_read_u64(&mq->mq_bytes_written);
+		wb = pg_atomic_read_u64(&mq->mq_bytes_written) + mqh->mqh_send_pending;
 		Assert(wb >= rb);
 		used = wb - rb;
 		Assert(used <= ringsize);
@@ -951,6 +985,9 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 		}
 		else if (available == 0)
 		{
+			/* Update the pending send bytes in the shared memory. */
+			shm_mq_inc_bytes_written(mq, mqh->mqh_send_pending);
+
 			/*
 			 * Since mq->mqh_counterparty_attached is known to be true at this
 			 * point, mq_receiver has been set, and it can't change once set.
@@ -959,6 +996,12 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 			Assert(mqh->mqh_counterparty_attached);
 			SetLatch(&mq->mq_receiver->procLatch);
 
+			/*
+			 * We have just updated the mqh_send_pending bytes in the shared
+			 * memory so reset it.
+			 */
+			mqh->mqh_send_pending = 0;
+
 			/* Skip manipulation of our latch if nowait = true. */
 			if (nowait)
 			{
@@ -1009,13 +1052,14 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
 			 * MAXIMUM_ALIGNOF, and each read is as well.
 			 */
 			Assert(sent == nbytes || sendnow == MAXALIGN(sendnow));
-			shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
 
 			/*
-			 * For efficiency, we don't set the reader's latch here.  We'll do
-			 * that only when the buffer fills up or after writing an entire
-			 * message.
+			 * For efficiency, we don't update the bytes written in the shared
+			 * memory and also don't set the reader's latch here.  Refer to
+			 * the comments atop the shm_mq_handle structure for more
+			 * information.
 			 */
+			mqh->mqh_send_pending += MAXALIGN(sendnow);
 		}
 	}
 
diff --git a/src/include/storage/shm_mq.h b/src/include/storage/shm_mq.h
index e693f3f..cb1c555 100644
--- a/src/include/storage/shm_mq.h
+++ b/src/include/storage/shm_mq.h
@@ -70,11 +70,13 @@ extern shm_mq *shm_mq_get_queue(shm_mq_handle *mqh);
 
 /* Send or receive messages. */
 extern shm_mq_result shm_mq_send(shm_mq_handle *mqh,
-								 Size nbytes, const void *data, bool nowait);
-extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh,
-								  shm_mq_iovec *iov, int iovcnt, bool nowait);
+								 Size nbytes, const void *data, bool nowait,
+								 bool force_flush);
+extern shm_mq_result shm_mq_sendv(shm_mq_handle *mqh, shm_mq_iovec *iov,
+								  int iovcnt, bool nowait, bool force_flush);
 extern shm_mq_result shm_mq_receive(shm_mq_handle *mqh,
 									Size *nbytesp, void **datap, bool nowait);
+extern void shm_mq_flush(shm_mq_handle *mqh);
 
 /* Wait for our counterparty to attach to the queue. */
 extern shm_mq_result shm_mq_wait_for_attach(shm_mq_handle *mqh);
diff --git a/src/test/modules/test_shm_mq/test.c b/src/test/modules/test_shm_mq/test.c
index 2d8d695..be074f0 100644
--- a/src/test/modules/test_shm_mq/test.c
+++ b/src/test/modules/test_shm_mq/test.c
@@ -73,7 +73,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
 	test_shm_mq_setup(queue_size, nworkers, &seg, &outqh, &inqh);
 
 	/* Send the initial message. */
-	res = shm_mq_send(outqh, message_size, message_contents, false);
+	res = shm_mq_send(outqh, message_size, message_contents, false, true);
 	if (res != SHM_MQ_SUCCESS)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -97,7 +97,7 @@ test_shm_mq(PG_FUNCTION_ARGS)
 			break;
 
 		/* Send it back out. */
-		res = shm_mq_send(outqh, len, data, false);
+		res = shm_mq_send(outqh, len, data, false, true);
 		if (res != SHM_MQ_SUCCESS)
 			ereport(ERROR,
 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -177,7 +177,8 @@ test_shm_mq_pipelined(PG_FUNCTION_ARGS)
 		 */
 		if (send_count < loop_count)
 		{
-			res = shm_mq_send(outqh, message_size, message_contents, true);
+			res = shm_mq_send(outqh, message_size, message_contents, true,
+							  true);
 			if (res == SHM_MQ_SUCCESS)
 			{
 				++send_count;
diff --git a/src/test/modules/test_shm_mq/worker.c b/src/test/modules/test_shm_mq/worker.c
index 2180776..9b037b9 100644
--- a/src/test/modules/test_shm_mq/worker.c
+++ b/src/test/modules/test_shm_mq/worker.c
@@ -190,7 +190,7 @@ copy_messages(shm_mq_handle *inqh, shm_mq_handle *outqh)
 			break;
 
 		/* Send it back out. */
-		res = shm_mq_send(outqh, len, data, false);
+		res = shm_mq_send(outqh, len, data, false, true);
 		if (res != SHM_MQ_SUCCESS)
 			break;
 	}
-- 
1.8.3.1

#17Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Dilip Kumar (#9)
Re: Gather performance analysis

On 9/8/21 8:05 AM, Dilip Kumar wrote:

On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
wrote:

Hi,

The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.

Thanks for showing interest.

Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.

I have attached the output of lscpu. I think batching the data before
updating the shared memory will win because we avoid frequent cache
misses, and IMHO the benefit will be bigger on machines with more CPU
sockets.

Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?

Actually, my intention behind varying the tuple queue size was just to
observe the behavior: do we really have a problem of workers stalling on
the queue while sending tuples? The perf report showed some load on
WaitLatch on the worker side, so I ran this experiment. I saw some
benefit, but it was not really huge. I am not sure whether we want to
just increase the tuple queue size or make it tunable, but if we want to
support redistribute operators sometime in the future then maybe we
should make it grow dynamically at runtime, perhaps using DSA or DSA +
shared files.

Thanks. I ran a couple more benchmarks, with different queue sizes
(16kB, 64kB, 256kB and 1MB) and according to the results the queue size
really makes almost no difference. It might make a difference for some
queries, but I wouldn't bother tuning this until we have a plausible
example - increasing the queue size is not free either.

So it was worth checking, but I'd just leave it as 64kB for now. We may
revisit this later in a separate patch/thread.

As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example, with the batching
(updating the counter only once every 4kB), data is transferred in
larger chunks with higher latency. So what if the query needs only a
small chunk, like a LIMIT query? Similarly, this might mean the upper
parts of the plan have to wait longer for the data, and thus can't start
some async operation (like sending it to an FDW, or something like
that). I do admit those are theoretical queries; I haven't tried
creating such a query.

Yeah, I was thinking about such cases too; basically, this design can
increase the startup cost of the Gather node. I will also try to
construct such cases and test them.

FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
the correct solution. Can you share a "combined" patch?

Actually, both patches are the same;
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is the
cleaner version of the first patch. For the configurable tuple queue
size I did not send a patch, because I used that just for testing and
never intended to propose anything. Most of my latest performance data
was collected with only
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and with
the default tuple queue size.

But I am attaching both the patches in case you want to play around.

Ah, silly me. I should have noticed the second patch is just a refined
version of the first one.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#9)
Re: Gather performance analysis

On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

But I am attaching both the patches in case you want to play around.

I don't really see any reason not to commit 0001. Perhaps some very
minor grammatical nitpicking is in order here, but apart from that I
can't really see anything to criticize with this approach. It seems
safe enough, it's not invasive in any way that matters, and we have
benchmark results showing that it works well. If someone comes up with
something even better, no harm done, we can always change it again.

Objections?

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Robert Haas (#18)
1 attachment(s)
Re: Gather performance analysis

On 9/23/21 9:31 PM, Robert Haas wrote:

On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

But I am attaching both the patches in case you want to play around.

I don't really see any reason not to commit 0001. Perhaps some very
minor grammatical nitpicking is in order here, but apart from that I
can't really see anything to criticize with this approach. It seems
safe enough, it's not invasive in any way that matters, and we have
benchmark results showing that it works well. If someone comes up with
something even better, no harm done, we can always change it again.

Objections?

Yeah, it seems like a fairly clear win, according to the benchmarks.

I did find some suspicious behavior on the bigger box I have available
(with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
pretty weird because the worst affected case is with no parallel workers
(so the queue changes should not affect it). Not sure how to explain it, but
the behavior seems consistent.

Anyway, the other machine with a single CPU seems perfectly clean.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

queue-results.odsapplication/vnd.oasis.opendocument.spreadsheet; name=queue-results.odsDownload
#20Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#19)
Re: Gather performance analysis

On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I did find some suspicious behavior on the bigger box I have available
(with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
pretty weird because the worst affected case is with no parallel workers
(so the queue changes should not affect it). Not sure how to explain it, but
the behavior seems consistent.

That is pretty odd. I'm inclined to mostly discount the runs with
10000 tuples because sending such a tiny number of tuples doesn't
really take any significant amount of time, and it seems possible that
variations in the runtime of other code due to code movement effects
could end up mattering more than the changes to the performance of
shm_mq. However, the results with a million tuples seem like they're
probably delivering statistically significant results ... and I guess
maybe what's happening is that the patch hurts when the tuples are too
big relative to the queue size.

I guess your columns are an md5 value each, which is 32 bytes +
overhead, so a 20-column tuple is ~1kB. Since Dilip's patch flushes
the value to shared memory when more than a quarter of the queue has
been filled, that probably means we flush every 4-5 tuples. I wonder
if that means we need a smaller threshold, like 1/8 of the queue size?
Or maybe the behavior should be adaptive somehow, depending on whether
the receiver ends up waiting for data? Or ... perhaps only small
tuples are worth batching, so that the threshold for posting to shared
memory should be a constant rather than a fraction of the queue size?
I guess we need to know why we see the time spike up in those cases,
if we want to improve them.
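
To put rough numbers on those options, here is a hypothetical
back-of-the-envelope calculation (standalone C; the ~1kB tuple estimate
comes from above, and the queue sizes are the ones benchmarked):

#include <stdio.h>

int
main(void)
{
	const unsigned queue_sizes[] = {16384, 65536, 262144, 1048576};
	const unsigned tuple_size = 1024;	/* ~1kB, 20 md5 columns */
	int			i;

	for (i = 0; i < 4; i++)
	{
		unsigned	q = queue_sizes[i];

		printf("queue %7u: 1/4 -> every %4u tuples, "
			   "1/8 -> every %4u tuples, fixed 4kB -> every %u tuples\n",
			   q, (q / 4) / tuple_size, (q / 8) / tuple_size,
			   4096 / tuple_size);
	}
	return 0;
}

With the default 64kB queue and ~1kB tuples, the 1/4 threshold flushes
only every ~16 tuples, while a fixed 4kB threshold would flush every ~4
regardless of queue size; that is where the constant-threshold and
fraction-of-queue behaviors diverge.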

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Robert Haas (#20)
Re: Gather performance analysis

On 9/23/21 10:31 PM, Robert Haas wrote:

On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I did find some suspicious behavior on the bigger box I have available
(with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
pretty weird because the worst affected case is with no parallel workers
(so the queue changes should not affect it). Not sure how to explain it, but
the behavior seems consistent.

That is pretty odd. I'm inclined to mostly discount the runs with
10000 tuples because sending such a tiny number of tuples doesn't
really take any significant amount of time, and it seems possible that
variations in the runtime of other code due to code movement effects
could end up mattering more than the changes to the performance of
shm_mq. However, the results with a million tuples seem like they're
probably delivering statistically significant results ... and I guess
maybe what's happening is that the patch hurts when the tuples are too
big relative to the queue size.

Agreed on 10k rows being too small; we can ignore that. And yes, binary
layout might make a difference, of course. My rule of thumb is that 5%
(in both directions) is about the difference binary layout might make,
and most results are within that range.

I guess your columns are an md5 value each, which is 32 bytes +
overhead, so a 20-column tuple is ~1kB. Since Dilip's patch flushes
the value to shared memory when more than a quarter of the queue has
been filled, that probably means we flush every 4-5 tuples. I wonder
if that means we need a smaller threshold, like 1/8 of the queue size?
Or maybe the behavior should be adaptive somehow, depending on whether
the receiver ends up waiting for data? Or ... perhaps only small
tuples are worth batching, so that the threshold for posting to shared
memory should be a constant rather than a fraction of the queue size?
I guess we need to know why we see the time spike up in those cases,
if we want to improve them.

Not sure about this, because

(a) That should affect both CPUs, I think, but i5-2500k does not have
any such issue.

(b) One thing I haven't mentioned is that I tried larger queue sizes too
(that's the 16kB, 64kB, 256kB and 1MB columns), although it's true a
larger queue improves the situation a bit.

(c) This can't explain the slowdown for cases without any Gather nodes
(and it's ~17%, so unlikely due to binary layout).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#21)
Re: Gather performance analysis

On Thu, Sep 23, 2021 at 5:36 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

(c) This can't explain the slowdown for cases without any Gather nodes
(and it's ~17%, so unlikely due to binary layout).

Yeah, but none of the modified code would even execute in those cases,
so it's either binary layout, something wrong in your test
environment, or gremlins.

--
Robert Haas
EDB: http://www.enterprisedb.com

#23Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#20)
Re: Gather performance analysis

On Fri, Sep 24, 2021 at 2:01 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I did find some suspicious behavior on the bigger box I have available
(with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
pretty weird because the worst affected case is with no parallel workers
(so the queue changes should not affect it). Not sure how to explain it, but
the behavior seems consistent.

That is pretty odd. I'm inclined to mostly discount the runs with
10000 tuples because sending such a tiny number of tuples doesn't
really take any significant amount of time, and it seems possible that
variations in the runtime of other code due to code movement effects
could end up mattering more than the changes to the performance of
shm_mq. However, the results with a million tuples seem like they're
probably delivering statistically significant results ... and I guess
maybe what's happening is that the patch hurts when the tuples are too
big relative to the queue size.

I am looking at the "queue-results.ods" file shared by Tomas. With a
million tuples I do not really see where the patch hurts, because in
most of the cases the time taken by the patch is 60-80% of that of the
head. And the worst case with a million tuples is 100.32%; are we
pointing to that 0.32%, or is there something else that I am missing
here?

I guess your columns are an md5 value each, which is 32 bytes +
overhead, so a 20-column tuple is ~1kB. Since Dilip's patch flushes
the value to shared memory when more than a quarter of the queue has
been filled, that probably means we flush every 4-5 tuples. I wonder
if that means we need a smaller threshold, like 1/8 of the queue size?
Or maybe the behavior should be adaptive somehow, depending on whether
the receiver ends up waiting for data? Or ... perhaps only small
tuples are worth batching, so that the threshold for posting to shared
memory should be a constant rather than a fraction of the queue size?
I guess we need to know why we see the time spike up in those cases,
if we want to improve them.

I will test with larger tuple sizes and see the behavior with different
thresholds. With a 250-byte tuple size I have tested different
thresholds, and 1/4 of the queue size appeared to work best. But I will
do more detailed testing and share the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#24Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#19)
Re: Gather performance analysis

On Fri, Sep 24, 2021 at 1:30 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/23/21 9:31 PM, Robert Haas wrote:

On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

But I am attaching both the patches in case you want to play around.

I don't really see any reason not to commit 0001. Perhaps some very
minor grammatical nitpicking is in order here, but apart from that I
can't really see anything to criticize with this approach. It seems
safe enough, it's not invasive in any way that matters, and we have
benchmark results showing that it works well. If someone comes up with
something even better, no harm done, we can always change it again.

Objections?

Yeah, it seems like a fairly clear win, according to the benchmarks.

I did find some suspicious behavior on the bigger box I have available
(with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
pretty weird because the worst affected case is with no parallel workers
(so the queue changes shouldn't affect it). Not sure how to explain it, but
the behavior seems consistent.

Anyway, the other machine with a single CPU seems perfectly clean.

Tomas, can you share your test script? I would like to repeat the same
test in my environment and with different batching sizes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#25Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#24)
Re: Gather performance analysis

On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Tomas, can you share your test script? I would like to repeat the same
test in my environment and with different batching sizes.

I think it's probably the run.sh file attached to
/messages/by-id/d76a759d-9240-94f5-399e-ae244e5f0285@enterprisedb.com

--
Robert Haas
EDB: http://www.enterprisedb.com

#26Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#23)
Re: Gather performance analysis

On Fri, Sep 24, 2021 at 1:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I am looking at the "query-results.ods" file shared by Tomas, and with
a million tuples I do not really see where the patch hurts, because in
most of the cases the time taken by the patch is 60-80% of head. The
worst case with a million tuples is 100.32%; are we pointing to that
0.32%, or is there something else that I am missing here?

The spreadsheet has two tabs. Flip to "xeon e5-2620v3" and scroll all
the way down.

--
Robert Haas
EDB: http://www.enterprisedb.com

#27Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Robert Haas (#22)
Re: Gather performance analysis

On 9/24/21 1:43 AM, Robert Haas wrote:

On Thu, Sep 23, 2021 at 5:36 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

(c) This can't explain the slowdown for cases without any Gather nodes
(and it's ~17%, so unlikely due to binary layout).

Yeah, but none of the modified code would even execute in those cases,
so it's either binary layout, something wrong in your test
environment, or gremlins.

I'm not going to the office all that often these days, but I'm fairly
sure I'd notice a bunch of gremlins running around ... so it's probably
the binary layout thing. But still, strange.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#28Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Robert Haas (#25)
Re: Gather performance analysis

On 9/24/21 7:08 PM, Robert Haas wrote:

On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Tomas, can you share your test script? I would like to repeat the same
test in my environment and with different batching sizes.

I think it's probably the run.sh file attached to
/messages/by-id/d76a759d-9240-94f5-399e-ae244e5f0285@enterprisedb.com

Yep. I've used a slightly modified version, which also varies the queue
size, but we can ignore that; the rest of the script is the same.

As for the config, I've used an only slightly tuned postgresql.conf, with
shared_buffers set to 1GB and increased worker limits.
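
(For anyone reproducing this: the tuple queue size is not a GUC but a
compile-time constant, so a script that varies it presumably rebuilds
with different values of the constant below.)

/* src/backend/executor/execParallel.c: 64kB on master; the 16k, 64k,
 * 256k, and 1024k runs in these results each require a rebuild with
 * this constant changed accordingly. */
#define PARALLEL_TUPLE_QUEUE_SIZE       65536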

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#28)
2 attachment(s)
Re: Gather performance analysis

On Sat, Sep 25, 2021 at 2:18 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/24/21 7:08 PM, Robert Haas wrote:

On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Tomas, can you share your test script? I would like to repeat the same
test in my environment and with different batching sizes.

For now I have tested with 1M and 10M rows, shared_buffers=16GB, and
the default batching of 1/4 of the queue size, and I can see the
performance gain is huge: the time taken with the patch is in the
range of 37-90% of master. Please refer to the attached file for more
detailed results. I could not see the regression that Tomas saw;
still, I am planning to repeat the test with different batch sizes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

results.ods (application/vnd.oasis.opendocument.spreadsheet)
cpuinfo (application/octet-stream)
#30Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#29)
1 attachment(s)
Re: Gather performance analysis

On Sun, Sep 26, 2021 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Sep 25, 2021 at 2:18 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/24/21 7:08 PM, Robert Haas wrote:

On Fri, Sep 24, 2021 at 3:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Tomas, can you share your test script? I would like to repeat the same
test in my environment and with different batching sizes.

For now I have tested with 1M and 10M rows, shared_buffers=16GB, and
the default batching of 1/4 of the queue size, and I can see the
performance gain is huge: the time taken with the patch is in the
range of 37-90% of master. Please refer to the attached file for more
detailed results. I could not see the regression that Tomas saw;
still, I am planning to repeat the test with different batch sizes.

I have done testing with different batch sizes: 16k (which is 1/4 of
the 64k queue size), 8k, 4k, and 2k.

In the attached sheet I have done a comparison of:
1. head vs patch (1/4 queue size) = execution time reduced to 37-90%
of head; this is the same as the old sheet.
2. patch (1/4 queue size) vs patch (8k batch) = mostly the same, but
the 8k batch size is slower in some cases.
3. patch (1/4 queue size) vs patch (4k batch) = mostly the same, but
the 4k batch size is slower in some cases (even slower than the 8k
batch size).
4. patch (1/4 queue size) vs patch (2k batch) = the 2k batch size is
significantly slower.

With these results, 1/4 of the queue size seems to be the winner and I
think we should go for that value; however, one could argue that the
4k batch size is optimal because it is only marginally slower and we
would have to worry less about increased latency in some worst cases.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

results.ods (application/vnd.oasis.opendocument.spreadsheet)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#30)
Re: Gather performance analysis

On Mon, Sep 27, 2021 at 1:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have done testing with different batch sizes: 16k (which is 1/4 of
the 64k queue size), 8k, 4k, and 2k.

In the attached sheet I have done a comparison of:
1. head vs patch (1/4 queue size) = execution time reduced to 37-90%
of head; this is the same as the old sheet.
2. patch (1/4 queue size) vs patch (8k batch) = mostly the same, but
the 8k batch size is slower in some cases.
3. patch (1/4 queue size) vs patch (4k batch) = mostly the same, but
the 4k batch size is slower in some cases (even slower than the 8k
batch size).
4. patch (1/4 queue size) vs patch (2k batch) = the 2k batch size is
significantly slower.

Generally these results seem to show that a larger batch size is
better than a smaller one, but we know that's not true everywhere and
under all circumstances, because some of Tomas's numbers are worse
than the unpatched cases. And I think we don't really understand the
reason for those results. Now it could be that there's no really good
reason for those results, and it's just something weird or not very
generally interesting.

On the other hand, while it's easy to see that batching can be a win
if it avoids contention, it also seems easy to imagine that it can be
a loss. By postponing the update to shared memory, we are essentially
gambling. If nobody would have read the updated value anyway, we win,
because we avoided doing work that wasn't really needed by
consolidating multiple updates of shared memory down to one. However,
imagine the scenario where someone reads a value that is small enough
that they have to block, because they think there's no more data
available. If there really is more data available and we just didn't
update shared memory, then we lose.

Here, the wins are going to be much smaller than the losses. Cache
line contention isn't cheap, but it's a lot cheaper than actually
having a process go to sleep and having to wake it up again. So
essentially the patch is betting that the winning scenario is much
more common than the losing scenario - the occasional big losses when
the reader sleeps unnecessarily will be more than counterbalanced by
the small wins every time we skip an update to shared memory without
causing that to happen.
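
(A simplified sketch of the receiver side may help picture the losing
case. This is only the shape of the logic, not the real
shm_mq_receive(); error handling and detach checks are omitted.)

/*
 * Simplified receiver wait loop.  If the sender has ring data it has
 * not yet published to mq_bytes_written, "written" is stale here, no
 * bytes appear available, and the receiver goes to sleep even though
 * tuples are sitting in the queue: the expensive losing case.
 */
for (;;)
{
    uint64      written = pg_atomic_read_u64(&mq->mq_bytes_written);
    uint64      read = pg_atomic_read_u64(&mq->mq_bytes_read);

    if (written > read)
        break;                  /* data visible, go consume it */

    /* Nothing visible: block until the sender sets our latch. */
    WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,
              WAIT_EVENT_MQ_RECEIVE);
    ResetLatch(MyLatch);
}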

And most of the time, that's probably a good bet. But, if you do
somehow hit the losing case repeatedly, then you could see a
significant regression. And that might explain Tomas's results.
Perhaps for some reason they just happen to hit that case over and
over again. If that's true, it would be useful to know why it happens
in that case and not others, because then maybe we could avoid the
problem somehow. However, I'm not sure how to figure that out, and I'm
not even entirely sure it's important to figure it out.

--
Robert Haas
EDB: http://www.enterprisedb.com

#32Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#31)
Re: Gather performance analysis

On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

And most of the time, that's probably a good bet. But, if you do
somehow hit the losing case repeatedly, then you could see a
significant regression. And that might explain Tomas's results.
Perhaps for some reason they just happen to hit that case over and
over again. If that's true, it would be useful to know why it happens
in that case and not others, because then maybe we could avoid the
problem somehow. However, I'm not sure how to figure that out, and I'm
not even entirely sure it's important to figure it out.

Yeah, if it is losing in some cases then it is definitely good to know
the reason. I was just looking into the performance numbers shared by
Tomas in query-results, and the worst case is with 10000000 rows, 10
columns, 4 threads, and a 64k queue size: the execution time with head
is ~804 ms whereas with the patch it is ~1277 ms. But then I looked at
the pattern across different queue sizes, and the numbers (in ms) are
as below:

             16k        64k       256k     1024k
Head    1232.779     804.24   1134.723   901.257
Patch   1371.493   1277.705    862.598   783.481

What I have noticed is that in most of the cases, on head as well as
with the patch, increasing the queue size makes it faster; but on head,
for this particular combination of rows, columns, and threads, the
execution time is suddenly very low at the 64k queue size, then
increases again at the 256k queue size before following the usual
pattern. So this particular dip in the execution time on head looks a
bit suspicious to me. I mean, how could we justify this sudden big dip
in execution time w.r.t. the rest of the pattern?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#33Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#32)
Re: Gather performance analysis

On Tue, Sep 28, 2021 at 12:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

And most of the time, that's probably a good bet. But, if you do
somehow hit the losing case repeatedly, then you could see a
significant regression. And that might explain Tomas's results.
Perhaps for some reason they just happen to hit that case over and
over again. If that's true, it would be useful to know why it happens
in that case and not others, because then maybe we could avoid the
problem somehow. However, I'm not sure how to figure that out, and I'm
not even entirely sure it's important to figure it out.

Yeah, if it is losing in some cases then it is definitely good to know
the reason. I was just looking into the performance numbers shared by
Tomas in query-results, and the worst case is with 10000000 rows, 10
columns, 4 threads, and a 64k queue size: the execution time with head
is ~804 ms whereas with the patch it is ~1277 ms. But then I looked at
the pattern across different queue sizes, and the numbers (in ms) are
as below:

             16k        64k       256k     1024k
Head    1232.779     804.24   1134.723   901.257
Patch   1371.493   1277.705    862.598   783.481

What I have noticed is that in most of the cases, on head as well as
with the patch, increasing the queue size makes it faster; but on head,
for this particular combination of rows, columns, and threads, the
execution time is suddenly very low at the 64k queue size, then
increases again at the 256k queue size before following the usual
pattern. So this particular dip in the execution time on head looks a
bit suspicious to me.

I concur with your observation. Isn't it possible to repeat the same
test in the same environment to verify these results?

--
With Regards,
Amit Kapila.

#34Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#33)
Re: Gather performance analysis

On 9/28/21 12:53 PM, Amit Kapila wrote:

On Tue, Sep 28, 2021 at 12:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Sep 27, 2021 at 10:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

And most of the time, that's probably a good bet. But, if you do
somehow hit the losing case repeatedly, then you could see a
significant regression. And that might explain Tomas's results.
Perhaps for some reason they just happen to hit that case over and
over again. If that's true, it would be useful to know why it happens
in that case and not others, because then maybe we could avoid the
problem somehow. However, I'm not sure how to figure that out, and I'm
not even entirely sure it's important to figure it out.

Yeah, if it is losing in some cases then it is definitely good to know
the reason. I was just looking into the performance numbers shared by
Tomas in query-results, and the worst case is with 10000000 rows, 10
columns, 4 threads, and a 64k queue size: the execution time with head
is ~804 ms whereas with the patch it is ~1277 ms. But then I looked at
the pattern across different queue sizes, and the numbers (in ms) are
as below:

             16k        64k       256k     1024k
Head    1232.779     804.24   1134.723   901.257
Patch   1371.493   1277.705    862.598   783.481

What I have noticed is that in most of the cases, on head as well as
with the patch, increasing the queue size makes it faster; but on head,
for this particular combination of rows, columns, and threads, the
execution time is suddenly very low at the 64k queue size, then
increases again at the 256k queue size before following the usual
pattern. So this particular dip in the execution time on head looks a
bit suspicious to me.

I concur with your observation. Isn't it possible to repeat the same
test in the same environment to verify these results?

I can repeat any tests we need on that machine, of course.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#35Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#34)
Re: Gather performance analysis

On Tue, Sep 28, 2021 at 5:21 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Yeah, if it is losing in some cases then it is definitely good to know
the reason. I was just looking into the performance numbers shared by
Tomas in query-results, and the worst case is with 10000000 rows, 10
columns, 4 threads, and a 64k queue size: the execution time with head
is ~804 ms whereas with the patch it is ~1277 ms. But then I looked at
the pattern across different queue sizes, and the numbers (in ms) are
as below:

             16k        64k       256k     1024k
Head    1232.779     804.24   1134.723   901.257
Patch   1371.493   1277.705    862.598   783.481

What I have noticed is that in most of the cases, on head as well as
with the patch, increasing the queue size makes it faster; but on head,
for this particular combination of rows, columns, and threads, the
execution time is suddenly very low at the 64k queue size, then
increases again at the 256k queue size before following the usual
pattern. So this particular dip in the execution time on head looks a
bit suspicious to me.

I concur with your observation. Isn't it possible to repeat the same
test in the same environment to verify these results?

I can repeat any tests we need on that machine, of course.

I think that would be great. Can we just test this specific target
where we are seeing a huge dip with the patch, e.g. with 10000000
rows, 10 columns, 4 threads, and a 64k queue size? On my performance
machine I tried to run this test multiple times, but on head it takes
~2000 ms whereas with the patch it is ~1500 ms, so I am not able to
reproduce the dip. It would be good if you could run only this
specific test and repeat it a couple of times on your performance
machine.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#36Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#32)
Re: Gather performance analysis

On Tue, Sep 28, 2021 at 2:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

             16k        64k       256k     1024k
Head    1232.779     804.24   1134.723   901.257
Patch   1371.493   1277.705    862.598   783.481

What I have noticed is that in most of the cases, on head as well as
with the patch, increasing the queue size makes it faster; but on head,
for this particular combination of rows, columns, and threads, the
execution time is suddenly very low at the 64k queue size, then
increases again at the 256k queue size before following the usual
pattern. So this particular dip in the execution time on head looks a
bit suspicious to me. I mean, how could we justify this sudden big dip
in execution time w.r.t. the rest of the pattern?

Oh, interesting. So there's not really a performance regression here
so much as that one particular case ran exceptionally fast on the
unpatched code.

--
Robert Haas
EDB: http://www.enterprisedb.com

#37Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Dilip Kumar (#35)
3 attachment(s)
Re: Gather performance analysis

On 9/28/21 14:00, Dilip Kumar wrote:

I think that would be great. Can we just test this specific target
where we are seeing a huge dip with the patch, e.g. with 10000000
rows, 10 columns, 4 threads, and a 64k queue size? On my performance
machine I tried to run this test multiple times, but on head it takes
~2000 ms whereas with the patch it is ~1500 ms, so I am not able to
reproduce the dip. It would be good if you could run only this
specific test and repeat it a couple of times on your performance
machine.

I ran the benchmark again with 10 runs instead of 5; the results and
scripts are attached. It seems the worst case got much better and is
now in line with the rest of the results, so it probably was a
coincidence.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

run.sh (application/x-shellscript)
results.ods (application/vnd.oasis.opendocument.spreadsheet)
postgresql.conf (text/plain; charset=UTF-8)
#38Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#37)
Re: Gather performance analysis

On Tue, Oct 12, 2021 at 6:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 9/28/21 14:00, Dilip Kumar wrote:

I think that would be great. Can we just test this specific target
where we are seeing a huge dip with the patch, e.g. with 10000000
rows, 10 columns, 4 threads, and a 64k queue size? On my performance
machine I tried to run this test multiple times, but on head it takes
~2000 ms whereas with the patch it is ~1500 ms, so I am not able to
reproduce the dip. It would be good if you could run only this
specific test and repeat it a couple of times on your performance
machine.

I ran the benchmark again with 10 runs instead of 5; the results and
scripts are attached. It seems the worst case got much better and is
now in line with the rest of the results, so it probably was a
coincidence.

Thanks, yeah now it looks in line with other results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#39Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#38)
Re: Gather performance analysis

On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks, yeah now it looks in line with other results.

Since it seems there are no remaining concerns here, and we have
benchmarking results showing that the patch helps, I have committed
the patch.

I wonder whether the new code in shm_mq_send_bytes() should guard
against calling shm_mq_inc_bytes_written() with a second argument of
0, or alternatively whether shm_mq_inc_bytes_written() should have an
internal defense against that. It might save some writes to shared
memory, but it would also add a branch, which isn't free, either.
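
(To illustrate the callee-side variant: an approximation of the
function with the internal defense added, not an actual patch.)

/*
 * Approximate shape of shm_mq_inc_bytes_written() with the proposed
 * guard: skip the barrier and the shared-memory write entirely when
 * there is nothing to add, at the cost of one extra branch per call.
 */
static void
shm_mq_inc_bytes_written(shm_mq *mq, Size n)
{
    if (n == 0)
        return;

    /* Make preceding ring-buffer writes visible before the counter. */
    pg_write_barrier();

    /* Only the sender updates this counter, so an unlocked
     * read-then-write of the atomic is safe. */
    pg_atomic_write_u64(&mq->mq_bytes_written,
                        pg_atomic_read_u64(&mq->mq_bytes_written) + n);
}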

I also think that, as a followup action item, we need to reassess
parallel_tuple_cost.
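
(For context, parallel_tuple_cost is what the planner charges for every
row expected to flow from workers to the leader through the queue,
roughly the following from cost_gather() in costsize.c, with 0.1 as the
current default.)

/* Roughly how a Gather is costed: a one-time setup charge plus a
 * per-tuple queue-transfer charge. */
startup_cost += parallel_setup_cost;
run_cost += parallel_tuple_cost * path->path.rows;

If the queue is now cheaper than the default assumes, Gather plans are
being overcharged and parallel plans chosen less often than they should
be.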

--
Robert Haas
EDB: http://www.enterprisedb.com

#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#39)
Re: Gather performance analysis

On Fri, Oct 15, 2021 at 1:48 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks, yeah now it looks in line with other results.

Since it seems there are no remaining concerns here, and we have
benchmarking results showing that the patch helps, I have committed
the patch.

Can we mark the corresponding CF entry [1] as committed?

[1] - https://commitfest.postgresql.org/35/3304/

--
With Regards,
Amit Kapila.

#41Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#40)
Re: Gather performance analysis

On Tue, Oct 26, 2021 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 15, 2021 at 1:48 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 12, 2021 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thanks, yeah now it looks in line with other results.

Since it seems there are no remaining concerns here, and we have
benchmarking results showing that the patch helps, I have committed
the patch.

Can we mark the corresponding CF entry [1] as committed?

[1] - https://commitfest.postgresql.org/35/3304/

Done!

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com