Gather performance analysis
Hi,
I have been analyzing the performance of sending tuples from workers to the
Gather node through the tuple queue. In the past there were many off-list
discussions around this area; basically, the main point is that when
"shm_mq" was implemented, it was perhaps one of the best ways to do this
at the time. But now we have other choices, like DSA for allocating shared
memory on demand, or shared temporary files for a non-blocking tuple queue.
So my motivation for looking into this area is that, now that we have more
flexible alternatives, can we use them to make Gather faster? And if so:
1. Can we actually reduce the tuple transfer cost and enable
parallelism in more cases by reducing parallel_tuple_cost?
2. Can we use the tuple queue in more places, e.g., to implement a
Redistribute operator where we need to transfer data between the workers?
IMHO, for #1 it will be good enough if we can make the tuple transfer
faster, but for #2 we will need a) faster tuple transfer, because then we
will have to transfer tuples between the workers as well, and b) an
unbounded, non-blocking tuple queue (maybe using shared temp files) so
that there is no deadlock while workers are redistributing tuples to each
other.
So I have done some quick performance tests and analysis using perf,
and some experiments with small prototypes targeting different sets of
problems.
--Setup
SET parallel_tuple_cost TO 0 -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;
Perf analysis: Gather Node
- 43.57% shm_mq_receive
- 78.94% shm_mq_receive_bytes
- 91.27% pg_atomic_read_u64
- pg_atomic_read_u64_impl
- apic_timer_interrupt
smp_apic_timer_interrupt
Perf analysis: Worker Node
- 99.14% shm_mq_sendv
- 74.10% shm_mq_send_bytes
+ 42.35% shm_mq_inc_bytes_written
- 32.56% pg_atomic_read_u64
- pg_atomic_read_u64_impl
- 86.27% apic_timer_interrupt
+ 17.93% WaitLatch
From the perf results and also from the code analysis, I can think of
two main problems here:
1. Synchronization between the worker and the Gather node: just to track
the bytes written and read, they need at least 2-3 atomic operations for
each tuple, and I think that carries a huge penalty due to a) frequent
cache line invalidation and b) the sheer number of atomic operations.
2. If the tuple queue is full, the worker might need to wait for the
Gather node to consume tuples.
Experiment #1:
As part of this experiment, I have modified the sender to keep local
copies of "mq_bytes_read" and "mq_bytes_written" in the local mqh handle
so that we don't need to frequently read/write these cache-sensitive
shared memory variables. So now we only read/write from the shared memory
under the conditions below (a toy sketch of the logic follows the list):
1) If the number of available bytes is not enough to send the tuple,
read the updated value of bytes read and also inform the reader about
the new writes.
2) After every 4kB written, update the shared memory variable and
inform the reader.
3) On detach, flush any remaining data.
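To make the above concrete, here is a toy, single-process C sketch of the
batched-flush idea. This is NOT the actual shm_mq code; the struct, the
function names, and the RING_SIZE/4 threshold are simplifications for
illustration only (the real queue lives in shared memory and the reader is
another process):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define RING_SIZE 1024

typedef struct toy_mq
{
    _Atomic unsigned long bytes_read;     /* advanced by the reader */
    _Atomic unsigned long bytes_written;  /* advanced by the writer */
    char ring[RING_SIZE];
} toy_mq;

typedef struct toy_mq_handle
{
    toy_mq *mq;
    unsigned long last_seen_read;   /* cached copy of mq->bytes_read */
    unsigned long pending;          /* bytes written but not yet published */
} toy_mq_handle;

/* Publish pending bytes; the real code would also SetLatch() the receiver. */
static void
toy_mq_flush(toy_mq_handle *mqh)
{
    if (mqh->pending == 0)
        return;
    atomic_fetch_add(&mqh->mq->bytes_written, mqh->pending);
    mqh->pending = 0;
}

static bool
toy_mq_send(toy_mq_handle *mqh, const char *data, unsigned long len, bool force_flush)
{
    toy_mq *mq = mqh->mq;
    unsigned long written = atomic_load(&mq->bytes_written) + mqh->pending;
    unsigned long used = written - mqh->last_seen_read;

    /* Condition 1: only if space looks insufficient do we flush and re-read
     * the shared counter of bytes read. */
    if (RING_SIZE - used < len)
    {
        toy_mq_flush(mqh);
        mqh->last_seen_read = atomic_load(&mq->bytes_read);
        used = written - mqh->last_seen_read;
        if (RING_SIZE - used < len)
            return false;           /* still full; real code would wait */
    }

    memcpy(&mq->ring[written % RING_SIZE], data, len);  /* ignores wraparound */
    mqh->pending += len;

    /* Condition 2: publish only after a decent batch has accumulated;
     * Condition 3: force_flush is used on detach. */
    if (force_flush || mqh->pending > RING_SIZE / 4)
        toy_mq_flush(mqh);
    return true;
}

int
main(void)
{
    toy_mq mq;
    toy_mq_handle mqh = {&mq, 0, 0};

    atomic_init(&mq.bytes_read, 0);
    atomic_init(&mq.bytes_written, 0);
    for (int i = 0; i < 10; i++)
        toy_mq_send(&mqh, "0123456789", 10, false);
    toy_mq_send(&mqh, "tail", 4, true);     /* e.g. on detach */
    printf("published %lu bytes\n", atomic_load(&mq.bytes_written));
    return 0;
}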
Machine information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms
2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms
3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms
Observation:
- As expected, the results show that forcing parallelism (by reducing
parallel_tuple_cost) drastically hurts performance.
- But in the same scenario, with the patch, we can see a huge gain of ~40%.
- Even if we compare it with the non-parallel plan, we gain ~25%.
- With this, I think we can conclude that there is huge potential for
improvement if we communicate the tuples in batches. 1) One simple
approach is what I used in my experiment; I think we can do a similar
optimization in the reader as well, i.e., instead of reading
bytes_written from shared memory every time, remember the previous value
and only re-read the shared memory once that value is exhausted.
2) Instead of copying the whole tuple into the tuple queue, we could
store the dsa_pointers of a batch of tuples (a rough sketch follows
below); I think Thomas Munro also suggested a similar approach to
Robert, which I learned in an off-list discussion with Robert.
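For idea 2), here is a rough, untested sketch of what pushing dsa_pointers
instead of whole tuples could look like. It uses the existing DSA and
shm_mq APIs, but the helper functions and the overall scheme are only
assumptions for illustration, not something from a patch:

#include "postgres.h"

#include "access/htup_details.h"
#include "storage/shm_mq.h"
#include "utils/dsa.h"

/* Worker side: copy the tuple into DSA and queue only the 8-byte pointer. */
static void
send_tuple_via_dsa(dsa_area *area, shm_mq_handle *mqh, MinimalTuple tuple)
{
    dsa_pointer dp = dsa_allocate(area, tuple->t_len);

    memcpy(dsa_get_address(area, dp), tuple, tuple->t_len);
    /* The queue payload is now tiny, so the ring buffer drains quickly. */
    (void) shm_mq_send(mqh, sizeof(dsa_pointer), &dp, false);
}

/* Leader side: receive the pointer and map it back to a tuple. */
static MinimalTuple
receive_tuple_via_dsa(dsa_area *area, shm_mq_handle *mqh, dsa_pointer *dp_out)
{
    Size  nbytes;
    void *data;

    if (shm_mq_receive(mqh, &nbytes, &data, false) != SHM_MQ_SUCCESS)
        return NULL;
    memcpy(dp_out, data, sizeof(dsa_pointer));
    /* Caller must dsa_free(area, *dp_out) once the tuple has been consumed. */
    return (MinimalTuple) dsa_get_address(area, *dp_out);
}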
Experiment #2: Observe the behavior of increasing the parallel tuple queue
size on head
(for this I created a small patch to make the parallel tuple queue size
configurable)
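A hypothetical sketch of what such a patch could look like (the GUC name,
limits, and exact call sites below are assumptions, since the test patch
is not attached here):

/* Hypothetical GUC variable, in kB; 64 matches PARALLEL_TUPLE_QUEUE_SIZE. */
int parallel_tuple_queue_size = 64;

/* guc.c: hypothetical entry in ConfigureNamesInt[] */
{
    {"parallel_tuple_queue_size", PGC_USERSET, RESOURCES_MEM,
        gettext_noop("Size of the tuple queue between a parallel worker and the leader."),
        NULL,
        GUC_UNIT_KB
    },
    &parallel_tuple_queue_size,
    64, 64, 16 * 1024,
    NULL, NULL, NULL
},

/* execParallel.c: size the per-worker queues from the GUC instead of the
 * hard-coded PARALLEL_TUPLE_QUEUE_SIZE constant (the shm_toc estimate
 * would need the same change). */
mq = shm_mq_create(tqueuespace + ((Size) i) * parallel_tuple_queue_size * 1024,
                   (Size) parallel_tuple_queue_size * 1024);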
-- Results
4 WORKERS (tup_queue size= 64kB) : 38337.046 ms
4 WORKERS (tup_queue size= 1MB) : 36186.883 ms
4 WORKERS (tup_queue size= 4MB) : 36252.740 ms
8 WORKERS (tup_queue size= 64kB) : 42296.731 ms
8 WORKERS (tup_queue size= 1MB) : 37403.872 ms
8 WORKERS (tup_queue size= 4MB) : 39184.319 ms
16 WORKERS (tup_queue size= 64kB) : 42726.139 ms
16 WORKERS (tup_queue size= 1MB) : 36219.975 ms
16 WORKERS (tup_queue size= 4MB) : 39117.109 ms
Observation:
- There are some gains from increasing the tuple queue size, but they are
limited to about 1MB; I even tried with more data, but the gain is not
linear and performance starts to drop beyond 4MB.
- If I apply both the Experiment #1 and Experiment #2 patches together,
we can further reduce the execution time to 20963.539 ms (with 4
workers and a 4MB tuple queue size).
Conclusion:
From the above experiments,
1) I see huge potential in the first idea, so maybe we can do more
experiments based on the prototype implemented there, extend the same
approach to the reader, and also try out the dsa_pointers idea.
2) With the second idea (tuple queue size), I see some benefit, but it
does not scale, so for now there is not much point in pursuing this
direction. However, if we want to implement a Redistribute operator in
the future, an unbounded tuple queue (maybe using temp files) will be a
must to avoid deadlocks.
Note: POC patches are not attached; I will send them after some more
experiments and cleanup. Maybe I will also try to optimize the reader
part before sending them.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 6, 2021 at 2:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Here is the POC patch for Experiment #1 described above; apart from this
extreme case, I am able to see improvement with this patch for normal
parallel queries as well.
Next, I will perform some more tests with different sets of queries to
see the improvements and post the results. I will also try to optimize
the reader along similar lines.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
POC-0001-Optimize-shm_mq_send_bytes.patch
On Fri, Aug 6, 2021 at 4:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Results: (query EXPLAIN ANALYZE SELECT * FROM t;)
1) Non-parallel (default)
Execution Time: 31627.492 ms
2) Parallel with 4 workers (force by setting parallel_tuple_cost to 0)
Execution Time: 37498.672 ms
3) Same as above (2) but with the patch.
Execution Time: 23649.287 ms
This strikes me as an amazingly good result. I guess before seeing
these results, I would have said that you can't reasonably expect
parallel query to win on a query like this because there isn't enough
for the workers to do. It's not like they are spending time evaluating
filter conditions or anything like that - they're just fetching tuples
off of disk pages and sticking them into a queue. And it's unclear to
me why it should be better to have a bunch of processes doing that
instead of just one. I would have thought, looking at just (1) and
(2), that parallelism gained nothing and communication overhead lost 6
seconds.
But what this suggests is that parallelism gained at least 8 seconds,
and communication overhead lost at least 14 seconds. In fact...
- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)
...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 24, 2021 at 8:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
I would have thought, looking at just (1) and (2), that parallelism
gained nothing and communication overhead lost 6 seconds. But what this
suggests is that parallelism gained at least 8 seconds, and communication
overhead lost at least 14 seconds. In fact...
Right, good observation.
- If I apply both Experiment#1 and Experiment#2 patches together then,
we can further reduce the execution time to 20963.539 ms (with 4
workers and 4MB tuple queue size)
...this suggests that parallelism actually gained at least 10-11
seconds, and the communication overhead lost at least 15-16 seconds.
Yes
If that's accurate, it's pretty crazy. We might need to drastically
reduce the value of parallel_tuple_cost if these results hold up and
this patch gets committed.
In one of my experiments [Test1] I have noticed that even on head the
forced parallel plan is significantly faster compared to the non-parallel
plan, and with the patch it is even better. The point is that there may
already be cases where forced parallel plans are faster, but we are not
sure whether we can reduce parallel_tuple_cost or not. With the patch,
however, it is clear that the parallel tuple queue is faster compared to
what we have now, so I agree we should consider reducing
parallel_tuple_cost after this patch.
Additionally, I've done some more experiments with artificial workloads, as
well as workloads where the parallel plan is selected by default, and in
all cases I've seen a significant improvement. The gain is directly
proportional to the load on the tuple queue, as expected.
Test1: (Workers return all tuples, but only a few tuples are returned to the client)
----------------------------------------------------
INSERT INTO t SELECT i%10, repeat('a', 200) from
generate_series(1,200000000) as i;
set max_parallel_workers_per_gather=4;
Target Query: SELECT random() FROM t GROUP BY a;
Non-parallel (default plan): 77170.421 ms
Parallel (parallel_tuple_cost=0): 53794.324 ms
Parallel with patch (parallel_tuple_cost=0): 42567.850 ms
20% gain compared to the forced parallel plan, 45% gain compared to the default plan.
Test2: (Parallel case with default parallel_tuple_cost)
----------------------------------------------
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000)
as i;
set max_parallel_workers_per_gather=4;
SELECT * from t WHERE a < 17500000;
Parallel(default plan): 23730.054 ms
Parallel with patch (default plan): 21614.251 ms
8 to 10 % gain compared to the default parallel plan.
I have done cleanup in the patch and I will add this to the September
commitfest.
I am planning to do further testing to identify the optimal batch size
in different workloads. With the above workload I am seeing similar results
with batch sizes from 4kB to 16kB (1/4 of the ring size), so in the attached
patch I have kept it at 1/4 of the ring size. We might change that based on
more analysis and testing.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch
Hi,
Some minor comments.
For shm_mq.c, existing comment says:
* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;
I wonder if mqh_send_pending should be declared
after mqh_length_word_complete - this way, the order of fields matches the
order of explanation for the fields.
+ if (mqh->mqh_send_pending > mq->mq_ring_size / 4 || force_flush)
The above can be written as:
+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 1))
so that when force_flush is true, the other condition is not evaluated.
Cheers
There was a typo in the suggested code above. It should be:
+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
Cheers
Hi,
The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.
Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.
Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?
As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example, with the batching
(updating the counter only once every 4kB), data is transferred in
larger chunks with higher latency. So what if the query needs only a
small chunk, like a LIMIT query? Similarly, this might mean the upper
parts of the plan have to wait longer for the data, and thus can't start
some async operation (like sending it to an FDW, or something like
that). I do admit those are theoretical queries; I haven't tried
creating such a query.
FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what the
correct solution is. Can you share a "combined" patch?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2021-08-06 14:00:48 +0530, Dilip Kumar wrote:
--Setup
SET parallel_tuple_cost TO 0 -- to test parallelism in the extreme case
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
ANALYZE t;
Test query: EXPLAIN ANALYZE SELECT * FROM t;

Perf analysis: Gather Node
- 43.57% shm_mq_receive
- 78.94% shm_mq_receive_bytes
- 91.27% pg_atomic_read_u64
- pg_atomic_read_u64_impl
- apic_timer_interrupt
smp_apic_timer_interrupt

Perf analysis: Worker Node
- 99.14% shm_mq_sendv
- 74.10% shm_mq_send_bytes
+ 42.35% shm_mq_inc_bytes_written
- 32.56% pg_atomic_read_u64
- pg_atomic_read_u64_impl
- 86.27% apic_timer_interrupt
+ 17.93% WaitLatch

From the perf results and also from the code analysis I can think of
two main problems here
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls should
be inlined. And while perf can reconstruct inlined functions when using
--call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)" for me.
FWIW, I see times like this
postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Gather (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Parallel Seq Scan on t (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
│ Planning Time: 0.043 ms │
│ Execution Time: 24954.012 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)
Looking at a profile I see the biggest bottleneck in the leader (which is the
bottleneck as soon as the worker count is increased) to be reading the length
word of the message. I do see shm_mq_receive_bytes() in the profile, but the
costly part there is the "read % (uint64) ringsize" - divisions are slow. We
could just compute a mask instead of the size.
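A minimal illustration of that suggestion (the variable names here are
mine, and it assumes the ring size were forced to a power of two, which
it currently is not):

/* Precompute once, e.g. when the queue is created or attached: */
uint64 mq_ring_mask = mq_ring_size - 1;   /* only valid if mq_ring_size == 2^k */

/* Then the hot path becomes a cheap AND instead of a 64-bit division: */
offset = rb % (uint64) ringsize;          /* before */
offset = rb & mq_ring_mask;               /* after  */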
We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly the
ringsize (or mask).
Greetings,
Andres Freund
On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
Hi,
The numbers presented in this thread seem very promising - clearly
there's significant potential for improvements. I'll run similar
benchmarks too, to get a better understanding of this.
Thanks for showing interest.
Can you share some basic details about the hardware you used?
Particularly the CPU model - I guess this might explain some of the
results, e.g. if CPU caches are ~1MB, that'd explain why setting
tup_queue_size to 1MB improves things, but 4MB is a bit slower.
Similarly, number of cores might explain why 4 workers perform better
than 8 or 16 workers.
I have attached the output of lscpu. I think batching the data before
updating the shared memory will win because we avoid frequent cache
misses, and IMHO the benefit will be larger on machines with more CPU
sockets.
Now, this is mostly expected, but the consequence is that maybe things
like queue size should be tunable/dynamic, not hard-coded?
Actually, my intention behind varying the tuple queue size was just to see
the behavior: do we really have a problem of workers stalling on the queue
while sending tuples? The perf report showed some load on WaitLatch on the
worker side, so I did this experiment. I saw some benefit, but it was not
really huge. I am not sure whether we want to just increase the tuple
queue size or make it tunable, but if we want to support a Redistribute
operator sometime in the future then maybe we should make it grow
dynamically at runtime, maybe using DSA or DSA + shared files.
As for the patches, I think the proposed changes are sensible, but I
wonder what queries might get slower. For example with the batching
(updating the counter only once every 4kB, that pretty much transfers
data in larger chunks with higher latency. So what if the query needs
only a small chunk, like a LIMIT query? Similarly, this might mean the
upper parts of the plan have to wait for the data for longer, and thus
can't start some async operation (like send them to a FDW, or something
like that). I do admit those are theoretical queries, I haven't tried
creating such query.
Yeah, I was thinking about such cases. Basically, this design can increase
the startup cost of the Gather node; I will also try to construct such
cases and test them.
FWIW I've tried applying both patches at the same time, but there's a
conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
the correct solution. Can you share a "combined" patch?
Actually, those two patches are the same thing;
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is just a
cleaner version of the first (POC) patch. For the configurable tuple
queue size I did not send a patch, because I only used that for testing
and never intended to propose it. Most of my latest performance data was
collected with only
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and the
default tuple queue size.
But I am attaching both patches in case you want to play around with them.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch
v1-0002-poc-test-parallel_tuple_queue_size.patch
On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
Looking at this profile made me wonder if this was a build without
optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
should be inlined. And while perf can reconstruct inlined functions when
using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
for me.
Yeah, for profiling generally I build without optimizations so that I can
see all the functions in the stack, so the profile results are from a
build without optimizations, but the performance results are from an
optimized build.
FWIW, I see times like this
postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
Gather (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -> Parallel Seq Scan on t (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3)
Planning Time: 0.043 ms
Execution Time: 24954.012 ms
Is this with or without the patch? I mean, can we see a comparison of
whether the patch improves anything in your environment?
Looking at a profile I see the biggest bottleneck in the leader (which is
the bottleneck as soon as the worker count is increased) to be reading the
length word of the message. I do see shm_mq_receive_bytes() in the profile,
but the costly part there is the "read % (uint64) ringsize" - divisions are
slow. We could just compute a mask instead of the size.
Yeah, that could be done; I can test this change as well to see how much
we gain from it.
We also should probably split the read-mostly data in shm_mq (ring_size,
detached, ring_offset, receiver, sender) into a separate cacheline from the
read/write data. Or perhaps copy more info into the handle, particularly
the ringsize (or mask).
Good suggestion, I will do some experiments around this.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2021-09-08 11:45:16 +0530, Dilip Kumar wrote:
Yeah, for profiling generally I build without optimizations so that I can
see all the functions in the stack, so the profile results are from a
build without optimizations, but the performance results are from an
optimized build.
I'm afraid that makes the profiles just about meaningless :(.
Is this with or without patch, I mean can we see a comparison that patch
improved anything in your environment?
It was without any patches. I'll try the patch in a bit.
Greetings,
Andres Freund
On Wed, Sep 8, 2021 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
I'm afraid that makes the profiles just about meaningless :(.
Maybe it can be misleading sometimes, but I feel it is sometimes more
informative than the optimized build, where some functions get inlined
and it becomes really hard to distinguish which function really has the
problem. But your point is taken and I will run with an optimized build.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 9/8/21 9:40 AM, Dilip Kumar wrote:
Maybe it can be misleading sometimes, but I feel it is sometimes more
informative than the optimized build, where some functions get inlined
and it becomes really hard to distinguish which function really has the
problem. But your point is taken and I will run with an optimized build.
IMHO Andres is right that such profiles may be mostly useless in most
cases - the lack of optimization may skew timings for different parts
differently, so something that would be optimized out may take much more
time.
It may provide valuable insights, but we definitely should not use such
binaries for benchmarking and comparisons of the patches.
As mentioned, I did some benchmarks, and I do see some nice improvements
even with properly optimized builds -O2.
Attached is a simple script that varies a bunch of parameters (number of
workers, number of rows/columns, ...) and then measures duration of a
simple query, similar to what you did. I haven't varied the queue size,
that might be interesting too.
The PDF shows a comparison of master and the two patches. For 10k rows
there's not much difference, but for 1M and 10M rows there are some nice
improvements in the 20-30% range. Of course, it's just a single query in
a simple benchmark.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Sep 8, 2021 at 3:28 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
IMHO Andres is right that such profiles may be mostly useless in most
cases - the lack of optimization may skew timings for different parts
differently, so something that would be optimized out may take much more
time. It may provide valuable insights, but we definitely should not use
such binaries for benchmarking and comparisons of the patches.
Yeah, I completely agree that those binaries should not be used for
benchmarking and patch comparison, and I never used them for that purpose.
I was also making the same point, that with debug binaries we sometimes
get valuable insight during profiling.
Thanks for the benchmarking.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 8, 2021 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
Based on various suggestions, I have done some more experiments with the patch.
1) I have measured cache misses, and I see a ~20% reduction in the
cache-miss ratio with the patch (which updates the shared memory counter
only after a certain amount of data has been written).
command: perf stat -e
cycles,instructions,cache-references,cache-misses -p <receiver-pid>
Head:
13,918,480,258 cycles
21,082,968,730 instructions # 1.51 insn per
cycle
13,206,426 cache-references
12,432,402 cache-misses # 94.139 % of all
cache refs
Patch:
14,119,691,844 cycles
29,497,239,984 instructions # 2.09 insn per
cycle
4,245,819 cache-references
3,085,047 cache-misses # 72.661 % of all cache refs
I have taken multiple samples with different execution times, and I can
see that the cache-miss ratio with the patch is 72-74%, whereas without
the patch it is 92-94%. So, as expected, these results clearly show we
are saving a lot by avoiding cache misses.
2) As pointed out by Tomas, I have tried different test cases where this
patch could regress performance
CREATE TABLE t (a int, b varchar);
INSERT INTO t SELECT i, repeat('a', 200) from generate_series(1,200000000) as i;
set enable_gathermerge=off;
Query: select * from t1 where a < 100000 order by a;
Plan:
Sort (cost=1714422.10..1714645.24 rows=89258 width=15)
-> Gather (cost=1000.00..1707082.55 rows=89258 width=15)
-> Parallel Seq Scan on t1 (cost=0.00..1706082.55
rows=22314 width=15)
Filter: (a < 100000)
So the idea is that without the patch the tuples should reach the Sort
node immediately, whereas with the patch there would be some delay before
we send the tuples to the Gather node because of batching. With this too,
I did not notice any consistent regression with the patch; however, with
EXPLAIN ANALYZE I noticed a 2-3% drop with the patch.
3. I tried some other optimizations pointed out by Andres:
a) Separating read-only and read-write data in shm_mq, and also moving
some fields out of shm_mq:
struct shm_mq (after change)
{
    /* mostly read-only fields */
    PGPROC     *mq_receiver;
    PGPROC     *mq_sender;
    bool        mq_detached;
    slock_t     mq_mutex;

    /* read-write fields */
    pg_atomic_uint64 mq_bytes_read;
    pg_atomic_uint64 mq_bytes_written;

    char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
};
Note: mq_ring_size and mq_ring_offset moved to shm_mq_handle.
I did not see any extra improvement with this idea.
4. Another thought about changing the "mq_ring_size" modulo to a mask
- I think this could improve something, but currently "mq_ring_size" is
not a power of two, so we cannot convert it to a mask directly (a
hypothetical workaround is sketched below).
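One hypothetical workaround (not something I have implemented or tested):
round the ring portion down to a power of two when the queue is created,
wasting at most a few bytes, so that the modulo can become a mask:

/* Hypothetical tweak in shm_mq_create(); names follow the existing code,
 * but the rounding itself is just an idea. */
#include "port/pg_bitutils.h"

size -= MAXALIGN(offsetof(shm_mq, mq_ring));         /* bytes available for the ring */
size = ((Size) 1) << pg_leftmost_one_pos64(size);    /* round down to 2^k */
mq->mq_ring_size = size;                             /* then mask = size - 1 */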
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 28, 2021 at 5:04 PM Zhihong Yu <zyu@yugabyte.com> wrote:
* mqh_partial_bytes, mqh_expected_bytes, and mqh_length_word_complete
+ Size mqh_send_pending;
bool mqh_length_word_complete;
bool mqh_counterparty_attached;
I wonder if mqh_send_pending should be declared after
mqh_length_word_complete - this way, the order of fields matches the
order of explanation for the fields.
Moved it after mqh_consume_pending and moved comment as well in the
correct order.
There was a typo in suggested code above. It should be:
+ if (force_flush || mqh->mqh_send_pending > (mq->mq_ring_size >> 2))
Done
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
v2-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch
On 9/8/21 8:05 AM, Dilip Kumar wrote:
Actually, my intention behind varying the tuple queue size was just to see
the behavior: do we really have a problem of workers stalling on the queue
while sending tuples? The perf report showed some load on WaitLatch on the
worker side, so I did this experiment. I saw some benefit, but it was not
really huge. I am not sure whether we want to just increase the tuple
queue size or make it tunable, but if we want to support a Redistribute
operator sometime in the future then maybe we should make it grow
dynamically at runtime, maybe using DSA or DSA + shared files.
Thanks. I ran a couple more benchmarks, with different queue sizes
(16kB, 64kB, 256kB and 1MB) and according to the results the queue size
really makes almost no difference. It might make a difference for some
queries, but I wouldn't bother tuning this until we have a plausible
example - increasing the queue size is not free either.
So it was worth checking, but I'd just leave it as 64kB for now. We may
revisit this later in a separate patch/thread.
Actually, those two patches are the same thing;
"v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is just a
cleaner version of the first (POC) patch. For the configurable tuple
queue size I did not send a patch, because I only used that for testing
and never intended to propose it. But I am attaching both patches in
case you want to play around with them.
Ah, silly me. I should have noticed the second patch is just a refined
version of the first one.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
But I am attaching both the patches in case you want to play around.
I don't really see any reason not to commit 0001. Perhaps some very
minor grammatical nitpicking is in order here, but apart from that I
can't really see anything to criticize with this approach. It seems
safe enough, it's not invasive in any way that matters, and we have
benchmark results showing that it works well. If someone comes up with
something even better, no harm done, we can always change it again.
Objections?
--
Robert Haas
EDB: http://www.enterprisedb.com
On 9/23/21 9:31 PM, Robert Haas wrote:
On Wed, Sep 8, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
But I am attaching both the patches in case you want to play around.
I don't really see any reason not to commit 0001. Perhaps some very
minor grammatical nitpicking is in order here, but apart from that I
can't really see anything to criticize with this approach. It seems
safe enough, it's not invasive in any way that matters, and we have
benchmark results showing that it works well. If someone comes up with
something even better, no harm done, we can always change it again.
Objections?
Yeah, it seems like a fairly clear win, according to the benchmarks.
I did find some suspicious behavior on the bigger box I have available
(with 2x Xeon E5-2620 v3), see the attached spreadsheet. But it seems
pretty weird, because the worst-affected case is the one with no parallel
workers (so the queue changes should not affect it). Not sure how to
explain it, but the behavior seems consistent.
Anyway, the other machine with a single CPU seems perfectly clean.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 23, 2021 at 4:00 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
I did find some suspicious behavior on the bigger box I have available
(with 2x xeon e5-2620v3), see the attached spreadsheet. But it seems
pretty weird because the worst affected case is with no parallel workers
(so the queue changes should affect it). Not sure how to explain it, but
the behavior seems consistent.
That is pretty odd. I'm inclined to mostly discount the runs with
10000 tuples because sending such a tiny number of tuples doesn't
really take any significant amount of time, and it seems possible that
variations in the runtime of other code due to code movement effects
could end up mattering more than the changes to the performance of
shm_mq. However, the results with a million tuples seem like they're
probably delivering statistically significant results ... and I guess
maybe what's happening is that the patch hurts when the tuples are too
big relative to the queue size.
I guess your columns are an md5 value each, which is 32 bytes +
overhead, so a 20-column tuple is ~1kB. Since Dilip's patch flushes
the value to shared memory when more than a quarter of the queue has
been filled, that probably means we flush every 4-5 tuples. I wonder
if that means we need a smaller threshold, like 1/8 of the queue size?
Or maybe the behavior should be adaptive somehow, depending on whether
the receiver ends up waiting for data? Or ... perhaps only small
tuples are worth batching, so that the threshold for posting to shared
memory should be a constant rather than a fraction of the queue size?
I guess we need to know why we see the time spike up in those cases,
if we want to improve them.
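Purely to illustrate those options, here is a sketch of what the flush
decision could look like. Nothing like this is in the posted patch:
mqh_send_pending comes from the v2 patch, while the 1/8 fraction, the
absolute cap, and the adaptive test are hypothetical variants of the
ideas above.

/*
 * Sketch of a helper deciding when the sender should publish its pending
 * bytes.  (a) is a smaller fixed fraction, (b) caps the batch at an
 * absolute size so large tuples don't defeat batching, (c) is a crude
 * adaptive rule: flush whenever the receiver appears to have drained
 * everything previously published, so it never sits idle.
 */
static inline bool
should_flush_pending(shm_mq *mq, shm_mq_handle *mqh, bool force_flush)
{
    Size fraction = mq->mq_ring_size / 8;                 /* (a) */
    Size threshold = Min(fraction, (Size) 8192);          /* (b) */

    if (force_flush || mqh->mqh_send_pending > threshold)
        return true;

    /* (c) receiver has consumed everything we have published so far */
    if (pg_atomic_read_u64(&mq->mq_bytes_read) ==
        pg_atomic_read_u64(&mq->mq_bytes_written))
        return true;

    return false;
}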
--
Robert Haas
EDB: http://www.enterprisedb.com