Using distinct in an aggregate prevents parallel execution?

Started by Thomas Kellereralmost 8 years ago3 messagesgeneral

Thomas Kellerer

spam_eater@gmx.net

almost 8 years ago

Consider this simplified example:

select c.id,
count(*) as total_orders,
sum(p.price) as total_value
from customer c
join orders o ON c.id = o.customer_id
join order_line ol ON o.id = ol.order_id
join product p ON ol.product_id = p.id
group by c.id;

This uses parallel execution quite nicely: https://explain.depesz.com/s/aSPNn

However, the query is incorrect as it does not count the number of orders, but (essentially) the number of order_lines.

This can easily be fixed using:

select c.id,
count(distinct o.id) as total_orders,
sum(p.price) as total_value
from customer c
join orders o ON c.id = o.customer_id
join order_line ol ON o.id = ol.order_id
join product p ON ol.product_id = p.id
group by c.id;

But in that case Postgres 10.4 decides to no longer use parallel execution: https://explain.depesz.com/s/7Ua3

Which increases the query execution time quite a bit (from 3 to 8 seconds).

Is this a known limitation?

Thomas

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Thomas Kellerer (#1)

Re: Using distinct in an aggregate prevents parallel execution?

Thomas Kellerer <spam_eater@gmx.net> writes:

Is this a known limitation?

Yes, unless somebody has done radical restructuring of the aggregation
code while I wasn't looking.

agg(DISTINCT ...) is currently implemented inside the Agg plan node,
so it's an indivisible black box to everything else. That was a
simple, minimum-code-footprint method for implementing the feature
back when; but it's got lots of drawbacks, and one is that there's
no reasonable way to parallelize.

I'd anticipate that before we could even start to think of parallelizing,
we'd have to split out the distinct-ification processing into a separate
plan node.

agg(... ORDER BY ...) has got the same problem, and it'd likely be
advisable to fix that at the same time.

regards, tom lane

Thomas Kellerer

spam_eater@gmx.net

almost 8 years ago

In reply to: Tom Lane (#2)

Re: Using distinct in an aggregate prevents parallel execution?

Tom Lane schrieb am 06.06.2018 um 16:32:

Thomas Kellerer <spam_eater@gmx.net> writes:

Is this a known limitation?

Yes, unless somebody has done radical restructuring of the aggregation
code while I wasn't looking.

agg(DISTINCT ...) is currently implemented inside the Agg plan node,
so it's an indivisible black box to everything else. That was a
simple, minimum-code-footprint method for implementing the feature
back when; but it's got lots of drawbacks, and one is that there's
no reasonable way to parallelize.

I'd anticipate that before we could even start to think of parallelizing,
we'd have to split out the distinct-ification processing into a separate
plan node.

agg(... ORDER BY ...) has got the same problem, and it'd likely be
advisable to fix that at the same time.

Thansk for the explanation.