plan_rows confusion with parallel queries

Started by Tomas Vondra about 9 years ago, 11 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

while eye-balling some explain plans for parallel queries, I got a bit
confused by the row count estimates. I wonder whether I'm alone.

Consider for example a simple seq scan query, which in non-parallel
explain looks like this:

QUERY PLAN
---------------------------------------------------------------------
Seq Scan on tables t (cost=0.00..16347.60 rows=317160 width=356)
(actual rows=317160 loops=1)
Planning time: 0.173 ms
Execution time: 47.707 ms
(3 rows)

but a parallel plan looks like this:

QUERY PLAN
---------------------------------------------------------------------
Gather (cost=0.00..14199.10 rows=317160 width=356)
(actual rows=317160 loops=1)
Workers Planned: 3
Workers Launched: 3
-> Parallel Seq Scan on tables t (cost=... rows=102310 width=356)
(actual rows=79290 loops=4)
Planning time: 0.209 ms
Execution time: 150.812 ms
(6 rows)

Now, for actual rows we can simply do 79290 * 4 = 317160, and we get the
correct number of rows produced by the plan (i.e. matching the
non-parallel query).

But for the estimate, it doesn't work like that:

102310 * 4 = 409240

which is ~30% above the estimate shown at the Gather. It's clear why this
is happening - when computing plan_rows, we don't count the leader as a
full worker, but use this:

leader_contribution = 1.0 - (0.3 * path->parallel_workers);

so with 3 workers, the leader is only worth ~0.1 of a worker:

102310 * 3.1 = 317161
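
To make the arithmetic explicit, here is that computation distilled into a
small stand-alone C sketch (it just mirrors the snippet above; the
per_worker_rows wrapper is mine for illustration, not PostgreSQL code):

    #include <stdio.h>

    /* Per-worker row estimate, using the leader_contribution formula above. */
    static double
    per_worker_rows(double total_rows, int parallel_workers)
    {
        double divisor = parallel_workers;
        double leader_contribution = 1.0 - (0.3 * parallel_workers);

        if (leader_contribution > 0)
            divisor += leader_contribution;    /* 3 workers -> divisor 3.1 */

        return total_rows / divisor;
    }

    int
    main(void)
    {
        printf("%.0f\n", per_worker_rows(317160, 3));  /* prints 102310 */
        return 0;
    }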

It's fairly easy to spot this here, because the Gather node is right
above the Parallel Seq Scan, and the values at the Gather are accurate.
But in many plans the Gather will not be immediately above the node in
question (e.g. there may be a parallel aggregate in between).

Of course, the fact that we use the planned number of workers when
computing plan_rows but the actual number of workers for the rows actually
produced makes this even more confusing.

BTW is it really a good idea to use nloops to track the number of
workers executing a given node? How will that work once we get parallel
nested loops and index scans?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#1)
Re: plan_rows confusion with parallel queries

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

while eye-balling some explain plans for parallel queries, I got a bit
confused by the row count estimates. I wonder whether I'm alone.

I got confused by that a minute ago, so no you're not alone. The problem
is even worse in join cases. For example:

Gather (cost=34332.00..53265.35 rows=100 width=8)
Workers Planned: 2
-> Hash Join (cost=33332.00..52255.35 rows=100 width=8)
Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
-> Append (cost=0.00..8614.96 rows=417996 width=8)
-> Parallel Seq Scan on pp (cost=0.00..8591.67 rows=416667 width=8)
-> Parallel Seq Scan on pp1 (cost=0.00..23.29 rows=1329 width=8)
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8)
-> Seq Scan on cc (cost=0.00..14425.00 rows=1000000 width=8)

There are actually 1000000 rows in pp, and none in pp1. I'm not bothered
particularly by the nonzero estimate for pp1, because I know where that
came from, but I'm not very happy that nowhere here does it look like
it's estimating a million-plus rows going into the join.

regards, tom lane

#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: plan_rows confusion with parallel queries

On 11/02/2016 09:00 PM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

while eye-balling some explain plans for parallel queries, I got a bit
confused by the row count estimates. I wonder whether I'm alone.

I got confused by that a minute ago, so no you're not alone. The problem
is even worse in join cases. For example:

Gather (cost=34332.00..53265.35 rows=100 width=8)
Workers Planned: 2
-> Hash Join (cost=33332.00..52255.35 rows=100 width=8)
Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
-> Append (cost=0.00..8614.96 rows=417996 width=8)
-> Parallel Seq Scan on pp (cost=0.00..8591.67 rows=416667 width=8)
-> Parallel Seq Scan on pp1 (cost=0.00..23.29 rows=1329 width=8)
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8)
-> Seq Scan on cc (cost=0.00..14425.00 rows=1000000 width=8)

There are actually 1000000 rows in pp, and none in pp1. I'm not bothered
particularly by the nonzero estimate for pp1, because I know where that
came from, but I'm not very happy that nowhere here does it look like
it's estimating a million-plus rows going into the join.

Yeah. I wonder how tools visualizing explain plans are going to compute
the time spent in a given node (i.e. excluding the time spent in child
nodes), or the expected cost of that node.

So far we could do something like

self_time = total_time - child_node_time * nloops

and for example in this plan it's pretty clear we spend ~135ms in the Aggregate:

QUERY PLAN
----------------------------------------------------------------------------
Aggregate (cost=17140.50..17140.51 rows=1 width=8)
(actual time=306.675..306.675 rows=1 loops=1)
-> Seq Scan on tables (cost=0.00..16347.60 rows=317160 width=0)
(actual time=0.188..170.993 rows=317160 loops=1)
Planning time: 0.201 ms
Execution time: 306.860 ms
(4 rows)
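
(Using the numbers above: self_time = 306.675 - 170.993 * 1 ≈ 135.7 ms.)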

But in parallel plans it can easily happen that

child_node_time * nloops > total_time

Consider for example this parallel plan:

QUERY PLAN
----------------------------------------------------------------------------
Finalize Aggregate (cost=15455.19..15455.20 rows=1 width=8)
(actual time=107.636..107.636 rows=1 loops=1)
-> Gather (cost=15454.87..15455.18 rows=3 width=8)
(actual time=107.579..107.629 rows=4 loops=1)
Workers Planned: 3
Workers Launched: 3
-> Partial Aggregate (cost=14454.87..14454.88 rows=1 ...)
(actual time=103.895..103.895 rows=1 loops=4)
-> Parallel Seq Scan on tables
(cost=0.00..14199.10 rows=102310 width=0)
(actual time=0.059..59.217 rows=79290 loops=4)
Planning time: 0.052 ms
Execution time: 109.250 ms
(8 rows)
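
Here the simple formula breaks down: for the Gather, child_node_time *
nloops is 103.895 * 4 ≈ 415.6 ms, which already exceeds the Gather's
total_time of 107.629 ms, so the subtraction would come out negative.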

Reading explains for parallel plans will always be complicated, but
perhaps overloading nloops like this makes it even more complicated?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#3)
Re: plan_rows confusion with parallel queries

On 11/02/2016 11:56 PM, Tomas Vondra wrote:

On 11/02/2016 09:00 PM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

while eye-balling some explain plans for parallel queries, I got a bit
confused by the row count estimates. I wonder whether I'm alone.

I got confused by that a minute ago, so no you're not alone. The problem
is even worse in join cases. For example:

Gather (cost=34332.00..53265.35 rows=100 width=8)
Workers Planned: 2
-> Hash Join (cost=33332.00..52255.35 rows=100 width=8)
Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
-> Append (cost=0.00..8614.96 rows=417996 width=8)
-> Parallel Seq Scan on pp (cost=0.00..8591.67 rows=416667 width=8)
-> Parallel Seq Scan on pp1 (cost=0.00..23.29 rows=1329 width=8)
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8)
-> Seq Scan on cc (cost=0.00..14425.00 rows=1000000 width=8)

There are actually 1000000 rows in pp, and none in pp1. I'm not bothered
particularly by the nonzero estimate for pp1, because I know where that
came from, but I'm not very happy that nowhere here does it look like
it's estimating a million-plus rows going into the join.

Although - it is estimating 1M rows, but only "per worker" estimates are
shown, and because there are 2 workers planned it says 1M/2.4 which is
the 416k. I agree it's a bit unclear, but at least it's consistent with
how we treat loops (i.e. that the numbers are per loop).
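
(To spell out the arithmetic: the divisor for 2 planned workers is
2 + (1.0 - 0.3 * 2) = 2.4, and 1,000,000 / 2.4 ≈ 416,667, which matches
the Parallel Seq Scan on pp above.)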

But there's more fun with joins - consider for example this simple join:

QUERY PLAN
------------------------------------------------------------------------------
Gather (cost=19515.96..43404.82 rows=96957 width=12)
(actual time=295.167..746.312 rows=99999 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Join (cost=18515.96..32709.12 rows=96957 width=12)
(actual time=249.281..670.309 rows=33333 loops=3)
Hash Cond: (t2.a = t1.a)
-> Parallel Seq Scan on t2
(cost=0.00..8591.67 rows=416667 width=8)
(actual time=0.100..184.315 rows=333333 loops=3)
-> Hash (cost=16925.00..16925.00 rows=96957 width=8)
(actual time=246.760..246.760 rows=99999 loops=3)
Buckets: 131072 Batches: 2 Memory Usage: 2976kB
-> Seq Scan on t1
(cost=0.00..16925.00 rows=96957 width=8)
(actual time=0.065..178.385 rows=99999 loops=3)
Filter: (b < 100000)
Rows Removed by Filter: 900001
Planning time: 0.763 ms
Execution time: 793.653 ms
(13 rows)

Suddenly we don't show per-worker estimates for the hash join - both the
Hash Join and the Gather have exactly the same cardinality estimate.

Now, let's try forcing Nested Loops and see what happens:

QUERY PLAN
-----------------------------------------------------------------------------
Gather (cost=1000.42..50559.65 rows=96957 width=12)
(actual time=0.610..203.694 rows=99999 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=0.42..39863.95 rows=96957 width=12)
(actual time=0.222..182.755 rows=33333 loops=3)
-> Parallel Seq Scan on t1
(cost=0.00..9633.33 rows=40399 width=8)
(actual time=0.030..40.358 rows=33333 loops=3)
Filter: (b < 100000)
Rows Removed by Filter: 300000
-> Index Scan using t2_a_idx on t2
(cost=0.42..0.74 rows=1 width=8)
(actual time=0.002..0.002 rows=1 loops=99999)
Index Cond: (a = t1.a)
Planning time: 0.732 ms
Execution time: 250.707 ms
(11 rows)

So, different join method but same result - 2 workers, loops=3. But
let's try with small tables (100k rows instead of 1M rows):

QUERY PLAN
----------------------------------------------------------------------------
Gather (cost=0.29..36357.94 rows=100118 width=12)
(actual time=13.219..589.723 rows=100000 loops=1)
Workers Planned: 1
Workers Launched: 1
Single Copy: true
-> Nested Loop (cost=0.29..36357.94 rows=100118 width=12)
(actual time=0.288..442.821 rows=100000 loops=1)
-> Seq Scan on t1 (cost=0.00..1444.18 rows=100118 width=8)
(actual time=0.148..49.308 rows=100000 loops=1)
-> Index Scan using t2_a_idx on t2
(cost=0.29..0.34 rows=1 width=8)
(actual time=0.002..0.002 rows=1 loops=100000)
Index Cond: (a = t1.a)
Planning time: 0.483 ms
Execution time: 648.941 ms
(10 rows)

Suddenly, we get nworkers=1 with loops=1 (and not nworkers+1 as before).
FWIW I've only seen this with force_parallel_mode=on, and the row counts
are correct, so perhaps that's OK. single_copy seems a bit
underdocumented, though.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#4)
Re: plan_rows confusion with parallel queries

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On 11/02/2016 11:56 PM, Tomas Vondra wrote:

On 11/02/2016 09:00 PM, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

while eye-balling some explain plans for parallel queries, I got a bit
confused by the row count estimates. I wonder whether I'm alone.

I got confused by that a minute ago, so no you're not alone. The problem
is even worse in join cases. For example:
Gather (cost=34332.00..53265.35 rows=100 width=8)
Workers Planned: 2
-> Hash Join (cost=33332.00..52255.35 rows=100 width=8)
Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
-> Append (cost=0.00..8614.96 rows=417996 width=8)
-> Parallel Seq Scan on pp (cost=0.00..8591.67 rows=416667 width=8)
-> Parallel Seq Scan on pp1 (cost=0.00..23.29 rows=1329 width=8)
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8)
-> Seq Scan on cc (cost=0.00..14425.00 rows=1000000 width=8)
There are actually 1000000 rows in pp, and none in pp1. I'm not bothered
particularly by the nonzero estimate for pp1, because I know where that
came from, but I'm not very happy that nowhere here does it look like
it's estimating a million-plus rows going into the join.

Although - it is estimating 1M rows, but only "per worker" estimates are
shown, and because there are 2 workers planned it says 1M/2.4 which is
the 416k. I agree it's a bit unclear, but at least it's consistent with
how we treat loops (i.e. that the numbers are per loop).

Well, it's not *that* consistent. If we were estimating all the numbers
underneath the Gather as being per-worker numbers, that would make some
amount of sense. But neither the other seqscan, nor the hash on it, nor
the hashjoin's output count are scaled that way. It's very hard to call
the above display anything but flat-out broken.

But there's more fun with joins - consider for example this simple join:
...
Suddenly we don't show per-worker estimates for the hash join - both the
Hash Join and the Gather have exactly the same cardinality estimate.

Yeah. That doesn't seem to be quite the same problem as in my example,
but it's about as confused.

Maybe we need to bite the bullet and add a "number of workers" field
to the estimated and actual counts. Not sure how much that helps for
the partial-count-for-the-leader issue, though.

regards, tom lane

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#1)
Re: plan_rows confusion with parallel queries

On Wed, Nov 2, 2016 at 2:42 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

BTW is it really a good idea to use nloops to track the number of workers
executing a given node? How will that work once we get parallel nested
loops and index scans?

We already have parallel nested loops with inner index scans.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: plan_rows confusion with parallel queries

On Wed, Nov 2, 2016 at 4:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

while eye-balling some explain plans for parallel queries, I got a bit
confused by the row count estimates. I wonder whether I'm alone.

I got confused by that a minute ago, so no you're not alone. The problem
is even worse in join cases. For example:

Gather (cost=34332.00..53265.35 rows=100 width=8)
Workers Planned: 2
-> Hash Join (cost=33332.00..52255.35 rows=100 width=8)
Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
-> Append (cost=0.00..8614.96 rows=417996 width=8)
-> Parallel Seq Scan on pp (cost=0.00..8591.67 rows=416667 width=8)
-> Parallel Seq Scan on pp1 (cost=0.00..23.29 rows=1329 width=8)
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8)
-> Seq Scan on cc (cost=0.00..14425.00 rows=1000000 width=8)

There are actually 1000000 rows in pp, and none in pp1. I'm not bothered
particularly by the nonzero estimate for pp1, because I know where that
came from, but I'm not very happy that nowhere here does it look like
it's estimating a million-plus rows going into the join.

I welcome suggestions for improvement, but you will note that if the
row count didn't reflect some kind of guess about the number of rows
that each individual worker will see, the costing would be hopelessly
broken. The cost needs to reflect a guess about the time the query
will finish, not the total amount of effort expended.
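
That is also why cost_seqscan divides the CPU run cost by the same
parallel divisor (workers plus the leader's estimated contribution, e.g.
3.1 for 3 workers): the node's cost is meant to approximate the elapsed
time of one participant, not the summed effort of all of them.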

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#4)
Re: plan_rows confusion with parallel queries

On Wed, Nov 2, 2016 at 10:44 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Although - it is estimating 1M rows, but only "per worker" estimates are
shown, and because there are 2 workers planned it says 1M/2.4 which is the
416k. I agree it's a bit unclear, but at least it's consistent with how we
treat loops (i.e. that the numbers are per loop).

Right. Which I think was a horrible decision. I think that it would
be best to change EXPLAIN so that the row counts and costs are never
divided by nloops. That would be a backward-incompatible change, but
I think it would be worth it. What you typically want to understand
is the total effort expended in a particular plan node, and the
current system makes that incredibly difficult to understand,
especially because we then round off the row count estimates to the
nearest integer, so that you can't even reverse the division if you
want to (which you always do).
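
To illustrate with the hash join plan quoted below: rows=33333 with
loops=3 could correspond to any true total from 99998 to 100000 once the
per-loop value has been rounded, so the exact total can no longer be
recovered from what EXPLAIN prints.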

But there's more fun with joins - consider for example this simple join:

QUERY PLAN
------------------------------------------------------------------------------
Gather (cost=19515.96..43404.82 rows=96957 width=12)
(actual time=295.167..746.312 rows=99999 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Join (cost=18515.96..32709.12 rows=96957 width=12)
(actual time=249.281..670.309 rows=33333 loops=3)
Hash Cond: (t2.a = t1.a)
-> Parallel Seq Scan on t2
(cost=0.00..8591.67 rows=416667 width=8)
(actual time=0.100..184.315 rows=333333 loops=3)
-> Hash (cost=16925.00..16925.00 rows=96957 width=8)
(actual time=246.760..246.760 rows=99999 loops=3)
Buckets: 131072 Batches: 2 Memory Usage: 2976kB
-> Seq Scan on t1
(cost=0.00..16925.00 rows=96957 width=8)
(actual time=0.065..178.385 rows=99999 loops=3)
Filter: (b < 100000)
Rows Removed by Filter: 900001
Planning time: 0.763 ms
Execution time: 793.653 ms
(13 rows)

Suddenly we don't show per-worker estimates for the hash join - both the
Hash Join and the Gather have exactly the same cardinality estimate.

I'm not sure why that's happening, but I haven't made any changes to
the costing for a node like hash join. It doesn't treat the parallel
sequential scan that is coming as its first input any differently than
it would if that were a non-parallel plan. It's just costing the join
normally, based on an input row count that is lower than what it would
be if it were going to see every row from t2 rather than only some of
them.

So, different join method but same result - 2 workers, loops=3. But let's
try with small tables (100k rows instead of 1M rows):

QUERY PLAN
----------------------------------------------------------------------------
Gather (cost=0.29..36357.94 rows=100118 width=12)
(actual time=13.219..589.723 rows=100000 loops=1)
Workers Planned: 1
Workers Launched: 1
Single Copy: true
-> Nested Loop (cost=0.29..36357.94 rows=100118 width=12)
(actual time=0.288..442.821 rows=100000 loops=1)
-> Seq Scan on t1 (cost=0.00..1444.18 rows=100118 width=8)
(actual time=0.148..49.308 rows=100000 loops=1)
-> Index Scan using t2_a_idx on t2
(cost=0.29..0.34 rows=1 width=8)
(actual time=0.002..0.002 rows=1 loops=100000)
Index Cond: (a = t1.a)
Planning time: 0.483 ms
Execution time: 648.941 ms
(10 rows)

Suddenly, we get nworkers=1 with loops=1 (and not nworkers+1 as before).
FWIW I've only seen this with force_parallel_mode=on, and the row counts are
correct, so perhaps that's OK. single_copy seems a bit underdocumented,
though.

This is entirely as expected. Single-copy means that
there's one process running the non-parallel plan beneath it, and
that's it. So the Gather is just a pass-through node here, like a
Materialize or Sort: the number of input rows and the number of output
rows have to be the same.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)
Re: plan_rows confusion with parallel queries

On Wed, Nov 2, 2016 at 10:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I got confused by that a minute ago, so no you're not alone. The problem
is even worse in join cases. For example:
Gather (cost=34332.00..53265.35 rows=100 width=8)
Workers Planned: 2
-> Hash Join (cost=33332.00..52255.35 rows=100 width=8)
Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
-> Append (cost=0.00..8614.96 rows=417996 width=8)
-> Parallel Seq Scan on pp (cost=0.00..8591.67 rows=416667 width=8)
-> Parallel Seq Scan on pp1 (cost=0.00..23.29 rows=1329 width=8)
-> Hash (cost=14425.00..14425.00 rows=1000000 width=8)
-> Seq Scan on cc (cost=0.00..14425.00 rows=1000000 width=8)

Although - it is estimating 1M rows, but only "per worker" estimates are
shown, and because there are 2 workers planned it says 1M/2.4 which is
the 416k. I agree it's a bit unclear, but at least it's consistent with
how we treat loops (i.e. that the numbers are per loop).

Well, it's not *that* consistent. If we were estimating all the numbers
underneath the Gather as being per-worker numbers, that would make some
amount of sense. But neither the other seqscan, nor the hash on it, nor
the hashjoin's output count are scaled that way. It's very hard to call
the above display anything but flat-out broken.

While investigating why Rushabh Lathia's Gather Merge patch sometimes
fails to pick a Gather Merge plan even when it really ought to do so,
I ran smack into this problem. I discovered that this is more than a
cosmetic issue. The costing itself is actually badly broken. In the
single-table case, when you have just ...

Gather
-> Parallel Seq Scan

...the Parallel Seq Scan node reflects a per-worker row estimate, and
the Gather node reflects a total row estimate. But in the join case,
as shown above, the Gather thinks that the total number of rows which
it will produce is equal to the number that will be produced by one
single worker, which is crap, and the cost of doing the join in
parallel is based on the per-worker rather than the total number,
which is crappier. The difference in cost between the Gather and the
underlying join in the above example is exactly 1010, namely 1000 for
parallel_setup_cost and 100 tuples at 0.1 per tuple, even though 100
is the number of tuples per-worker, not the total number. That's
really not good. I probably should have realized this when I looked
at this thread the first time, but I somehow got it into my head that
this was just a complaint about the imperfections of the display
(which is indeed imperfect) and failed to realize that the same report
was also pointing to an actual costing bug. I apologize for that.
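
(For reference, the arithmetic above: 53265.35 - 52255.35 = 1010 =
parallel_setup_cost of 1000 plus 100 tuples times parallel_tuple_cost of
0.1, at the default settings.)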

The reason why this is happening is that final_cost_nestloop(),
final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
about whether the path they are generating is partial. They apply the
row estimate for the joinrel itself to every such path generated for
the join, except for parameterized paths which are a special case. I
think this generally has the effect of discouraging parallel joins,
because the inflated row count also inflates the join cost. I think
the right thing to do is probably to scale the row count estimate for
the joinrel's partial paths by the leader_contribution value computed
in cost_seqscan.

Despite my general hatred of back-patching things that cause plan
changes, I'm inclined to think the fix for this should be back-patched
to 9.6, because this is really a brown-paper-bag bug. If the
consensus is otherwise I will of course defer to that consensus.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#9)
1 attachment(s)
Re: plan_rows confusion with parallel queries

On Wed, Jan 11, 2017 at 1:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Well, it's not *that* consistent. If we were estimating all the numbers
underneath the Gather as being per-worker numbers, that would make some
amount of sense. But neither the other seqscan, nor the hash on it, nor
the hashjoin's output count are scaled that way. It's very hard to call
the above display anything but flat-out broken.

While investigating why Rushabh Lathia's Gather Merge patch sometimes
fails to pick a Gather Merge plan even when it really ought to do so,
I ran smack into this problem. I discovered that this is more than a
cosmetic issue. The costing itself is actually badly broken.

The reason why this is happening is that final_cost_nestloop(),
final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
about whether the path they are generating is partial. They apply the
row estimate for the joinrel itself to every such path generated for
the join, except for parameterized paths which are a special case. I
think this generally has the effect of discouraging parallel joins,
because the inflated row count also inflates the join cost. I think
the right thing to do is probably to scale the row count estimate for
the joinrel's partial paths by the leader_contribution value computed
in cost_seqscan.

Despite my general hatred of back-patching things that cause plan
changes, I'm inclined to think the fix for this should be back-patched
to 9.6, because this is really a brown-paper-bag bug. If the
consensus is otherwise I will of course defer to that consensus.

And here is a patch which seems to fix the problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-join-rows-v1.patch (application/x-download)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a52eb7e..458f139 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -161,6 +161,7 @@ static Selectivity get_foreign_key_join_selectivity(PlannerInfo *root,
 static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
 static double relation_byte_size(double tuples, int width);
 static double page_size(double tuples, int width);
+static double get_parallel_divisor(Path *path);
 
 
 /*
@@ -238,32 +239,7 @@ cost_seqscan(Path *path, PlannerInfo *root,
 	/* Adjust costing for parallelism, if used. */
 	if (path->parallel_workers > 0)
 	{
-		double		parallel_divisor = path->parallel_workers;
-		double		leader_contribution;
-
-		/*
-		 * Early experience with parallel query suggests that when there is
-		 * only one worker, the leader often makes a very substantial
-		 * contribution to executing the parallel portion of the plan, but as
-		 * more workers are added, it does less and less, because it's busy
-		 * reading tuples from the workers and doing whatever non-parallel
-		 * post-processing is needed.  By the time we reach 4 workers, the
-		 * leader no longer makes a meaningful contribution.  Thus, for now,
-		 * estimate that the leader spends 30% of its time servicing each
-		 * worker, and the remainder executing the parallel plan.
-		 */
-		leader_contribution = 1.0 - (0.3 * path->parallel_workers);
-		if (leader_contribution > 0)
-			parallel_divisor += leader_contribution;
-
-		/*
-		 * In the case of a parallel plan, the row count needs to represent
-		 * the number of tuples processed per worker.  Otherwise, higher-level
-		 * plan nodes that appear below the gather will be costed incorrectly,
-		 * because they'll anticipate receiving more rows than any given copy
-		 * will actually get.
-		 */
-		path->rows = clamp_row_est(path->rows / parallel_divisor);
+		double		parallel_divisor = get_parallel_divisor(path);
 
 		/* The CPU cost is divided among all the workers. */
 		cpu_run_cost /= parallel_divisor;
@@ -274,6 +250,12 @@ cost_seqscan(Path *path, PlannerInfo *root,
 		 * prefetching.  For now, we assume that the disk run cost can't be
 		 * amortized at all.
 		 */
+
+		/*
+		 * In the case of a parallel plan, the row count needs to represent
+		 * the number of tuples processed per worker.
+		 */
+		path->rows = clamp_row_est(path->rows / parallel_divisor);
 	}
 
 	path->startup_cost = startup_cost;
@@ -2013,6 +1995,10 @@ final_cost_nestloop(PlannerInfo *root, NestPath *path,
 	else
 		path->path.rows = path->path.parent->rows;
 
+	/* For partial paths, scale row estimate. */
+	if (path->path.parallel_workers > 0)
+		path->path.rows /= get_parallel_divisor(&path->path);
+
 	/*
 	 * We could include disable_cost in the preliminary estimate, but that
 	 * would amount to optimizing for the case where the join method is
@@ -2431,6 +2417,10 @@ final_cost_mergejoin(PlannerInfo *root, MergePath *path,
 	else
 		path->jpath.path.rows = path->jpath.path.parent->rows;
 
+	/* For partial paths, scale row estimate. */
+	if (path->jpath.path.parallel_workers > 0)
+		path->jpath.path.rows /= get_parallel_divisor(&path->jpath.path);
+
 	/*
 	 * We could include disable_cost in the preliminary estimate, but that
 	 * would amount to optimizing for the case where the join method is
@@ -2810,6 +2800,10 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
 	else
 		path->jpath.path.rows = path->jpath.path.parent->rows;
 
+	/* For partial paths, scale row estimate. */
+	if (path->jpath.path.parallel_workers > 0)
+		path->jpath.path.rows /= get_parallel_divisor(&path->jpath.path);
+
 	/*
 	 * We could include disable_cost in the preliminary estimate, but that
 	 * would amount to optimizing for the case where the join method is
@@ -4798,3 +4792,31 @@ page_size(double tuples, int width)
 {
 	return ceil(relation_byte_size(tuples, width) / BLCKSZ);
 }
+
+/*
+ * Estimate the fraction of the work that each worker will do given the
+ * number of workers budgeted for the path.
+ */
+static double
+get_parallel_divisor(Path *path)
+{
+	double		parallel_divisor = path->parallel_workers;
+	double		leader_contribution;
+
+	/*
+	 * Early experience with parallel query suggests that when there is only
+	 * one worker, the leader often makes a very substantial contribution to
+	 * executing the parallel portion of the plan, but as more workers are
+	 * added, it does less and less, because it's busy reading tuples from the
+	 * workers and doing whatever non-parallel post-processing is needed.  By
+	 * the time we reach 4 workers, the leader no longer makes a meaningful
+	 * contribution.  Thus, for now, estimate that the leader spends 30% of
+	 * its time servicing each worker, and the remainder executing the
+	 * parallel plan.
+	 */
+	leader_contribution = 1.0 - (0.3 * path->parallel_workers);
+	if (leader_contribution > 0)
+		parallel_divisor += leader_contribution;
+
+	return parallel_divisor;
+}
#11 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#10)
Re: plan_rows confusion with parallel queries

On Wed, Jan 11, 2017 at 4:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:

While investigating why Rushabh Lathia's Gather Merge patch sometimes
fails to pick a Gather Merge plan even when it really ought to do so,
I ran smack into this problem. I discovered that this is more than a
cosmetic issue. The costing itself is actually badly broken.

The reason why this is happening is that final_cost_nestloop(),
final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
about whether the path they are generating is partial. They apply the
row estimate for the joinrel itself to every such path generated for
the join, except for parameterized paths which are a special case. I
think this generally has the effect of discouraging parallel joins,
because the inflated row count also inflates the join cost. I think
the right thing to do is probably to scale the row count estimate for
the joinrel's partial paths by the leader_contribution value computed
in cost_seqscan.

Despite my general hatred of back-patching things that cause plan
changes, I'm inclined to think the fix for this should be back-patched
to 9.6, because this is really a brown-paper-bag bug. If the
consensus is otherwise I will of course defer to that consensus.

And here is a patch which seems to fix the problem.

Since nobody seems to have any comment here, I've committed and
back-patched this to 9.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
