Why does row estimation on nested loop make no sense to me
On most nested loops that I do explain/explain analyze on, the row estimation for the nested-loop itself is a product of the inner nodes of the nested loop.
However in this case, I am stumped!
explain
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345
Nested Loop (cost=0.00..2903.37 rows=29107 width=4)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4)
Index Cond: (user_id = 10954)
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8)
Index Cond: (parent_entity = ue.entity_id)
How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables being joined?
Not only does it exceed it - but it is orders of magnitude greater.
Am I missing something obvious here? I an see the nested loop row estimate being LESS but certainly not more.
PostgreSQL 9.2.4 on x86_64-pc-solaris2.10, compiled by gcc (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath), 64-bit
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Jeff Amiel <becauseimjeff@yahoo.com> writes:
How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables being joined?
Not only does it exceed it - but it is orders of magnitude greater.
Can you provide a self-contained test case that does this?
regards, tom lane
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Sat, May 18, 2013 at 1:25 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
On most nested loops that I do explain/explain analyze on, the row estimation for the nested-loop itself is a product of the inner nodes of the nested loop.
However in this case, I am stumped!explain
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345Nested Loop (cost=0.00..2903.37 rows=29107 width=4)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4)
Index Cond: (user_id = 10954)
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8)
Index Cond: (parent_entity = ue.entity_id)How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables being joined?
Not only does it exceed it - but it is orders of magnitude greater.Am I missing something obvious here? I an see the nested loop row estimate being LESS but certainly not more.
Can you also post the output of explain analyze <your-query>?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Can you provide a self-contained test case that does this?
That response scares me.
:)
I can try - Every other table set (small, easy to experiment with) returns results as expected -
Is the implication that this looks 'unusual'?
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
----- Original Message -----
From: Amit Langote <amitlangote09@gmail.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Friday, May 17, 2013 11:37 AM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense to me
On Sat, May 18, 2013 at 1:25 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
On most nested loops that I do explain/explain analyze on, the row estimation for the nested-loop itself is a product of the inner nodes of the nested loop.
However in this case, I am stumped!explain
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345Nested Loop (cost=0.00..2903.37 rows=29107 width=4)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4)
Index Cond: (user_id = 10954)
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8)
Index Cond: (parent_entity = ue.entity_id)How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables being joined?
Not only does it exceed it - but it is orders of magnitude greater.Am I missing something obvious here? I an see the nested loop row estimate being LESS but certainly not more.
Can you also post the output of explain analyze <your-query>?
I'm not worried about performance (per se) but the row estimation issue which propagates up as part of a bigger query. But here ya go:
explain analyze
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345
Nested Loop (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012 rows=1 loops=1)
Index Cond: (user_id = 12345)
Heap Fetches: 1
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164 rows=201 loops=1)
Index Cond: (parent_entity = ue.entity_id)
Total runtime: 0.361 ms
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Sat, May 18, 2013 at 1:47 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
----- Original Message -----
From: Amit Langote <amitlangote09@gmail.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Friday, May 17, 2013 11:37 AM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense to meOn Sat, May 18, 2013 at 1:25 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
On most nested loops that I do explain/explain analyze on, the row estimation for the nested-loop itself is a product of the inner nodes of the nested loop.
However in this case, I am stumped!explain
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345Nested Loop (cost=0.00..2903.37 rows=29107 width=4)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4)
Index Cond: (user_id = 10954)
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8)
Index Cond: (parent_entity = ue.entity_id)How can the estimated number of rows for the nested loop node EXCEED the product of the 2 row estimates of the tables being joined?
Not only does it exceed it - but it is orders of magnitude greater.Am I missing something obvious here? I an see the nested loop row estimate being LESS but certainly not more.
Can you also post the output of explain analyze <your-query>?
I'm not worried about performance (per se) but the row estimation issue which propagates up as part of a bigger query. But here ya go:
explain analyze
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345Nested Loop (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012 rows=1 loops=1)
Index Cond: (user_id = 12345)
Heap Fetches: 1
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164 rows=201 loops=1)
Index Cond: (parent_entity = ue.entity_id)
Total runtime: 0.361 ms
Have you tried analyze (it's probably a case of insufficient/outdated
statistics to planner's disposal) or probably consider changing
default_statistics_target?
--
Amit Langote
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
explain analyze
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345Nested Loop (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012 rows=1 loops=1)
Index Cond: (user_id = 12345)
Heap Fetches: 1
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164 rows=201 loops=1)
Index Cond: (parent_entity = ue.entity_id)
Total runtime: 0.361 ms
Have you tried analyze (it's probably a case of insufficient/outdated
statistics to planner's disposal) or probably consider changing
default_statistics_target?
Again - my question revolves not around the whether or not I am getting good or bad estimates - my question is related to the fact that the nested-loop row estimation does not appear to be derived from the nodes below it - it is off by orders of magnitude. I've never seen this before.
That aside, yes - I did analyze and tweak stats target during experimentation - no change.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
explain analyze
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345Nested Loop (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012 rows=1 loops=1)
Index Cond: (user_id = 12345)
Heap Fetches: 1
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164 rows=201 loops=1)
Index Cond: (parent_entity = ue.entity_id)
Total runtime: 0.361 ms
I noticed when the explain output in your first mail shows Index Cond:
(user_id = 10954) whereas your query says: ue.user_id=12345. Something
with that? Although, your explain analyze does show the same values at
both places with the row estimate being 29107 in both cases, which,
well, looks awful and quite unexpected though there seem to have been
similar observations before
Have you tried analyze (it's probably a case of insufficient/outdated
statistics to planner's disposal) or probably consider changing
default_statistics_target?Again - my question revolves not around the whether or not I am getting good or bad estimates - my question is related to the fact that the nested-loop row estimation does not appear to be derived from the nodes below it - it is off by orders of magnitude. I've never seen this before.
That aside, yes - I did analyze and tweak stats target during experimentation - no change.
Did you also check select count(*) on both the relations and found
related numbers?
--
Amit Langote
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
----- Original Message -----
From: Amit Langote <amitlangote09@gmail.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Friday, May 17, 2013 2:21 PM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense to me
explain analyze
select era.child_entity from entity_rel era join user_entity ue on ue.entity_id = era.parent_entity and ue.user_id=12345Nested Loop (cost=0.00..2903.37 rows=29107 width=4) (actual time=0.028..0.274 rows=201 loops=1)
-> Index Only Scan using entity_pk on user_entity ue (cost=0.00..62.68 rows=2 width=4) (actual time=0.011..0.012 rows=1 loops=1)
Index Cond: (user_id = 12345)
Heap Fetches: 1
-> Index Scan using rel_parent on entity_rel era (cost=0.00..1261.85 rows=317 width=8) (actual time=0.013..0.164 rows=201 loops=1)
Index Cond: (parent_entity = ue.entity_id)
Total runtime: 0.361 ms
I noticed when the explain output in your first mail shows Index Cond:
(user_id = 10954) whereas your query says: ue.user_id=12345. Something
with that? Although, your explain analyze does show the same values at
both places with the row estimate being 29107 in both cases, which,
well, looks awful and quite unexpected though there seem to have been
similar observations before
That was a weak attempt at hiding 'real' data - intended to change them all to 12345.
:)
Did you also check select count(*) on both the relations and found
related numbers?
Nothing related (that I could find) on the rowcounts - one table has 20 million rows or so ad the other 65K.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Not sure if it helps, but it's apparently not a very rare thing.
Quick analysis on data from explain.depesz.com showed that > 12% of
plans with nested loop have such estimate.
Couple of examples:
http://explain.depesz.com/s/Qm4
http://explain.depesz.com/s/qmW
http://explain.depesz.com/s/qnG
http://explain.depesz.com/s/QO
http://explain.depesz.com/s/qov
http://explain.depesz.com/s/qqb
http://explain.depesz.com/s/QqF
http://explain.depesz.com/s/qQO
http://explain.depesz.com/s/qrI
http://explain.depesz.com/s/QRK
http://explain.depesz.com/s/QUX9
http://explain.depesz.com/s/QvN
http://explain.depesz.com/s/QWL
http://explain.depesz.com/s/r4F
http://explain.depesz.com/s/R7q
http://explain.depesz.com/s/r8
http://explain.depesz.com/s/R8
http://explain.depesz.com/s/RaB
http://explain.depesz.com/s/RbV
http://explain.depesz.com/s/Rc7
all these plans are public and not anonymized.
depesz
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Ok - I agree -
Can somebody help me understand where the row estimates come from on a nested-loop operation in postgres then?
----- Original Message -----
From: hubert depesz lubaczewski <depesz@depesz.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Saturday, May 18, 2013 3:39 AM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense
to me
Not sure if it helps, but it's apparently not a very rare thing.
Quick analysis on data from explain.depesz.com showed that > 12% of
plans with nested loop have such estimate.
Couple of examples:
http://explain.depesz.com/s/Qm4
http://explain.depesz.com/s/qmW
http://explain.depesz.com/s/qnG
http://explain.depesz.com/s/QO
http://explain.depesz.com/s/qov
...
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Mon, May 20, 2013 at 11:01 PM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
Ok - I agree -
Can somebody help me understand where the row estimates come from on a nested-loop operation in postgres then?
In case you haven't noticed already in the documentation, there are
following lines:
"... It might appear from inspection of the EXPLAIN output that the
estimate of join rows comes from 50 * 1, that is, the number of outer
rows times the estimated number of rows obtained by each inner index
scan on tenk2. But this is not the case: *the join relation size is
estimated before any particular join plan has been considered*. If
everything is working well then the two ways of estimating the join
size will produce about the same answer, but due to roundoff error and
other factors they sometimes diverge significantly."
Read more at: http://www.postgresql.org/docs/9.2/static/row-estimation-examples.html
It also refers where in source code these table size estimations are done.
Hope this helps.
--
Amit Langote
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
I also found one other discussion which has similar issues addressed:
http://postgresql.1045698.n5.nabble.com/Bogus-nestloop-rows-estimate-in-8-4-7-td5710254.html
--
Amit Langote
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Thanks much!
(sorry for top-posting, yahoo email sucks)
----- Original Message -----
From: Amit Langote <amitlangote09@gmail.com>
To: Jeff Amiel <becauseimjeff@yahoo.com>
Cc: "depesz@depesz.com" <depesz@depesz.com>; "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Sent: Monday, May 20, 2013 9:51 AM
Subject: Re: [GENERAL] Why does row estimation on nested loop make no sense to me
I also found one other discussion which has similar issues addressed:
http://postgresql.1045698.n5.nabble.com/Bogus-nestloop-rows-estimate-in-8-4-7-td5710254.html
--
Amit Langote
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On Tue, May 21, 2013 at 12:43 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
Thanks much!
(sorry for top-posting, yahoo email sucks)
I wonder if you could arrive at some conclusions with the statistics
(pg_stats) you have and the join selectivity formulas described in the
referred documentation link. I would like to know if you still get the
same row estimates (after explain) and also if possible, the value
that is computed from that formula. Do they resemble each other?
--
Amit Langote
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general