BUG #16759: Estimation of the planner is wrong for hash join

Started by PG Bug reporting formover 5 years ago6 messagesbugs

noreply@postgresql.org

over 5 years ago

The following bug has been logged on the website:

Bug reference: 16759
Logged by: Bertrand Guillaumin
Email address: bertrand.guillaumin@gmail.com
PostgreSQL version: 11.8
Operating system: Linux
Description:

The following query estimated number of lines returned is 1 while it should
be around 67 or more :
explain analyze select * from enterprise where parent_enterprise in (select
enterprise_id from enterprise par where global_attribute15 = 'BEL');
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Hash Join (cost=95.60..191.33 rows=1 width=977) (actual time=0.422..0.896
rows=56 loops=1)
Hash Cond: (enterprise.parent_enterprise = par.enterprise_id)
-> Seq Scan on enterprise (cost=0.00..92.87 rows=1087 width=977)
(actual time=0.004..0.095 rows=1087 loops=1)
-> Hash (cost=95.59..95.59 rows=1 width=5) (actual time=0.397..0.398
rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on enterprise par (cost=0.00..95.59 rows=1 width=5)
(actual time=0.251..0.394 rows=1 loops=1)
Filter: (global_attribute15 = 'BEL'::text)
Rows Removed by Filter: 1086

The same issue with a normal join :
explain analyze select * from enterprise ent1, enterprise ent2 where
ent1.parent_enterprise=ent2.enterprise_id and ent2.global_attribute15 =
'BEL';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=95.60..191.33 rows=1 width=1954) (actual time=0.444..0.954
rows=56 loops=1)
Hash Cond: (ent1.parent_enterprise = ent2.enterprise_id)
-> Seq Scan on enterprise ent1 (cost=0.00..92.87 rows=1087 width=977)
(actual time=0.004..0.104 rows=1087 loops=1)
-> Hash (cost=95.59..95.59 rows=1 width=977) (actual time=0.416..0.416
rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on enterprise ent2 (cost=0.00..95.59 rows=1
width=977) (actual time=0.252..0.410 rows=1 loops=1)
Filter: (global_attribute15 = 'BEL'::text)
Rows Removed by Filter: 1086

The statistics for parent_enterprise :
select * from pg_stats where tablename='enterprise' and
attname='parent_enterprise';
schemaname | tablename | attname | inherited | null_frac |
avg_width | n_distinct | most_common_vals
|
most_common_freqs |
histogram_bounds | correlation | most_common_elems |
most_common_elem_freqs | elem_count_histogram
------------+------------+-------------------+-----------+------------+-----------+------------+-----------------------------------------------------+-------------------------------------------------
-------------------------------------------------------------------+-------------------------+-------------+-------------------+------------------------+----------------------
xxxxxxx | enterprise | parent_enterprise | f | 0.00551978 |
4 | 16 | {48,682,6162,6639,6448,46,6630,6796,6553,6812,6854} |
{0.551058,0.184913,0.0818767,0.0772769,0.0515179
,0.0137994,0.00919963,0.00827967,0.00643974,0.00367985,0.00183993} |
{0,6831,6853,6904,6917} | 0.755042 | |
|
(1 row)

The right estimation when using = (not an option here):
explain select * from enterprise where parent_enterprise = (select
enterprise_id from enterprise par where global_attribute15 = 'BEL' limit
1);
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on enterprise (cost=95.59..191.18 rows=68 width=977)
Filter: (parent_enterprise = $0)
InitPlan 1 (returns $0)
-> Seq Scan on enterprise par (cost=0.00..95.59 rows=1 width=5)
Filter: (global_attribute15 = 'BEL'::text)

The wrong estimation leads to issues in bigger queries.

Tom Lane

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #16759: Estimation of the planner is wrong for hash join

PG Bug reporting form <noreply@postgresql.org> writes:

The following query estimated number of lines returned is 1 while it should
be around 67 or more :
explain analyze select * from enterprise where parent_enterprise in (select
enterprise_id from enterprise par where global_attribute15 = 'BEL');

I failed to reproduce this result, which is unsurprising given the
lack of context. Can you provide a self-contained example?

I think that highly accurate estimates are unlikely in this situation,
but it is odd that "parent_enterprise in (select...)" is estimating
fewer rows than "parent_enterprise = something".

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #16759: Estimation of the planner is wrong for hash join

[ please keep the list cc'd ]

Bertrand Guillaumin <bertrand.guillaumin@gmail.com> writes:

I've managed to reproduce the bug but only in the join case not with the in
subquery (it uses hash join semi which works) with the following test case
:
create table work.test_bug_hash as SELECT id, case when id <= 600 then 100
when id <=800 then 200 when id <=875 then 300 when id<=950 then 400
when id<=960 then 500 when id<=970 then 600 when id<=980 then 700 when
id<=990 then 800 when id =991 then 900 when id=992 then 910 when id=993
then 920 when id=994 then 930 when id=995 then 940 when id=996 then 950
when id=997 then 960 when id =998 then 970 when id=999 then 980 else 990
end as parent_id, null as attrib from (select generate_series(1,1000) as
id) alias0;

update work.test_bug_hash set attrib='BEL' where id=300;

analyze work.test_bug_hash;

explain select * from work.test_bug_hash a, work.test_bug_hash b where
a.parent_id=b.id and b.attrib='BEL';

Hmm. Trying this on HEAD, the join and IN forms both estimate rows=1,
which is no doubt because recent versions of eqjoinsel() clamp the
semijoin selectivity estimate to be not more than the plain join
selectivity estimate ... which is logically correct, but in this case
it replaces a somewhat-okay estimate with a not-very-good one.

Anyway, the estimate you're getting from the "a = (select ...)" form
is the responsibility of var_eq_non_const, which has no idea what value
might come out of the sub-select, so it falls back to this logic:

/*
* Search is for a value that we do not know a priori, but we will
* assume it is not NULL. Estimate the selectivity as non-null
* fraction divided by number of distinct values, so that we get a
* result averaged over all possible values whether common or
* uncommon. (Essentially, we are assuming that the not-yet-known
* comparison value is equally likely to be any of the possible
* values, regardless of their frequency in the table. Is that a good
* idea?)
*/

Meanwhile, in the join or semijoin cases, the issue is that
eqjoinsel_inner has an MCV list for the parent_id side, but not
for the id side (because the latter is unique so it has no MCVs).
So it falls back on this logic:

/*
* We do not have MCV lists for both sides. Estimate the join
* selectivity as MIN(1/nd1,1/nd2)*(1-nullfrac1)*(1-nullfrac2). This
* is plausible if we assume that the join operator is strict and the
* non-null values are about equally distributed: a given non-null
* tuple of rel1 will join to either zero or N2*(1-nullfrac2)/nd2 rows
* of rel2, so total join rows are at most
* N1*(1-nullfrac1)*N2*(1-nullfrac2)/nd2 giving a join selectivity of
* not more than (1-nullfrac1)*(1-nullfrac2)/nd2. By the same logic it
* is not more than (1-nullfrac1)*(1-nullfrac2)/nd1, so the expression
* with MIN() is an upper bound. Using the MIN() means we estimate
* from the point of view of the relation with smaller nd (since the
* larger nd is determining the MIN). It is reasonable to assume that
* most tuples in this rel will have join partners, so the bound is
* probably reasonably tight and should be taken as-is.
*
* XXX Can we be smarter if we have an MCV list for just one side? It
* seems that if we assume equal distribution for the other side, we
* end up with the same answer anyway.
*/

In the case at hand, with nd1=18, nd2=1000, we'll come out with a
selectivity of 1/1000 which results in nrows = 1.

Maybe it'd be better to do something else here, but I'm not sure what.
All of these stats-free estimates are just rules of thumb and sometimes
go wrong. Still, the case of one side of the join being unique and
the other not has to be pretty common, so it'd be nice to make it better.

regards, tom lane

Import Notes

Reply to msg id not found: CAC-tRezpoX=Jn7XGJz4KM6=o_3+RKgoH3BQigf9bBbRy7Opzeg@mail.gmail.com

Bertrand Guillaumin

bertrand.guillaumin@gmail.com

over 5 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #16759: Estimation of the planner is wrong for hash join

CC the bug mailing list.

In short the number of distinct values in the calculation should not be the
one in the statistics but be derived from the estimated size of the sample
if constant values are used.

Le mer. 16 déc. 2020 à 16:18, Bertrand Guillaumin <
bertrand.guillaumin@gmail.com> a écrit :

Show quoted text

Hello, so I've made some researches on how does this case is dealt with in
Oracle and found the following document :

https://www.doag.org/formes/pubfiles/6315126/2014-DB-Jonathan_Lewis-Calculating_Join_Selectivity-Manuskript.pdf

The issue is the number of distinct values for nd2, as we know that the
number of distinct values for b.id is going to be 1 (because of filtering
on attrib), not 1000.

In the document they explain the following for calculation of the number
of distinct values on a table in a join.

In a query like this :
select * from t1,t2
where T1.mod_300 = t2.mod_200
and t1.date_1000=<constant>

the selectivity is this :
t1 has 1,000,000 rows (number of rows) call this nr
mod_300 has 300 distinct values (number of distinct) call this nd
date_1000 = {constant} returns a sample of 1,000 rows call this s
The expected number of distinct values for mod_300 in the sample will be:
nd * (1 - power(1 - s/nr, nr/nd)) In our case, 300 * (1 - power(1 -
1000/1000000, 1000000/1000)) = 289.3156

If we apply the same reasoning to the query I posted we would get
SELECT * FROM T1,T2
where T1.id = T2.parent_id
and T1.attrib='BEL'
we get :
nr =1000
nd = 1000
s = 1
so expected number of distinct values 1000* (1 - (1-1/1000)^(1000/1000)) =
1000*(1-0,999)=1
With nd2 corrected like this the result should be better, as the
calculated selectivity is then 1*1000*(1/18)=56 far more close to the
reality than before

Hope this can help,
best regards,

Le mer. 2 déc. 2020 à 23:33, Tom Lane <tgl@sss.pgh.pa.us> a écrit :

[ please keep the list cc'd ]

Bertrand Guillaumin <bertrand.guillaumin@gmail.com> writes:

I've managed to reproduce the bug but only in the join case not with

the in

subquery (it uses hash join semi which works) with the following test

case

:
create table work.test_bug_hash as SELECT id, case when id <= 600 then

100

when id <=800 then 200 when id <=875 then 300 when id<=950 then 400
when id<=960 then 500 when id<=970 then 600 when id<=980 then 700

when

id<=990 then 800 when id =991 then 900 when id=992 then 910 when id=993
then 920 when id=994 then 930 when id=995 then 940 when id=996 then 950
when id=997 then 960 when id =998 then 970 when id=999 then 980 else 990
end as parent_id, null as attrib from (select generate_series(1,1000) as
id) alias0;

update work.test_bug_hash set attrib='BEL' where id=300;

analyze work.test_bug_hash;

explain select * from work.test_bug_hash a, work.test_bug_hash b where
a.parent_id=b.id and b.attrib='BEL';

Hmm. Trying this on HEAD, the join and IN forms both estimate rows=1,
which is no doubt because recent versions of eqjoinsel() clamp the
semijoin selectivity estimate to be not more than the plain join
selectivity estimate ... which is logically correct, but in this case
it replaces a somewhat-okay estimate with a not-very-good one.

Anyway, the estimate you're getting from the "a = (select ...)" form
is the responsibility of var_eq_non_const, which has no idea what value
might come out of the sub-select, so it falls back to this logic:

/*
* Search is for a value that we do not know a priori, but we will
* assume it is not NULL. Estimate the selectivity as non-null
* fraction divided by number of distinct values, so that we get a
* result averaged over all possible values whether common or
* uncommon. (Essentially, we are assuming that the not-yet-known
* comparison value is equally likely to be any of the possible
* values, regardless of their frequency in the table. Is that a
good
* idea?)
*/

Meanwhile, in the join or semijoin cases, the issue is that
eqjoinsel_inner has an MCV list for the parent_id side, but not
for the id side (because the latter is unique so it has no MCVs).
So it falls back on this logic:

/*
* We do not have MCV lists for both sides. Estimate the join
* selectivity as MIN(1/nd1,1/nd2)*(1-nullfrac1)*(1-nullfrac2).
This
* is plausible if we assume that the join operator is strict and
the
* non-null values are about equally distributed: a given non-null
* tuple of rel1 will join to either zero or N2*(1-nullfrac2)/nd2
rows
* of rel2, so total join rows are at most
* N1*(1-nullfrac1)*N2*(1-nullfrac2)/nd2 giving a join
selectivity of
* not more than (1-nullfrac1)*(1-nullfrac2)/nd2. By the same
logic it
* is not more than (1-nullfrac1)*(1-nullfrac2)/nd1, so the
expression
* with MIN() is an upper bound. Using the MIN() means we
estimate
* from the point of view of the relation with smaller nd (since
the
* larger nd is determining the MIN). It is reasonable to assume
that
* most tuples in this rel will have join partners, so the bound
is
* probably reasonably tight and should be taken as-is.
*
* XXX Can we be smarter if we have an MCV list for just one
side? It
* seems that if we assume equal distribution for the other side,
we
* end up with the same answer anyway.
*/

In the case at hand, with nd1=18, nd2=1000, we'll come out with a
selectivity of 1/1000 which results in nrows = 1.

Maybe it'd be better to do something else here, but I'm not sure what.
All of these stats-free estimates are just rules of thumb and sometimes
go wrong. Still, the case of one side of the join being unique and
the other not has to be pretty common, so it'd be nice to make it better.

regards, tom lane

Import Notes

Reply to msg id not found: CAC-tRewx4SOyU6jrGb5KpWq-8Njre6NP-54Y-7+3H31B6W2FLg@mail.gmail.com

Bertrand Guillaumin

bertrand.guillaumin@gmail.com

over 5 years ago

In reply to: Bertrand Guillaumin (#4)

Re: BUG #16759: Estimation of the planner is wrong for hash join

Hello,
I think I just made a test that shows that even with MCV on both sides the
estimated selectivity can be pretty wrong.
test=# create table test_bug_hash2 as SELECT mod(id,500) as id , case when
id<=500 then 1 else 2 end parent_id, null::text as attrib from (select
generate_series(1,1000) as id) alias0;
SELECT 1000
test=# update test_bug_hash2 set attrib='BEL' where id=2;
UPDATE 2
test=# analyze test_bug_hash2;
ANALYZE
test=# explain select * from test_bug_hash2 a, test_bug_hash2 b where
a.parent_id=b.id and b.attrib='BEL';
QUERY PLAN
------------------------------------------------------------------------------
Hash Join (cost=17.52..37.56 rows=4 width=24)
Hash Cond: (a.parent_id = b.id)
-> Seq Scan on test_bug_hash2 a (cost=0.00..15.00 rows=1000 width=12)
-> Hash (cost=17.50..17.50 rows=2 width=12)
-> Seq Scan on test_bug_hash2 b (cost=0.00..17.50 rows=2
width=12)
Filter: (attrib = 'BEL'::text)

test=# select count(*) from test_bug_hash2 a, test_bug_hash2 b where
a.parent_id=b.id and b.attrib='BEL';
count
-------
1000

I won't copy paste the pg_stats lines but most_common_vals and
most_common_freq have values for all three columns.

I'm not a programmer but I've looked into the code of the planner a little
bit and it seems you try to estimate the selectivity of a join in itself
without any regards to the filters that can be applied on any side of the
join ( if I understood correctly).
I think that it's ultimately where the problem lies, if the filter is not
too important the selectivity stays more or less the same but with filters
like the one in this query the selectivity of the join can change a lot so
in the end you get estimations which can be totally wrong.

Le mer. 16 déc. 2020 à 16:27, Bertrand Guillaumin <
bertrand.guillaumin@gmail.com> a écrit :

Show quoted text

CC the bug mailing list.

In short the number of distinct values in the calculation should not be
the one in the statistics but be derived from the estimated size of the
sample if constant values are used.

Le mer. 16 déc. 2020 à 16:18, Bertrand Guillaumin <
bertrand.guillaumin@gmail.com> a écrit :

Hello, so I've made some researches on how does this case is dealt with
in Oracle and found the following document :

https://www.doag.org/formes/pubfiles/6315126/2014-DB-Jonathan_Lewis-Calculating_Join_Selectivity-Manuskript.pdf

The issue is the number of distinct values for nd2, as we know that the
number of distinct values for b.id is going to be 1 (because of
filtering on attrib), not 1000.

In the document they explain the following for calculation of the number
of distinct values on a table in a join.

In a query like this :
select * from t1,t2
where T1.mod_300 = t2.mod_200
and t1.date_1000=<constant>

the selectivity is this :
t1 has 1,000,000 rows (number of rows) call this nr
mod_300 has 300 distinct values (number of distinct) call this nd
date_1000 = {constant} returns a sample of 1,000 rows call this s
The expected number of distinct values for mod_300 in the sample will be:
nd * (1 - power(1 - s/nr, nr/nd)) In our case, 300 * (1 - power(1 -
1000/1000000, 1000000/1000)) = 289.3156

If we apply the same reasoning to the query I posted we would get
SELECT * FROM T1,T2
where T1.id = T2.parent_id
and T1.attrib='BEL'
we get :
nr =1000
nd = 1000
s = 1
so expected number of distinct values 1000* (1 - (1-1/1000)^(1000/1000))
= 1000*(1-0,999)=1
With nd2 corrected like this the result should be better, as the
calculated selectivity is then 1*1000*(1/18)=56 far more close to the
reality than before

Hope this can help,
best regards,

Le mer. 2 déc. 2020 à 23:33, Tom Lane <tgl@sss.pgh.pa.us> a écrit :

[ please keep the list cc'd ]

Bertrand Guillaumin <bertrand.guillaumin@gmail.com> writes:

I've managed to reproduce the bug but only in the join case not with

the in

subquery (it uses hash join semi which works) with the following test

case

:
create table work.test_bug_hash as SELECT id, case when id <= 600 then

100

when id <=800 then 200 when id <=875 then 300 when id<=950 then 400
when id<=960 then 500 when id<=970 then 600 when id<=980 then 700

when

id<=990 then 800 when id =991 then 900 when id=992 then 910 when id=993
then 920 when id=994 then 930 when id=995 then 940 when id=996 then 950
when id=997 then 960 when id =998 then 970 when id=999 then 980 else

990

end as parent_id, null as attrib from (select generate_series(1,1000)

as

id) alias0;

update work.test_bug_hash set attrib='BEL' where id=300;

analyze work.test_bug_hash;

explain select * from work.test_bug_hash a, work.test_bug_hash b where
a.parent_id=b.id and b.attrib='BEL';

Hmm. Trying this on HEAD, the join and IN forms both estimate rows=1,
which is no doubt because recent versions of eqjoinsel() clamp the
semijoin selectivity estimate to be not more than the plain join
selectivity estimate ... which is logically correct, but in this case
it replaces a somewhat-okay estimate with a not-very-good one.

Anyway, the estimate you're getting from the "a = (select ...)" form
is the responsibility of var_eq_non_const, which has no idea what value
might come out of the sub-select, so it falls back to this logic:

/*
* Search is for a value that we do not know a priori, but we
will
* assume it is not NULL. Estimate the selectivity as non-null
* fraction divided by number of distinct values, so that we get
a
* result averaged over all possible values whether common or
* uncommon. (Essentially, we are assuming that the
not-yet-known
* comparison value is equally likely to be any of the possible
* values, regardless of their frequency in the table. Is that
a good
* idea?)
*/

Meanwhile, in the join or semijoin cases, the issue is that
eqjoinsel_inner has an MCV list for the parent_id side, but not
for the id side (because the latter is unique so it has no MCVs).
So it falls back on this logic:

/*
* We do not have MCV lists for both sides. Estimate the join
* selectivity as MIN(1/nd1,1/nd2)*(1-nullfrac1)*(1-nullfrac2).
This
* is plausible if we assume that the join operator is strict
and the
* non-null values are about equally distributed: a given
non-null
* tuple of rel1 will join to either zero or
N2*(1-nullfrac2)/nd2 rows
* of rel2, so total join rows are at most
* N1*(1-nullfrac1)*N2*(1-nullfrac2)/nd2 giving a join
selectivity of
* not more than (1-nullfrac1)*(1-nullfrac2)/nd2. By the same
logic it
* is not more than (1-nullfrac1)*(1-nullfrac2)/nd1, so the
expression
* with MIN() is an upper bound. Using the MIN() means we
estimate
* from the point of view of the relation with smaller nd (since
the
* larger nd is determining the MIN). It is reasonable to
assume that
* most tuples in this rel will have join partners, so the bound
is
* probably reasonably tight and should be taken as-is.
*
* XXX Can we be smarter if we have an MCV list for just one
side? It
* seems that if we assume equal distribution for the other
side, we
* end up with the same answer anyway.
*/

In the case at hand, with nd1=18, nd2=1000, we'll come out with a
selectivity of 1/1000 which results in nrows = 1.

Maybe it'd be better to do something else here, but I'm not sure what.
All of these stats-free estimates are just rules of thumb and sometimes
go wrong. Still, the case of one side of the join being unique and
the other not has to be pretty common, so it'd be nice to make it better.

regards, tom lane

Tomas Vondra

tomas.vondra@2ndquadrant.com

over 5 years ago

In reply to: Bertrand Guillaumin (#5)

Re: BUG #16759: Estimation of the planner is wrong for hash join

On 12/17/20 6:36 PM, Bertrand Guillaumin wrote:

Hello,
I think I just made a test that shows that even with MCV on both sides
the estimated selectivity can be pretty wrong.
test=# create table test_bug_hash2 as SELECT mod(id,500) as id , case
when id<=500 then 1 else 2 end parent_id, null::text as attrib from
(select generate_series(1,1000) as id) alias0;
SELECT 1000
test=# update test_bug_hash2 set attrib='BEL' where id=2;
UPDATE 2
test=# analyze test_bug_hash2;
ANALYZE
test=# explain select * from test_bug_hash2 a, test_bug_hash2 b where
a.parent_id=b.id <http://b.id> and b.attrib='BEL';
QUERY PLAN
------------------------------------------------------------------------------
Hash Join (cost=17.52..37.56 rows=4 width=24)
Hash Cond: (a.parent_id = b.id <http://b.id>)
-> Seq Scan on test_bug_hash2 a (cost=0.00..15.00 rows=1000 width=12)
-> Hash (cost=17.50..17.50 rows=2 width=12)
-> Seq Scan on test_bug_hash2 b (cost=0.00..17.50 rows=2
width=12)
Filter: (attrib = 'BEL'::text)

test=# select count(*) from test_bug_hash2 a, test_bug_hash2 b where
a.parent_id=b.id <http://b.id> and b.attrib='BEL';
count
-------
1000

I won't copy paste the pg_stats lines but most_common_vals and
most_common_freq have values for all three columns.

I'm not a programmer but I've looked into the code of the planner a
little bit and it seems you try to estimate the selectivity of a join in
itself without any regards to the filters that can be applied on any
side of the join ( if I understood correctly).
I think that it's ultimately where the problem lies, if the filter is
not too important the selectivity stays more or less the same but with
filters like the one in this query the selectivity of the join can
change a lot so in the end you get estimations which can be totally wrong.

Right. The problem is that the two columns are correlated, thanks to how
you set the attrib value only for id=2, but the join estimation code is
oblivious to that.

Perhaps the multi-column/extended stats might allow us to improve this,
but the code has not been written yet.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company