7.0 like selectivity

Started by Hiroshi Inoue in April 2000 · 5 messages
#1 Hiroshi Inoue
Inoue@tpf.co.jp

Hi all,

There was a bug(??) report from Akira Imagawa in Japan about
LIKE optimization in 7.0 beta3.
It may be difficult to solve.

Let t_hoge be a table like
{
hoge_cd int4 primary key,
shimeinn text,
tel text,
..
}
with index hoge_ix2 on t_hoge(shimeinn)
and index hoge_ix3 on t_hoge(tel).

There are 348236 rows in t_hoge.

For the query
select hoge_cd,shimeinn,tel
from t_hoge
where shimeinn like 'imag%'
and tel like '012%'
order by hoge_cd
limit 100;

64 rows returned immediately.

And for the query
select hoge_cd,shimeinn,tel
from t_hoge
where shimeinn like 'imag%'
and tel like '012-3%'
order by hoge_cd
limit 100;

24 rows returned after waiting 8 minutes.

I got the following output from him.
explain select * from t_hoge where tel like '012%';
Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.23 rows=1981 width=676)

explain select * from t_hoge where tel like '012-3%';
Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.00 rows=1981 width=676)

In fact, the actual count(*) values are 342323 and 114741 respectively.

The first problem is that the estimated cost is far too low.
It seems that the index selectivity of '012-3%' equals the index
selectivity of '012%' divided by (256*256), right?
If so, does that give a more realistic estimate than before?
It doesn't correspond to the rows estimate either.
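
To make the arithmetic behind my guess concrete, here is a small
standalone sketch (my own toy model, not the backend's code). It reads
a string as a base-256 fraction, so each extra prefix byte narrows the
matching range by a factor of 256:

#include <stdio.h>

/*
 * Toy model only, NOT the backend code: a prefix of n fixed bytes
 * selects a slice of width 256^-n of the whole key space, so going
 * from '012' (3 bytes) to '012-3' (5 bytes) shrinks the estimated
 * slice by exactly 256*256.
 */
static double
prefix_slice_width(int prefix_bytes)
{
    double width = 1.0;

    while (prefix_bytes-- > 0)
        width /= 256.0;
    return width;
}

int
main(void)
{
    double w_012 = prefix_slice_width(3);     /* tel LIKE '012%'   */
    double w_0123 = prefix_slice_width(5);    /* tel LIKE '012-3%' */

    printf("slice for '012%%'  : %g\n", w_012);
    printf("slice for '012-3%%': %g\n", w_0123);
    printf("ratio             : %g\n", w_012 / w_0123);  /* 65536 = 256*256 */
    return 0;
}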

In reality, *shimeinn like 'imag%'* is much more restrictive
than *tel like '012-3%'*, but I couldn't think of a way to
foresee which one is more restrictive. Now I doubt whether we
keep enough information to estimate LIKE selectivity correctly
at all. That's the second problem.

Comments?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#1)
Re: 7.0 like selectivity

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

> For the query
> select hoge_cd,shimeinn,tel
> from t_hoge
> where shimeinn like 'imag%'
> and tel like '012%'
> order by hoge_cd
> limit 100;
>
> 64 rows returned immediately.
>
> And for the query
> select hoge_cd,shimeinn,tel
> from t_hoge
> where shimeinn like 'imag%'
> and tel like '012-3%'
> order by hoge_cd
> limit 100;
>
> 24 rows returned after waiting 8 minutes.

So what were the plans for these two queries? Also, has this table been
"vacuum analyzed"?

> I got the following output from him.
> explain select * from t_hoge where tel like '012%';
> Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.23 rows=1981 width=676)
>
> explain select * from t_hoge where tel like '012-3%';
> Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.00 rows=1981 width=676)
>
> In fact, the actual count(*) values are 342323 and 114741 respectively.
>
> The first problem is that the estimated cost is far too low.
> It seems that the index selectivity of '012-3%' equals the index
> selectivity of '012%' divided by (256*256), right?
> If so, does that give a more realistic estimate than before?
> It doesn't correspond to the rows estimate either.

The rows number is fairly bogus (because it's coming from application of
eqsel, which is not the right thing; perhaps someday LIKE should have
its very own selectivity estimation function). But the cost estimate
is driven by the estimated selectivity of
tel >= '012-3' AND tel < '012-4'
and it would be nice to think that we have some handle on that.

It could be that the thing is getting fooled by a very non-uniform
distribution of telephone numbers. You indicate that most of the
numbers in the DB begin with '012', but if there are a small number
that begin with digits as high as 9, the selectivity estimates would
be pretty bad.
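
To illustrate the failure mode, here is a toy version of that kind of
linear interpolation (the endpoints are pure invention, since we don't
know the real min and max yet, and this is not the backend code):

#include <stdio.h>

/*
 * Toy illustration only: map the first few bytes of a string onto [0,1)
 * as a base-256 fraction, then interpolate a range linearly between an
 * assumed column min and max, i.e. the uniform-distribution assumption.
 */
static double
str_to_fraction(const char *s)
{
    double frac = 0.0;
    double denom = 256.0;
    int    i;

    for (i = 0; i < 8 && s[i] != '\0'; i++)
    {
        frac += (unsigned char) s[i] / denom;
        denom *= 256.0;
    }
    return frac;
}

int
main(void)
{
    /* hypothetical bounds: most values start '012', a few reach '9' */
    double lo = str_to_fraction("000-0000");
    double hi = str_to_fraction("999-9999");
    double a = str_to_fraction("012-3");
    double b = str_to_fraction("012-4");

    /* selectivity = band width / (max - min) under the uniform assumption */
    printf("estimated selectivity of tel >= '012-3' AND tel < '012-4': %g\n",
           (b - a) / (hi - lo));
    return 0;
}

Even if a third of the rows really fall in that band, as the reported
count(*) suggests, the uniform assumption says essentially none do.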

> In reality, *shimeinn like 'imag%'* is much more restrictive
> than *tel like '012-3%'*, but I couldn't think of a way to
> foresee which one is more restrictive. Now I doubt whether we
> keep enough information to estimate LIKE selectivity correctly
> at all.

No, we don't, because we only keep track of the min and max values
in each column and assume that the data is uniformly distributed
between those limits. Perhaps someday we could keep a histogram
instead --- but VACUUM ANALYZE would get a lot slower and more
complicated ...
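
Just to sketch the histogram idea (nothing of the sort exists in the
tree today, and the bin boundaries below are invented, loosely shaped
to match the counts Hiroshi quoted):

#include <stdio.h>
#include <string.h>

/*
 * Sketch only, not backend code.  With N equal-frequency bins, the
 * selectivity of lo <= col < hi is roughly the number of bins lying
 * wholly inside the range divided by N; a real implementation would
 * also interpolate the two partial end bins.
 */
static double
histogram_selectivity(const char **bounds, int nbounds,
                      const char *lo, const char *hi)
{
    int nbins = nbounds - 1;
    int covered = 0;
    int i;

    for (i = 0; i < nbins; i++)
    {
        if (strcmp(bounds[i], lo) >= 0 && strcmp(bounds[i + 1], hi) <= 0)
            covered++;          /* this bin lies wholly inside [lo, hi) */
    }
    return (double) covered / nbins;
}

int
main(void)
{
    /*
     * Invented decile boundaries, shaped so that most values start with
     * '012' and roughly a third fall in the '012-3' band, as in the
     * counts Hiroshi reported.
     */
    const char *bounds[] = {
        "000-0000", "012-0120", "012-1230", "012-2340", "012-3111",
        "012-3456", "012-3789", "012-4567", "012-6789", "012-8901",
        "999-9999"
    };

    printf("histogram estimate for tel >= '012-3' AND tel < '012-4': %g\n",
           histogram_selectivity(bounds, 11, "012-3", "012-4"));
    /* prints 0.2; interpolating the partial end bins would get closer
     * to the true fraction of about one third */
    return 0;
}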

regards, tom lane

#3 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#2)
RE: 7.0 like selectivity

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

For the query
select hoge_cd,shimeinn,tel
from t_hoge
where shimeinn like 'imag%'
and tel like '012%'
order by hoge_cd
limit 100;

64 rows returned immediately.

And for the query
select hoge_cd,shimeinn,tel
from t_hoge
where shimeinn like 'imag%'
and tel like '012-3%'
order by hoge_cd
limit 100;

24 rows returned after waiting 8 minutes.

So what were the plans for these two queries?

OK, I will ask him to send them.

> Also, has this table been
> "vacuum analyzed"?

Yes; another problem of his was solved by "vacuum analyze".

>> I got the following output from him.
>> explain select * from t_hoge where tel like '012%';
>> Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.23 rows=1981 width=676)
>>
>> explain select * from t_hoge where tel like '012-3%';
>> Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.00 rows=1981 width=676)
>>
>> In fact, the actual count(*) values are 342323 and 114741 respectively.
>>
>> The first problem is that the estimated cost is far too low.
>> It seems that the index selectivity of '012-3%' equals the index
>> selectivity of '012%' divided by (256*256), right?
>> If so, does that give a more realistic estimate than before?
>> It doesn't correspond to the rows estimate either.

> The rows number is fairly bogus (because it's coming from application of
> eqsel, which is not the right thing; perhaps someday LIKE should have
> its very own selectivity estimation function). But the cost estimate
> is driven by the estimated selectivity of
> tel >= '012-3' AND tel < '012-4'
> and it would be nice to think that we have some handle on that.

Shouldn't the rows estimate and the cost estimate correspond in this case?
For example, the following query would show the same rows estimate:
select * from t_hoge where tel = '012';
yet its cost estimate is probably > 1000.
Is it good that the cost estimate for "tel like '012%'" is much smaller
than that for "tel = '012'"?

PostgreSQL's selectivity doesn't mean a pure probability.
For example, for the int4 type the pure '=' probability is pow(2,-32).
Is the current cost estimate for "tel >= val1 and tel < val2" effective
for a narrow range (val1, val2)? The range ('012-3', '012-4') is
extremely narrow in the vast char(5) space.
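
To put a number on "extremely narrow" (just arithmetic on my part, not
anything the planner does):

#include <stdio.h>
#include <math.h>

/*
 * Just arithmetic: reading a string as a base-256 fraction, the range
 * ('012-3', '012-4') differs only in the fifth byte, so its width is
 * 256^-5 of the whole space, even narrower than the "pure" int4 '='
 * probability of 2^-32, yet it actually holds about a third of the
 * rows in Akira's table.
 */
int
main(void)
{
    printf("width of ('012-3','012-4') in the uniform model: %g\n",
           pow(256.0, -5));
    printf("pure int4 '=' probability                      : %g\n",
           pow(2.0, -32));
    printf("actual fraction of rows matching '012-3%%'      : %g\n",
           114741.0 / 348236.0);
    return 0;
}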

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

#4 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#3)
RE: 7.0 like selectivity

I've gotten the plans from Akira Imagawa.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org] On Behalf Of Hiroshi Inoue
> Sent: Friday, April 07, 2000 8:39 AM
> To: Tom Lane
> Cc: pgsql-hackers
> Subject: RE: [HACKERS] 7.0 like selectivity
>
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>
> For the query
> select hoge_cd,shimeinn,tel
> from t_hoge
> where shimeinn like 'imag%'
> and tel like '012%'
> order by hoge_cd
> limit 100;
>
> 64 rows returned immediately.

Sort (cost=0.01..0.01 rows=1 width=28)
  -> Index Scan using t_hoge_ix1 on t_hoge (cost=0.00..0.00 rows=1 width=28)

> And for the query
> select hoge_cd,shimeinn,tel
> from t_hoge
> where shimeinn like 'imag%'
> and tel like '012-3%'
> order by hoge_cd
> limit 100;
>
> 24 rows returned after waiting 8 minutes.

Sort (cost=0.01..0.01 rows=1 width=28)
  -> Index Scan using t_hoge_ix3 on t_hoge (cost=0.00..0.00 rows=1 width=28)

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#4)
Re: 7.0 like selectivity

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

> I've gotten the plans from Akira Imagawa.

OK ... I assume t_hoge_ix1 is on shimeinn and t_hoge_ix3 is on tel ...

Can we find out the min and max values of both the
shimeinn and tel columns?

regards, tom lane