Selectivity of "=" (Re: [HACKERS] Index not used on simple select)
I wrote:
(Note to hackers: Ole sent me a 1000-row test case off list.)
oletest=> explain select * from av_parts where partnumber = '123456';
NOTICE: QUERY PLAN:
Index Scan using av_parts_partnumber_index on av_parts (cost=2.04 rows=1 width=124)
EXPLAIN
oletest=> explain select * from av_parts where nsn = '123456';
NOTICE: QUERY PLAN:
Seq Scan on av_parts (cost=48.00 rows=995 width=124)
It looks like the highly skewed distribution of nsn values (what you
sent me had 997 '' entries, only 3 non-empty strings) is confusing the
selectivity estimation code somehow, such that the system thinks that
the query is going to match most of the rows. Notice it is estimating
995 returned rows for the nsn select! Under these circumstances it will
prefer a sequential scan, since the more-expensive-per-tuple index scan
doesn't look like it will be able to avoid reading most of the table.
That logic is OK, it's the 0.995 selectivity estimate that's wrong...
It turns out that the selectivity estimate for an "=" comparison is just
the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
roughly defined as the frequency of the most common value in the column.
(I took statistics too long ago to recall the exact definition.)
Anyway, given that the test data Ole sent me contains nearly all ''
entries, I'd say that the 0.995 value is about right for disbursion.
Indeed, if one were to do a "select * from av_parts where nsn = ''",
then sequential scan would be the most efficient way to do that.
The system has no clue that that's not really something you'd do much.
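To make the arithmetic concrete, here is a rough sketch (Python, not the actual backend code, and using the naive "frequency of the most common value" definition rather than the real attdisbursion formula) of how that statistic turns into the row estimate on Ole's data:

```python
# Sketch of the "=" selectivity estimate on Ole's test data:
# 997 rows with nsn = '' and 3 other values (hypothetical non-empty strings).
from collections import Counter

rows = [''] * 997 + ['111111', '222222', '333333']

def disbursion_estimate(values):
    """Approximate the disbursion as the frequency of the most common value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)

selectivity = disbursion_estimate(rows)          # 0.997 with this naive definition
estimated_rows = round(selectivity * len(rows))  # what EXPLAIN reports as "rows"

print(selectivity, estimated_rows)
```

The naive definition gives 0.997 here; the real statistic computed by VACUUM ANALYZE came out near 0.995, which is close enough to show why the planner expects nearly every row to match.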
By using this estimate, the system is effectively assuming that the
frequency of occurrence of values in the table corresponds to the
probability that you will be searching for each particular value.
So, the selectivity that a search for the most common value would
have is a reasonable estimate for the selectivity of a search for any
value. That's a bogus assumption in this case --- but it's hard to
justify making any other assumption in general.
My inclination is to hack up eqsel() to never return a selectivity
estimate larger than, say, 0.5, even when the measured disbursion
is more. I am not sure that this is a good idea, however. Comments?
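To see the proposed change in miniature, here is a sketch (a Python stand-in with hypothetical names; the real eqsel() is C code in the backend) of what capping the estimate would do:

```python
# Hypothetical sketch of the proposed ceiling on eqsel()'s result.
# MAX_EQ_SELECTIVITY is not an actual backend constant, just the 0.5
# value suggested above.
MAX_EQ_SELECTIVITY = 0.5

def capped_eq_selectivity(disbursion):
    """Never return an "=" selectivity larger than the ceiling."""
    return min(disbursion, MAX_EQ_SELECTIVITY)

print(capped_eq_selectivity(0.995))  # skewed column: capped to 0.5
print(capped_eq_selectivity(0.01))   # well-distributed column: unchanged
```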
regards, tom lane
Tom Lane wrote:
It looks like the highly skewed distribution of nsn values (what you
sent me had 997 '' entries, only 3 non-empty strings) is confusing the
selectivity estimation code somehow, such that the system thinks that
the query is going to match most of the rows. Notice it is estimating
995 returned rows for the nsn select! Under these circumstances it will
prefer a sequential scan, since the more-expensive-per-tuple index scan
doesn't look like it will be able to avoid reading most of the table.
That logic is OK, it's the 0.995 selectivity estimate that's wrong...
It turns out that the selectivity estimate for an "=" comparison is just
the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
roughly defined as the frequency of the most common value in the column.
(I took statistics too long ago to recall the exact definition.)
Anyway, given that the test data Ole sent me contains nearly all ''
entries, I'd say that the 0.995 value is about right for disbursion.
Yes, you are correct, though it does look at potentially one or two
other unique values, depending on the distribution. It computes
disbursion exactly for unique columns and for columns that contain
only two distinct values, and it factors NULLs in. In other cases,
the disbursion is imperfect, but pretty decent.
My inclination is to hack up eqsel() to never return a selectivity
estimate larger than, say, 0.5, even when the measured disbursion
is more. I am not sure that this is a good idea, however. Comments?
I would discourage this. I can imagine many cases where >0.5
selectivities would be valid, e.g. state = "PA".
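A quick back-of-the-envelope check of that objection (entirely hypothetical data, just to illustrate the point):

```python
# Hypothetical illustration: if 60% of a table really is state = 'PA',
# the true selectivity of that "=" comparison is 0.6, and a hard 0.5
# cap would make the planner underestimate the scan.
states = ['PA'] * 60 + ['NJ'] * 25 + ['DE'] * 15

true_selectivity = states.count('PA') / len(states)  # 0.6
capped = min(true_selectivity, 0.5)                  # what a 0.5 cap would report

print(true_selectivity, capped)
```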
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
It turns out that the selectivity estimate for an "=" comparison is just
the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
roughly defined as the frequency of the most common value in the column.
(I took statistics too long ago to recall the exact definition.)
Anyway, given that the test data Ole sent me contains nearly all ''
entries, I'd say that the 0.995 value is about right for disbursion.
Indeed, if one were to do a "select * from av_parts where nsn = ''",
then sequential scan would be the most efficient way to do that.
The system has no clue that that's not really something you'd do much.
Does the system currently index NULLs as well?
I suspect supporting partial indexes (initially just non-NULLs) would
let us do much better, and also use indexes intelligently for
mostly-NULL columns.
Perhaps a line like
* Add partial index support
would fit in TODO
-----------------
Hannu
Yes, I think we index nulls. What are partial indexes?
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
It turns out that the selectivity estimate for an "=" comparison is
just
the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
roughly defined as the frequency of the most common value in the column.
(I took statistics too long ago to recall the exact definition.)
Anyway, given that the test data Ole sent me contains nearly all ''
entries, I'd say that the 0.995 value is about right for disbursion.
Indeed, if one were to do a "select * from av_parts where nsn = ''",
then sequential scan would be the most efficient way to do that.
The system has no clue that that's not really something you'd do much.
Does the system currently index NULLs as well?
I suspect supporting partial indexes (initially just non-NULLs) would
let us do much better, and also use indexes intelligently for
mostly-NULL columns.
Perhaps a line like
* Add partial index support
would fit in TODO
-----------------
Hannu
Yes, I think we index nulls. What are partial indexes?
http://www.PostgreSQL.ORG/docs/postgres/partial-index.htm
Postgres had support for this at one time, and probably still does.
- Thomas
--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California
Yes, I think we index nulls. What are partial indexes?
http://www.PostgreSQL.ORG/docs/postgres/partial-index.htm
Postgres had support for this at one time, and probably still does.
Wow, that's really nice writing. I think Tom Lane and I ripped that
stuff out of the optimizer. (Only kidding.)
Not sure if any of it works.
--
Bruce Momjian | http://www.op.net/~candle
maillist@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026