Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

Started by Jesper Krogh · almost 15 years ago · 5 messages
#1Jesper Krogh
jesper@krogh.cc
1 attachment(s)

Hi All.

The NULL element always surprises me in unpleasant ways; my brain simply
can't really grasp the logic, so please let me know if this is
one of the cases where I should just put way more effort into fixing
that instead.

I have a table with a "null_frac" of 0.5, and I have verified that a where
clause that evaluates to NULL doesn't include the row in the result:

testdb=# select id from testtable where null @@ to_tsquery('testterm80');
id
----
(0 rows)

Then I'd expect the null_frac to be taken into account when computing the
estimates for the query:

testdb=# explain select id from testtable where fts @@
to_tsquery('testterm80');
QUERY PLAN
---------------------------------------------------------------
Seq Scan on testtable (cost=0.00..1985.03 rows=1966 width=4)
Filter: (fts @@ to_tsquery('testterm80'::text))
(2 rows)

Whereas it actually is taken into account if I explicitly add the "fts is
not null" clause to the query:

testdb=# explain select id from testtable where fts @@
to_tsquery('testterm80') and fts is not null;
QUERY PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on testtable (cost=130.34..1735.19 rows=983 width=4)
Recheck Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS
NOT NULL))
-> Bitmap Index Scan on testtable_fts_idx (cost=0.00..130.09
rows=983 width=0)
Index Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts
IS NOT NULL))
(4 rows)

When something that evaluates to NULL isn't included in the result,
shouldn't the query planner then take the null_frac into account when
computing the estimate?

Trying to do the same thing using integers and the < operator seems to
take the null_frac into account.
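
A quick sanity check of those two estimates (illustrative arithmetic in
Python, using the numbers from the EXPLAIN output above; the 0.1966 match
selectivity is inferred from the plans, not read out of the planner):

```python
# Back-of-the-envelope check of the two EXPLAIN estimates (not planner code).
total_rows = 10002   # estimated total rows in testtable
null_frac = 0.5      # half of the fts values were set to NULL
match_sel = 0.1966   # match selectivity apparently derived for the @@ clause

# First plan: null_frac ignored, selectivity applied to all rows.
est_ignoring_nulls = round(total_rows * match_sel)                       # 1966

# Second plan (with "fts is not null"): only non-null rows can match.
est_respecting_nulls = round(total_rows * (1 - null_frac) * match_sel)   # 983

print(est_ignoring_nulls, est_respecting_nulls)
```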

The snippet below reproduces the dataset.

create table testtable (id serial primary key, document text, fts tsvector);
create index on testtable using gist(fts);

CREATE OR REPLACE FUNCTION public.filltable(rows integer)
 RETURNS boolean
 LANGUAGE plpgsql
AS $function$
DECLARE
    count integer;
BEGIN
    count := 0;
    LOOP
        EXIT WHEN count = rows;
        count := count + 1;
        -- each document is a space-separated run of terms
        -- testterm1 .. testtermN, with N drawn uniformly from 0..99
        insert into testtable(document, fts)
        select document, to_tsvector('english', document)
        from (select string_agg(concat, ' ') as document
              from (select concat('testterm' ||
                       generate_series(1, floor(random()*100)::integer))) as foo
             ) as bar;
    END LOOP;
    RETURN TRUE;
END;
$function$;

select filltable(10000);
testdb=# update testtable set fts = null where id % 2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO: analyzing "public.testtable"
INFO: "testtable": scanned 1835 of 1835 pages, containing 10002 live
rows and 5001 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# select null_frac from pg_stats where attname = 'fts';
null_frac
-----------
0.5
(1 row)

... trying with integers:

testdb=# ALTER TABLE testtable add column testint integer;
ALTER TABLE
testdb=# update testtable set testint = floor(random()*100);
UPDATE 10002
testdb=# ANALYZE verbose testtable;
INFO: analyzing "public.testtable"
INFO: "testtable": scanned 2186 of 2186 pages, containing 10002 live
rows and 10002 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# update testtable set testint = null where id %2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO: analyzing "public.testtable"
INFO: "testtable": scanned 2282 of 2282 pages, containing 10002 live
rows and 13335 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# explain select id from testtable where testint = 50;
QUERY PLAN
-------------------------------------------------------------
Seq Scan on testtable (cost=0.00..2407.03 rows=64 width=4)
Filter: (testint = 50)
(2 rows)

testdb=# explain select id from testtable where testint = 1;
QUERY PLAN
-------------------------------------------------------------
Seq Scan on testtable (cost=0.00..2407.03 rows=48 width=4)
Filter: (testint = 1)
(2 rows)

testdb=# explain select id from testtable where testint < 50;
QUERY PLAN
---------------------------------------------------------------
Seq Scan on testtable (cost=0.00..2407.03 rows=2470 width=4)
Filter: (testint < 50)
(2 rows)

(found on 8.4 and reproduced on git-head)

The attached patch tries to align the behaviour.
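
The idea, roughly, is to scale the MCE-based selectivity by the non-null
fraction; a minimal sketch of that adjustment (illustrative Python, not
the actual C code in the patch):

```python
# Minimal sketch of the intended adjustment (illustrative, not the patch's C).
def tsquery_selectivity(mce_selectivity, null_frac):
    """mce_selectivity is estimated from the non-null rows only, so the
    whole-column selectivity must be scaled by the non-null fraction."""
    return (1.0 - null_frac) * mce_selectivity

# With null_frac = 0.5, a 19.66% match rate among non-null rows
# corresponds to roughly 9.83% of all rows.
print(tsquery_selectivity(0.1966, 0.5))
```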

Thanks.

--
Jesper

Attachments:

0001-Take-null_frac-into-account-in-calculating-selectivi.patch (text/x-patch)
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jesper Krogh (#1)
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

Jesper Krogh <jesper@krogh.cc> writes:

> When something that evaluates to NULL isn't included in the result,
> shouldn't the query planner then take the null_frac into account when
> computing the estimate?

The proposed patch seems wrong to me: if we're estimating on the basis
of most-common-value fractions, the null_frac is already accounted for,
because it's not part of the MCV selectivity fractions. IOW, aren't you
double-counting the null fraction?

regards, tom lane

#3Jesper Krogh
jesper@krogh.cc
In reply to: Tom Lane (#2)
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

On 2011-02-17 23:20, Tom Lane wrote:

> Jesper Krogh <jesper@krogh.cc> writes:
>> When something that evaluates to NULL isn't included in the result,
>> shouldn't the query planner then take the null_frac into account when
>> computing the estimate?
>
> The proposed patch seems wrong to me: if we're estimating on the basis
> of most-common-value fractions, the null_frac is already accounted for,
> because it's not part of the MCV selectivity fractions. IOW, aren't you
> double-counting the null fraction?

It might be the wrong place to fix it, but it looks like we're only
counting MCE frequencies based on the non-null elements here:

http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/tsearch/ts_typanalyze.c;h=2654d644579fd1959282d83919474f42540ca703;hb=HEAD#l396

And the testdata confirms the behaviour.
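
To make the difference concrete, a toy illustration of the two
denominators (an assumed simplification in Python, not the actual
ANALYZE code):

```python
# Toy illustration of the denominator difference (assumed simplification,
# not the actual ANALYZE code).
sample = [None, None, "cat", "cat", "dog"]   # 5 sampled values, 2 NULLs
nonnull = [v for v in sample if v is not None]
null_frac = (len(sample) - len(nonnull)) / len(sample)   # 0.4

# Scalar MCV-style frequency: denominator is ALL sampled rows, so the
# NULL fraction is already baked into the selectivity fraction.
mcv_freq = nonnull.count("cat") / len(sample)    # 2/5

# ts_typanalyze MCE-style frequency: denominator is the non-null rows
# only, so a consumer must additionally scale by (1 - null_frac).
mce_freq = nonnull.count("cat") / len(nonnull)   # 2/3

# Both routes agree on the whole-table fraction once MCE is scaled:
assert abs(mcv_freq - mce_freq * (1 - null_frac)) < 1e-12
```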

--
Jesper

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jesper Krogh (#3)
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

Jesper Krogh <jesper@krogh.cc> writes:

> On 2011-02-17 23:20, Tom Lane wrote:
>> The proposed patch seems wrong to me: if we're estimating on the basis
>> of most-common-value fractions, the null_frac is already accounted for,
>> because it's not part of the MCV selectivity fractions. IOW, aren't you
>> double-counting the null fraction?
>
> It might be the wrong place to fix it, but it looks like we're only
> counting MCE frequencies based on the non-null elements here:
> http://git.postgresql.org/gitweb?p=postgresql.git;a=blob;f=src/backend/tsearch/ts_typanalyze.c;h=2654d644579fd1959282d83919474f42540ca703;hb=HEAD#l396

Hmm, you're right, and the specification in pg_statistic.h neglects to
say that. This does need work.

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jesper Krogh (#1)
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)

Jesper Krogh <jesper@krogh.cc> writes:

> The attached patch tries to align the behaviour.

Applied with a bit of editorialization.

regards, tom lane