planner row-estimates for tsvector seems horribly wrong

Started by Sushant Sinhaabout 15 years ago4 messages
#1Sushant Sinha
sushant354@gmail.com

I am using gin index on a tsvector and doing basic search. I see the
row-estimate of the planner to be horribly wrong. It is returning
row-estimate as 4843 for all queries whether it matches zero rows, a
medium number of rows (88,000) or a large number of rows (726,000).

The table has roughly a million docs.

I see a similar problem reported here but thought it was fixed in 9.0
which I am running.

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01389.php

Here is the version info and detailed planner output for all the three
queries:

select version();

version

PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc
(Gentoo 4.3.4 p1.1, pie-10.1.5) 4.3.4, 64-bit

Case I: FOR A NON-MATCHING WORD
===============================

explain analyze select count(*) from docmeta,
plainto_tsquery('english', 'dyfdfdf') as qdoc where docvector @@ qdoc;
QUERY
PLAN

Aggregate (cost=20322.17..20322.18 rows=1 width=0) (actual
time=0.058..0.058 rows=1 loops=1)
-> Nested Loop (cost=5300.28..20310.06 rows=4843 width=0) (actual
time=0.055..0.055 rows=0 loops=1)
-> Function Scan on qdoc (cost=0.00..0.01 rows=1 width=32)
(actual time=0.005..0.005 rows=1 loops=1)
-> Bitmap Heap Scan on docmeta (cost=5300.28..20249.51
rows=4843 width=270) (actual time=0.046..0.046 rows=0 loops=1)
Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
-> Bitmap Index Scan on doc_index (cost=0.00..5299.07
rows=4843 width=0) (actual time=0.044..0.044 rows=0 loops=1)
Index Cond: (docmeta.docvector @@ qdoc.qdoc)
Total runtime: 0.092 ms

CASE II: FOR A MEDIUM-MATCHING WORD
===================================
explain analyze select count(*) from docmeta,
plainto_tsquery('english', 'quit') as qdoc where docvector @@ qdoc;
QUERY
PLAN

Aggregate (cost=20322.17..20322.18 rows=1 width=0) (actual
time=1222.856..1222.857 rows=1 loops=1)
-> Nested Loop (cost=5300.28..20310.06 rows=4843 width=0) (actual
time=639.275..1212.460 rows=88545 loops=1)
-> Function Scan on qdoc (cost=0.00..0.01 rows=1 width=32)
(actual time=0.006..0.007 rows=1 loops=1)
-> Bitmap Heap Scan on docmeta (cost=5300.28..20249.51
rows=4843 width=270) (actual time=639.264..1196.542 rows=88545 loops=1)
Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
-> Bitmap Index Scan on doc_index (cost=0.00..5299.07
rows=4843 width=0) (actual time=621.877..621.877 rows=88545 loops=1)
Index Cond: (docmeta.docvector @@ qdoc.qdoc)
Total runtime: 1222.907 ms

Case II: FOR A HIGH-MATCHING WORD
=================================

explain analyze select count(*) from docmeta,
plainto_tsquery('english', 'j') as qdoc where docvector @@ qdoc;
QUERY
PLAN

Aggregate (cost=20322.17..20322.18 rows=1 width=0) (actual
time=742.857..742.858 rows=1 loops=1)
-> Nested Loop (cost=5300.28..20310.06 rows=4843 width=0) (actual
time=126.804..660.895 rows=726985 loops=1)
-> Function Scan on qdoc (cost=0.00..0.01 rows=1 width=32)
(actual time=0.004..0.006 rows=1 loops=1)
-> Bitmap Heap Scan on docmeta (cost=5300.28..20249.51
rows=4843 width=270) (actual time=126.795..530.422 rows=726985 loops=1)
Recheck Cond: (docmeta.docvector @@ qdoc.qdoc)
-> Bitmap Index Scan on doc_index (cost=0.00..5299.07
rows=4843 width=0) (actual time=113.742..113.742 rows=726985 loops=1)
Index Cond: (docmeta.docvector @@ qdoc.qdoc)
Total runtime: 742.906 ms

Thanks,
Sushant.

#2Jan Urbański
wulczer@wulczer.org
In reply to: Sushant Sinha (#1)
Re: planner row-estimates for tsvector seems horribly wrong

On 24/10/10 14:44, Sushant Sinha wrote:

I am using gin index on a tsvector and doing basic search. I see the
row-estimate of the planner to be horribly wrong. It is returning
row-estimate as 4843 for all queries whether it matches zero rows, a
medium number of rows (88,000) or a large number of rows (726,000).

The table has roughly a million docs.

I see a similar problem reported here but thought it was fixed in 9.0
which I am running.

Hi,

what's your default statistics target? Could you attach the output of

select * from pg_stats where tablename = '<the-tablename>';

and

select typanalyze from pg_type where typname = 'tsvector';

Thanks,
Jan

#3Jan Urbański
wulczer@wulczer.org
In reply to: Sushant Sinha (#1)
Re: planner row-estimates for tsvector seems horribly wrong

On 24/10/10 14:44, Sushant Sinha wrote:

I am using gin index on a tsvector and doing basic search. I see the
row-estimate of the planner to be horribly wrong. It is returning
row-estimate as 4843 for all queries whether it matches zero rows, a
medium number of rows (88,000) or a large number of rows (726,000).

The table has roughly a million docs.

explain analyze select count(*) from docmeta,
plainto_tsquery('english', 'dyfdfdf') as qdoc where docvector @@ qdoc;

OK, forget my previous message. The problem is that you are doing a join
using @@ as the operator for the join condition, so the planner uses the
operator's join selectivity estimate. For @@ the tsmatchjoinsel function
simply returns 0.005.

Try doing:

explain analyze select count(*) from docmeta where docvector @@
plainto_tsquery('english', 'dyfdfdf');

It should help.

Cheers,
Jan

#4Sushant Sinha
sushant354@gmail.com
In reply to: Jan Urbański (#3)
Re: planner row-estimates for tsvector seems horribly wrong

Thanks a ton Jan! It works quite correctly. But many tsearch tutorials
ask tsquery to be placed in 'from' statement and that can cause bad
plan. Isn't it possible to return the correct number for a join with the
query as well?

-Sushant.

Show quoted text

On Sun, 2010-10-24 at 15:07 +0200, Jan Urbański wrote:

On 24/10/10 14:44, Sushant Sinha wrote:

I am using gin index on a tsvector and doing basic search. I see the
row-estimate of the planner to be horribly wrong. It is returning
row-estimate as 4843 for all queries whether it matches zero rows, a
medium number of rows (88,000) or a large number of rows (726,000).

The table has roughly a million docs.

explain analyze select count(*) from docmeta,
plainto_tsquery('english', 'dyfdfdf') as qdoc where docvector @@ qdoc;

OK, forget my previous message. The problem is that you are doing a join
using @@ as the operator for the join condition, so the planner uses the
operator's join selectivity estimate. For @@ the tsmatchjoinsel function
simply returns 0.005.

Try doing:

explain analyze select count(*) from docmeta where docvector @@
plainto_tsquery('english', 'dyfdfdf');

It should help.

Cheers,
Jan