Inconsistency with stemming/stop words in Tsearch2
Hi, I'm having an issue with Tsearch2: stop-word lexemes are
sometimes used and sometimes not. I would expect to_tsquery to behave
consistently for the three variations "what", "what's" and "whats"
(using 'en_stem'), and for all of them to be ignored, since they all
reduce to the stop word "what". However, this is not the case:
to_tsquery('whats') returns the stop word "what" as a result. Even
more confusing, the lexize results below are inconsistent with the
to_tsquery results below. This seems like a bug to me.
goodrec_2=# select lexize('en_stem', 'what''s');
lexize
--------
{what}
goodrec_2=# select lexize('en_stem', 'whats');
lexize
--------
{what}
goodrec_2=# select lexize('en_stem', 'what');
lexize
--------
{}
goodrec_2=# select to_tsquery('what''s');
NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
to_tsquery
goodrec_2=# select to_tsquery('whats');
to_tsquery
------------
'what'
goodrec_2=# select to_tsquery('what');
NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
The list of stop words is user-defined, so you can just add 'whats' to
the list. We didn't include it in the default list because it is not
as frequent as 'what'.
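One way to add 'whats' could look like the sketch below. It assumes the built-in text search of PostgreSQL 8.3 or later (the version whose ts_debug('english', ...) syntax is used in this reply), not the older contrib module. The names english_custom, english_stem_custom and my_english are hypothetical, and the sketch assumes a file english_custom.stop (a copy of the stock english.stop with the line "whats" appended) has already been placed in $SHAREDIR/tsearch_data.

```sql
-- Hypothetical names throughout; adjust to your installation.
-- Assumes $SHAREDIR/tsearch_data/english_custom.stop exists and
-- contains the default English stop words plus the line: whats

-- A Snowball stemmer dictionary that reads the extended stop-word file.
CREATE TEXT SEARCH DICTIONARY english_stem_custom (
    TEMPLATE  = snowball,
    Language  = english,
    StopWords = english_custom
);

-- A private copy of the stock configuration, so 'english' stays untouched.
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);

-- Use the new dictionary wherever the copy used english_stem.
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING REPLACE english_stem WITH english_stem_custom;

-- 'whats' should now be discarded as a stop word, like 'what'.
SELECT to_tsquery('my_english', 'whats');
```

Working on a copy of the configuration keeps the change local: existing indexes and queries that use 'english' are unaffected until you switch them over.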
BTW, you can use the ts_debug function to see what really happens:
=# select * from ts_debug('english','what''s');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | what | {english_stem} | english_stem | {}
blank | Space symbols | ' | {} | |
asciiword | Word, all ASCII | s | {english_stem} | english_stem | {}
(3 rows)
=# select * from ts_debug('english','whats');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | whats | {english_stem} | english_stem | {what}
(1 row)
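The ts_debug output above suggests the reason for the difference: the parser splits "what's" into the tokens "what" and "s", each of which is a stop word, while "whats" is a single token that is not in the stop list and is only afterwards stemmed to "what". The Snowball dictionary checks the stop list before stemming, so the stemmed result is never compared against it. A minimal sketch for checking each token directly, assuming the default english_stem dictionary and the 8.3 function ts_lexize (the counterpart of lexize in the original question):

```sql
-- Each token is looked up in the stop list as typed, before stemming.
SELECT ts_lexize('english_stem', 'what');   -- {}      stop word
SELECT ts_lexize('english_stem', 's');      -- {}      stop word
SELECT ts_lexize('english_stem', 'whats');  -- {what}  not a stop word, stemmed
```

The results in the comments are the ones shown in the lexemes column of the ts_debug output above.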
On Mon, 14 Jul 2008, Yishai Lerner wrote:
Hi, having an issue with Tsearch2 and how stop words lexemes are sometimes
being utilized and sometimes not. I would expect the behavior for to_tsquery
for the three variations of "what", "what's" and "whats" to be consistent
(using 'en_stem') and for all variations to be ignored since they all result
in a stop word of "what". However, this is not the case as
to_tsquery("whats") returns the stop word "what" as a result. Even more
confusing is that if one were to look at the lexize results below, they are
inconsistent with the to_tsquery results below. This seems like a bug to me.
goodrec_2=# select lexize('en_stem', 'what''s');
lexize
--------
{what}
goodrec_2=# select lexize('en_stem', 'whats');
lexize
--------
{what}
goodrec_2=# select lexize('en_stem', 'what');
lexize
--------
{}
goodrec_2=# select to_tsquery('what''s');
NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
to_tsquery
goodrec_2=# select to_tsquery('whats');
to_tsquery
------------
'what'
goodrec_2=# select to_tsquery('what');
NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83