BUG #6327: Prefix full-text-search fails for hosts with complicated names
The following bug has been logged on the website:
Bug reference: 6327
Logged by: Marcin Kasperski
Email address: Marcin.Kasperski@mekk.waw.pl
PostgreSQL version: 9.1.1
Operating system: Linux
Description:
Synopsis
=========
'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com
Example SQL
=============
Try the queries below. Note the ismatch column, which is t in the first
query and f in the second (IMHO it should be t in both).
SELECT a query, b message, a@@b ismatch FROM (
SELECT TO_TSQUERY('english', 'goog:*') a,
TO_TSVECTOR('english', 'See google.com') b) as foo;
SELECT a query, b message, a@@b ismatch FROM (
SELECT TO_TSQUERY('english', 'e-goog:*') a,
TO_TSVECTOR('english', 'See e-google.com') b) as foo;
On 05-12-2011 09:40, Marcin.Kasperski@mekk.waw.pl wrote:
'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com
It is a known limitation. The text search parser ignores some uncommon cases.
See TODO and archives.
--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Marcin.Kasperski@mekk.waw.pl writes:
Synopsis
=========
'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com
The reason for this seems to be that the pattern is treated as a
hyphenated word:
regression=# select TO_TSQUERY('english', 'e-goog:*');
to_tsquery
-------------------------------
'e-goog':* & 'e':* & 'goog':*
(1 row)
but the hostname isn't:
regression=# select TO_TSVECTOR('english', 'See e-google.com');
to_tsvector
--------------------------
'e-google.com':2 'see':1
(1 row)
If you change the text so it's not recognized as a hostname, you get
lexemes that would match the query:
regression=# select TO_TSVECTOR('english', 'See e-google com');
to_tsvector
---------------------------------------------
'com':5 'e':3 'e-googl':2 'googl':4 'see':1
(1 row)
Possibly we could fix this by hacking the ts parser so that it would
also apply the hyphenated-word rules to a hostname containing a dash.
In general though, there are always going to be cases where prefix
match doesn't work because of dictionary transformations ...
regards, tom lane
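For anyone wanting to reproduce the diagnosis above, a minimal sketch using the built-in ts_debug function might look like the following. It only exposes how the parser classifies each token; judging by the to_tsvector and to_tsquery results shown in this message, the document text would be expected to yield a single host-type token while the query text is split into a hyphenated word and its parts. The column selection is just one convenient choice, and exact alias names may differ between versions.

```sql
-- How the parser tokenizes the document text: the whole name is expected
-- to come back as one hostname token rather than as a hyphenated word.
SELECT alias, token, lexemes
FROM ts_debug('english', 'See e-google.com');

-- How the parser tokenizes the query text (without the :* prefix marker):
-- expected to be a hyphenated word plus its individual parts.
SELECT alias, token, lexemes
FROM ts_debug('english', 'e-goog');
```

Comparing the alias column of the two results shows why the lexemes on the two sides of @@ never line up.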
On 05-12-2011 12:29, Marcin Kasperski wrote:
'goog:*' matches google.com
but 'e-goog:*' does not match e-google.com
It is a known limitation. The text search parser ignores some uncommon cases.
See TODO and archives.
Could you suggest what I should look for? I don't see anything related on
http://wiki.postgresql.org/wiki/Todo#Text_Search
and I already tried numerous searches to find similar problems, but
failed to locate anything related.
Improve handling of plus signs in email address user names, and perhaps
improve URL parsing
Search for "url text search parser" in the archives.
--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On Mon, 5 Dec 2011, Tom Lane wrote:
Marcin.Kasperski@mekk.waw.pl writes:
Synopsis
=========
'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com
The reason for this seems to be that the pattern is treated as a
hyphenated word:
regression=# select TO_TSQUERY('english', 'e-goog:*');
to_tsquery
-------------------------------
'e-goog':* & 'e':* & 'goog':*
(1 row)
but the hostname isn't:
regression=# select TO_TSVECTOR('english', 'See e-google.com');
to_tsvector
--------------------------
'e-google.com':2 'see':1
(1 row)
If you change the text so it's not recognized as a hostname, you get
lexemes that would match the query:
regression=# select TO_TSVECTOR('english', 'See e-google com');
to_tsvector
---------------------------------------------
'com':5 'e':3 'e-googl':2 'googl':4 'see':1
(1 row)
Possibly we could fix this by hacking the ts parser so that it would
also apply the hyphenated-word rules to a hostname containing a dash.
In general though, there are always going to be cases where prefix
match doesn't work because of dictionary transformations ...
I'd index the 'after dictionary transformations' lexemes as well as the
original, to let prefix match always work.
regards, tom lane
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
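Until the parser itself changes, a user-level approximation of the "index more than one form" idea can be sketched by concatenating tsvectors. This is an assumption-laden workaround, not the parser fix Tom describes and not necessarily what Oleg has in mind: the dot-to-space replacement is a hypothetical preprocessing step that simply reproduces the 'See e-google com' text from Tom's example, so that the hyphenated-word lexemes are indexed alongside the hostname lexeme.

```sql
-- Hypothetical workaround: build the tsvector from the text twice, once
-- as-is (keeps 'e-google.com' as a host lexeme) and once with dots turned
-- into spaces (so the hyphenated-word rules apply, as in Tom's example).
SELECT q AS query, v AS message, q @@ v AS ismatch
FROM (
    SELECT to_tsquery('english', 'e-goog:*') AS q,
           to_tsvector('english', 'See e-google.com')
           || to_tsvector('english',
                          replace('See e-google.com', '.', ' ')) AS v
) AS foo;
```

The same || concatenation could be used with to_tsvector('simple', ...) to keep unstemmed words next to the stemmed ones, at the cost of a larger index. Positions in the second vector are offset by the first, so phrase-style ranking may be affected.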