BUG #6327: Prefix full-text-search fails for hosts with complicated names

Started by Noname over 14 years ago, 5 messages, bugs
#1 Noname
Marcin.Kasperski@mekk.waw.pl

The following bug has been logged on the website:

Bug reference: 6327
Logged by: Marcin Kasperski
Email address: Marcin.Kasperski@mekk.waw.pl
PostgreSQL version: 9.1.1
Operating system: Linux
Description:

Synopsis
=========

'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com

Example SQL
=============

Try the queries below. Note the ismatch column, which is t for the first
query and f for the second (IMHO it should be t in both).

SELECT a query, b message, a@@b ismatch FROM (
SELECT TO_TSQUERY('english', 'goog:*') a,
TO_TSVECTOR('english', 'See google.com') b) as foo;

SELECT a query, b message, a@@b ismatch FROM (
SELECT TO_TSQUERY('english', 'e-goog:*') a,
TO_TSVECTOR('english', 'See e-google.com') b) as foo;

#2 Euler Taveira de Oliveira
In reply to: Noname (#1)
Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names

On 05-12-2011 09:40, Marcin.Kasperski@mekk.waw.pl wrote:

'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com

It is a known limitation. The text search parser ignores some uncommon cases.
See TODO and archives.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#1)
Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names

Marcin.Kasperski@mekk.waw.pl writes:

Synopsis
=========

'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com

The reason for this seems to be that the pattern is treated as a
hyphenated word:

regression=# select TO_TSQUERY('english', 'e-goog:*');
to_tsquery
-------------------------------
'e-goog':* & 'e':* & 'goog':*
(1 row)

but the hostname isn't:

regression=# select TO_TSVECTOR('english', 'See e-google.com');
to_tsvector
--------------------------
'e-google.com':2 'see':1
(1 row)

If you change the text so it's not recognized as a hostname, you get
lexemes that would match the query:

regression=# select TO_TSVECTOR('english', 'See e-google com');
to_tsvector
---------------------------------------------
'com':5 'e':3 'e-googl':2 'googl':4 'see':1
(1 row)

Possibly we could fix this by hacking the ts parser so that it would
also apply the hyphenated-word rules to a hostname containing a dash.

In general though, there are always going to be cases where prefix
match doesn't work because of dictionary transformations ...

regards, tom lane
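
The mismatch described in the message above can be checked by hand with a small model. The sketch below is in Python and is only a toy: it assumes that a prefix term such as 'goog':* matches when some lexeme starts with that prefix, and that terms joined with & must all match. It is not PostgreSQL's implementation. The function names are made up for illustration; the query terms and lexemes are copied from the psql output quoted in this thread.

```python
# Toy model of tsquery prefix matching; not PostgreSQL's actual code.

def prefix_term_matches(prefix, lexemes):
    """A term like 'goog':* matches if any lexeme starts with the prefix."""
    return any(lexeme.startswith(prefix) for lexeme in lexemes)


def and_query_matches(prefixes, lexemes):
    """Terms joined with & must all match."""
    return all(prefix_term_matches(p, lexemes) for p in prefixes)


# Terms of to_tsquery('english', 'e-goog:*') as shown in the thread.
query = ["e-goog", "e", "goog"]

# Lexemes of the two to_tsvector results shown in the thread.
hostname_vector = {"e-google.com", "see"}                 # 'See e-google.com'
split_vector = {"com", "e", "e-googl", "googl", "see"}    # 'See e-google com'

# Which query terms find no lexeme in the hostname case?
failing_terms = [p for p in query if not prefix_term_matches(p, hostname_vector)]

print(and_query_matches(query, hostname_vector))  # False
print(failing_terms)                              # ['goog']
print(and_query_matches(query, split_vector))     # True
```

In this model the 'e-goog' and 'e' terms both find the single hostname lexeme, and only the 'goog' term has nothing to match, which is enough to make the whole AND query fail.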

#4 Euler Taveira de Oliveira
In reply to: Noname (#1)
Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names

On 05-12-2011 12:29, Marcin Kasperski wrote:

'goog:*' matches google.com
but 'e-goog:*' does not match e-google.com

It is a known limitation. The text search parser ignores some uncommon cases.
See TODO and archives.

Could you suggest what to look for? I don't see anything related on
http://wiki.postgresql.org/wiki/Todo#Text_Search
and I already tried numerous searches to find similar problems, but
failed to locate anything related.

Improve handling of plus signs in email address user names, and perhaps
improve URL parsing

Search for "url text search parser" in the archives.

--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

#5 Oleg Bartunov
oleg@sai.msu.su
In reply to: Tom Lane (#3)
Re: BUG #6327: Prefix full-text-search fails for hosts with complicated names

On Mon, 5 Dec 2011, Tom Lane wrote:

Marcin.Kasperski@mekk.waw.pl writes:

Synopsis
=========

'goog:*' matches google.com
but
'e-goog:*' does not match e-google.com

The reason for this seems to be that the pattern is treated as a
hyphenated word:

regression=# select TO_TSQUERY('english', 'e-goog:*');
to_tsquery
-------------------------------
'e-goog':* & 'e':* & 'goog':*
(1 row)

but the hostname isn't:

regression=# select TO_TSVECTOR('english', 'See e-google.com');
to_tsvector
--------------------------
'e-google.com':2 'see':1
(1 row)

If you change the text so it's not recognized as a hostname, you get
lexemes that would match the query:

regression=# select TO_TSVECTOR('english', 'See e-google com');
to_tsvector
---------------------------------------------
'com':5 'e':3 'e-googl':2 'googl':4 'see':1
(1 row)

Possibly we could fix this by hacking the ts parser so that it would
also apply the hyphenated-word rules to a hostname containing a dash.

In general though, there are always going to be cases where prefix
match doesn't work because of dictionary transformations ...

I'd index the 'after dictionary transformation' lexemes as well as the
originals, so that prefix match always works.

regards, tom lane

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
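
The idea in the message above, indexing the original tokens alongside the dictionary output, can be sketched with a toy model in Python. Everything here is an assumption made for illustration: TOY_DICTIONARY is a hand-written stand-in for a real dictionary (only the 'google' to 'googl' mapping comes from this thread), the function names are hypothetical, and the query prefix is compared as typed, without any normalization.

```python
# Toy model of "index originals as well as normalized lexemes".
# Not PostgreSQL code; the dictionary below is a hypothetical stand-in.

TOY_DICTIONARY = {
    "google": "googl",   # as seen in the thread's to_tsvector output
    "running": "run",    # assumed example of a shortening transformation
}


def normalize(token):
    """Return the dictionary form of a token, or the token itself."""
    return TOY_DICTIONARY.get(token, token)


def index_normalized_only(tokens):
    """Index only what the dictionary produces."""
    return {normalize(t) for t in tokens}


def index_with_originals(tokens):
    """Index the dictionary output and the original tokens together."""
    return {normalize(t) for t in tokens} | set(tokens)


def prefix_matches(prefix, lexemes):
    """A prefix term matches if any indexed entry starts with it."""
    return any(lexeme.startswith(prefix) for lexeme in lexemes)


tokens = ["see", "running", "google"]

# 'runn' is a prefix of the original word but not of its lexeme 'run'.
print(prefix_matches("runn", index_normalized_only(tokens)))  # False
print(prefix_matches("runn", index_with_originals(tokens)))   # True
```

The second index holds more entries than the first, so in this model the extra matches come at the cost of storing more per document.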