tsearch2 anomaly?

Started by Bob Gobeille over 18 years ago | 4 messages | general
#1 Bob Gobeille
bob.gobeille@hp.com

I'm having trouble understanding to_tsvector. (PostgreSQL 8.1.9 contrib)

In this first case converting 'gallery2-httpd-conf' makes sense to me
and is exactly what I want. It looks like the entire string is
indexed plus the substrings broken by '-' are indexed.

ossdb=# select to_tsvector('gallery2-httpd-conf');
to_tsvector
---------------------------------------------------------
'conf':4 'httpd':3 'gallery2':2 'gallery2-httpd-conf':1

However, I'd expect the same to happen in the httpd example - but it
does not appear to.

ossdb=# select to_tsvector('httpd-2.2.3-5.src.rpm');
to_tsvector
---------------------------
'httpd-2.2.3-5.src.rpm':1

Why don't I get: 'httpd', 'src', 'rpm', 'httpd-2.2.3-5.src.rpm' ?

Is this a bug or by design?

Thank you!
Bob

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Bob Gobeille (#1)
Re: tsearch2 anomaly?

This is how the default parser works. See the output of
select * from ts_debug('gallery2-httpd-conf');
and
select * from ts_debug('httpd-2.2.3-5.src.rpm');

To list all token types:

select * from token_type();
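
For readers on PostgreSQL 8.3 or later, where text search is built into the core, the same checks can be sketched as below. The 'english' configuration name is an assumption (any installed configuration should do), and ts_token_type('default') is the built-in counterpart of token_type(). The token types reported by newer versions may differ from the 8.1 contrib output shown in this thread.

```sql
-- Sketch for built-in text search (PostgreSQL 8.3+); 'english' is an assumed configuration.
-- Show how each input is split into tokens and which token type each one gets.
SELECT alias, description, token, lexemes
FROM ts_debug('english', 'gallery2-httpd-conf');

SELECT alias, description, token, lexemes
FROM ts_debug('english', 'httpd-2.2.3-5.src.rpm');

-- List every token type the default parser can produce.
SELECT tokid, alias, description
FROM ts_token_type('default');
```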

On Thu, 6 Sep 2007, RC Gobeille wrote:
[quoted message #1 trimmed]

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#3 Bob Gobeille
bob.gobeille@hp.com
In reply to: Oleg Bartunov (#2)
Re: tsearch2 anomaly?

Thanks. I didn't know about ts_debug, so thanks for that as well.

For the record, I see how to use my own processing function (e.g.
dropatsymbol) to get what I need:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
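
A possible alternative that avoids a custom processing function, shown only as a sketch: replace the separators with spaces and concatenate the result with the vector of the original string. The 'simple' configuration and the choice of '-' and '.' as separators are assumptions for this example; they are not taken from the linked documentation.

```sql
-- Sketch: index the whole file name plus its pieces.
-- translate() turns each '-' and '.' into a space, so the parser sees plain words and numbers.
SELECT to_tsvector('simple', 'httpd-2.2.3-5.src.rpm')
    || to_tsvector('simple', translate('httpd-2.2.3-5.src.rpm', '-.', '  '));
```

A search for 'httpd', 'src' or 'rpm' should then match the combined vector. Note that positions from the right-hand vector are shifted past those of the left-hand one, which matters only if positions are used for ranking.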

However, can you explain the logic behind the parsing difference when I just
append an "s" to a string that ends in a dot:

ossdb=# select ts_debug('gallery2-httpd-2.1-conf.');
ts_debug
-----------------------------------------------------------------------
(default,hword,"Hyphenated word",gallery2-httpd-2,{simple},"'2' 'httpd' 'gallery2' 'gallery2-httpd-2'")
(default,part_hword,"Part of hyphenated word",gallery2,{simple},'gallery2')
(default,lpart_hword,"Latin part of hyphenated word",httpd,{en_stem},'httpd')
(default,float,"Decimal notation",2.1,{simple},'2.1')
(default,lpart_hword,"Latin part of hyphenated word",conf,{en_stem},'conf')
(5 rows)

ossdb=# select ts_debug('gallery2-httpd-2.1-conf.s');
ts_debug
---------------------------------------------------------------------
(default,host,Host,gallery2-httpd-2.1-conf.s,{simple},'gallery2-httpd-2.1-conf.s')
(1 row)

Thanks again,
Bob

On 9/6/07 11:19 AM, "Oleg Bartunov" <oleg@sai.msu.su> wrote:
[quoted message #2 trimmed]

#4 Teodor Sigaev
teodor@sigaev.ru
In reply to: Bob Gobeille (#3)
Re: tsearch2 anomaly?

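Ordinary text has no strict syntax rules, so the parser tries to recognize the
most probable token type. Something made of '.', '-' and alphanumeric characters
is often a filename, but a filename very rarely starts or ends with a dot. That
is why your string ending in '.' is split into parts, while the one ending in
'.s' is kept as a single token (reported as host in your output).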
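
To compare the raw tokenization of the two strings side by side on PostgreSQL 8.3 or later, something like the following sketch may help. It assumes the built-in ts_parse and ts_token_type functions with the 'default' parser, and the 'trailing dot' and 'trailing .s' labels are made up for this example. Results on newer versions may not match the 8.1 contrib output quoted in this thread.

```sql
-- Sketch for built-in text search (PostgreSQL 8.3+):
-- list the token type and token text the default parser assigns to each input.
SELECT 'trailing dot' AS input, t.alias, p.token
FROM ts_parse('default', 'gallery2-httpd-2.1-conf.') AS p
JOIN ts_token_type('default') AS t USING (tokid)
UNION ALL
SELECT 'trailing .s', t.alias, p.token
FROM ts_parse('default', 'gallery2-httpd-2.1-conf.s') AS p
JOIN ts_token_type('default') AS t USING (tokid);
```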

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/