processing urls with tsearch2

Started by Laimonas Simutis, over 18 years ago, 4 messages, general
#1 Laimonas Simutis
laimis@gmail.com

Hey guys,

maybe someone using tsearch2 could advise on this. With the default
installation, url, host and some other tokens are processed with the simple
dictionary. Thus a term like mywebsite.com gets stored as 'mywebsite.com'. The
parser correctly assigns the host token type to the term, but the dictionary
the term then gets routed through is simple, so what gets stored is
mywebsite.com

The questions are:

1) Is there a dictionary available that I could use to remove .com, .net,
.org, etc.? I could write one myself, but after seeing some sample dictionary
implementations, and the C code I try to avoid, I got a bit scared.

2) Has anyone else dealt with this, maybe in a different way?

Thanks for any suggestions and help,

Laimis

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Laimonas Simutis (#1)
Re: processing urls with tsearch2

On Thu, 13 Sep 2007, Laimonas Simutis wrote:

> Hey guys,
>
> maybe anyone using tsearch2 could advise on this. With the default
> installation, url, host and some other tokens are processed with the simple
> dictionary. Thus term like mywebsite.com gets stored as 'mywebsite.com'. The
> parser correctly assigns token id of type host to the term, but then the
> dictionary the terms gets routed through is simple and what gets stored is
> mywebsite.com
>
> The questions are:
>
> 1) is there a dictionary available that I could utilize that will remove
> .com, .net, .org, etc? I could write one myself, but after seeing some
> sample dictionary implementations and C code I try to avoid, I got scared a
> bit.

Yes, we have dict_regex, which was developed by Sergey Karpov; see details at
http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html
It uses the pcre library, and you need to know Perl regexps.

> 2) has anyone else dealt with this maybe in a different way?

Sure, preprocess the text using your preferred language before passing it to
to_tsvector.

> Thanks for any suggestions and help,
>
> Laimis

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
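
To make the preprocessing suggestion above concrete, here is a minimal sketch in Python. It is not code from this thread and not part of tsearch2: the function name strip_tlds, the short suffix list and the regular expression are all assumptions chosen for illustration. It drops a trailing .com, .net or .org from host-like tokens before the text is handed to the database.

```python
import re

# Hypothetical, deliberately short suffix list (an assumption); extend as needed.
TLDS = ("com", "net", "org")

# A host-like token: dot-separated labels followed by one of the listed suffixes.
# The trailing \b keeps words such as "foo.community" from matching ".com".
HOST_RE = re.compile(
    r"\b((?:[a-z0-9-]+\.)*[a-z0-9-]+)\.(?:%s)\b" % "|".join(TLDS),
    re.IGNORECASE,
)


def strip_tlds(text: str) -> str:
    """Return text with a trailing .com/.net/.org removed from host-like tokens."""
    return HOST_RE.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_tlds("visit mywebsite.com today"))
```

The cleaned string would then be passed as an ordinary parameter to to_tsvector when indexing, and the same function applied to search input before to_tsquery, so that both sides agree. Note that this simple pattern also rewrites the domain part of e-mail addresses and of full URLs, which may or may not be what you want.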

#3 Laimonas Simutis
laimis@gmail.com
In reply to: Oleg Bartunov (#2)
Re: processing urls with tsearch2

Is there any way to install the dictionary without running make? That is, are
binary versions of it available? I am running PostgreSQL on Windows servers...


#4 Laimonas Simutis
laimis@gmail.com
In reply to: Laimonas Simutis (#3)
Re: processing urls with tsearch2

Thanks for the advice. For now I went with the second option: preprocessing
the text before passing it to to_tsquery.

However, I would like to see what it would take to get some of the
dictionaries available out there hooked into PostgreSQL on Windows. Does
anyone have any pointers or ideas on where I can start looking if I want to
compile and add a dictionary to tsearch2 in a Windows environment?

Thanks,

Laimis
