Tsearch2 custom dictionaries

Started by Mat over 22 years ago · 4 messages · general
#1 Mat
psql-mail@freeuk.com

Part1.

I have created a dictionary called 'webwords' which checks all words
and curtails them to 300 chars (for now).
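The behaviour described above — a pass-through dictionary that only curtails overlong lexemes — can be sketched in Python. (The real tsearch2 dictionary is a C shared library generated from the gendict template; the function name and limit here are purely illustrative.)

```python
MAX_LEXEME_LEN = 300  # the 300-char limit mentioned above

def webwords_lexize(word: str) -> list[str]:
    """Accept every word, truncating anything longer than the limit.

    A Python sketch of what lexize('webwords', ...) is described as
    doing; the actual dictionary is C code built with tsearch2's gendict.
    """
    return [word[:MAX_LEXEME_LEN]]

# A short word passes through unchanged; a long one is curtailed.
print(webwords_lexize("some_web_word"))   # ['some_web_word']
print(len(webwords_lexize("x" * 400)[0])) # 300
```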

after running
make
make install

I then copied the lib_webwords.so into my $libdir

I have run

psql mybd < dict_webwords.sql

The tutorial shows how to install the intdict for integer types. How
should I install my custom dictionary?

Part2.

The dictionary I am trying to create is to be used for searching
multilingual text. My aim is to have fast search over all the text, but
to ignore the binary encoded data which is also present. (I will probably
move to ignoring long words in the text eventually.)
What is the best approach to tackle this problem?
As the text can be multilingual, I don't think stemming is possible?
I also need to include many non-standard words in the index, such as
URLs and message IDs contained in the text.

I get the feeling that building these indexes will by no means be an
easy task, so any suggestions will be gratefully received!
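The filtering problem above can be approached with a heuristic token filter: keep URLs and message IDs, and drop tokens that are overlong or look like binary/encoded data. A minimal Python sketch — the thresholds, regexes, and function name are all assumptions, not tsearch2 API:

```python
import re

MAX_LEN = 300
URL_RE = re.compile(r'^https?://\S+$')
MSGID_RE = re.compile(r'^<[^<>@\s]+@[^<>@\s]+>$')  # e.g. <abc123@example.com>

def keep_token(tok: str) -> bool:
    """Heuristic filter: keep URLs and message IDs, drop long or binary-looking tokens."""
    if URL_RE.match(tok) or MSGID_RE.match(tok):
        return True
    if len(tok) > MAX_LEN:
        return False  # likely base64 or other encoded data
    # Require a majority of alphabetic characters to weed out binary runs.
    alpha = sum(c.isalpha() for c in tok)
    return alpha >= len(tok) / 2

tokens = ["hello", "x" * 400, "http://www.sai.msu.su/~megera", "<id42@host.net>"]
print([t for t in tokens if keep_token(t)])
# ['hello', 'http://www.sai.msu.su/~megera', '<id42@host.net>']
```

In tsearch2 itself this split would more naturally live in the parser (which assigns token types) than in a dictionary, as Oleg suggests below in the thread.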

Thanks...

--

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Mat (#1)
Re: Tsearch2 custom dictionaries

On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

Part1.

I have created a dictionary called 'webwords' which checks all words
and curtails them to 300 chars (for now)

after running
make
make install

I then copied the lib_webwords.so into my $libdir

I have run

psql mybd < dict_webwords.sql

The tutorial shows how to install the intdict for integer types. How
should i install my custom dictionary?

Once you did 'psql mybd < dict_webwords.sql' you should be able to use it :)
Test it:
select lexize('webwords','some_web_word');

Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict

Part2.

The dictionary I am trying to create is to be used for searching
multilingual text. My aim is to have fast search over all text, but
ignore binary encoded data which is also present. (i will probably move
to ignoring long words in the text eventually).
What is the best approach to tackle this problem?
As the text can be multilingual I don't think stemming is possible?

You're right. I'm afraid you need a UTF database, but tsearch2 isn't
UTF-8 compatible :(

I also need to include many none-standard words in the index such as
urls and message ID's contained in the text.

What's a message ID? An integer? It's already recognized by the parser.

try
select * from token_type();

Also, the latest version of tsearch2 (for 7.3, grab from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
for 7.4 - available from CVS)
has a rather useful function - ts_debug

apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
ts_name | tok_type | description | token | dict_name | tsvector
---------+----------+-------------+----------------+-----------+------------------
simple | host | Host | www.sai.msu.su | {simple} | 'www.sai.msu.su'
simple | lword | Latin word | megera | {simple} | 'megera'
(2 rows)

I get the feeling that building these indexs will by no means be an
easy task so any suggestions will be gratefully recieved!

You may write your own parser as a last resort. Some info about the parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief
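A custom parser essentially walks the text and emits (token type, token) pairs, classifying special tokens such as URLs and message IDs before plain words. tsearch2's parser API does this in C; the idea can be sketched in Python (the type names and regexes here are illustrative assumptions, not tsearch2's actual token types):

```python
import re

# Ordered patterns: earlier ones win, mirroring how a parser can
# classify special tokens (URLs, message IDs) before plain words.
TOKEN_PATTERNS = [
    ("url", re.compile(r'https?://\S+')),
    ("msgid", re.compile(r'<[^<>@\s]+@[^<>@\s]+>')),
    ("word", re.compile(r'\w+')),
]

def tokenize(text: str):
    """Yield (token_type, token) pairs, skipping unclassified characters."""
    pos = 0
    while pos < len(text):
        for name, pat in TOKEN_PATTERNS:
            m = pat.match(text, pos)
            if m:
                yield name, m.group()
                pos = m.end()
                break
        else:
            pos += 1  # nothing matched here; skip one character

print(list(tokenize("see http://x.org and <a@b.net>")))
# [('word', 'see'), ('url', 'http://x.org'), ('word', 'and'), ('msgid', '<a@b.net>')]
```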

Thanks...

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#3 Mat
psql-mail@freeuk.com
In reply to: Oleg Bartunov (#2)
Re: Tsearch2 custom dictionaries

On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

Part1.

I have created a dictionary called 'webwords' which checks all words
and curtails them to 300 chars (for now)

after running
make
make install

I then copied the lib_webwords.so into my $libdir

I have run

psql mybd < dict_webwords.sql

Once you did 'psql mybd < dict_webwords.sql' you should be able use it :)

Test it :
select lexize('webwords','some_web_word');

I did test it with
select lexize('webwords','some_web_word');
lexize
-------
{some_web_word}

select lexize('webwords','some_400char_web_word');
lexize
--------
{some_shortened_web_word}

so that bit works, but then I tried

SELECT to_tsvector( 'webwords', 'my words' );
Error: No tsearch config

Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict

yeah, I did read it - it's good!
should I run:
update pg_ts_cfgmap set dict_name='{webwords}';

Part2.

<snip>

As the text can be multilingual I don't think stemming is possible?

You're right. I'm afraid you need UTF database, but tsearch2 isn't
UTF-8 compatible :(

My database was created as unicode - does this mean I cannot use
tsearch?!

I also need to include many none-standard words in the index such as
urls and message ID's contained in the text.

What's message ID ? Integer ? it's already recognized by parser.

try
select * from token_type();

Also, last version of tsearch2 (for 7.3 grab from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
for 7.4 - available from CVS)
has rather useful function - ts_debug

apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
ts_name | tok_type | description | token | dict_name | tsvector
---------+----------+-------------+----------------+-----------+------------------
simple | host | Host | www.sai.msu.su | {simple} | 'www.sai.msu.su'
simple | lword | Latin word | megera | {simple} | 'megera'
(2 rows)

I get the feeling that building these indexs will by no means be an
easy task so any suggestions will be gratefully recieved!

You may write your own parser, at last. Some info about parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

Parser writing...scary stuff :-)

Thanks!

--

#4 Oleg Bartunov
oleg@sai.msu.su
In reply to: Mat (#3)
Re: Tsearch2 custom dictionaries

On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

Part1.

I have created a dictionary called 'webwords' which checks all words
and curtails them to 300 chars (for now)

after running
make
make install

I then copied the lib_webwords.so into my $libdir

I have run

psql mybd < dict_webwords.sql

Once you did 'psql mybd < dict_webwords.sql' you should be able use it :)

Test it :
select lexize('webwords','some_web_word');

I did test it with
select lexize('webwords','some_web_word');
lexize
-------
{some_web_word}

select lexize('webwords','some_400char_web_word');
lexize
--------
{some_shortened_web_word}

so that bit works, but then I tried

SELECT to_tsvector( 'webwords', 'my words' );
Error: No tsearch config

From the reference guide:
to_tsvector( [configuration,] document TEXT) RETURNS tsvector

Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict

yeah, i did read it - its good!
should i run:
update pg_ts_cfgmap set dict_name='{webwords}';

After loading your dictionary into the db, it should be registered in
pg_ts_dict; try

select * from pg_ts_dict;

Next, you need to read the docs, for example
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
on how to create your configuration and specify the lexeme_type-dictionary
mapping.

Part2.

<snip>

As the text can be multilingual I don't think stemming is possible?

You're right. I'm afraid you need UTF database, but tsearch2 isn't
UTF-8 compatible :(

My database was created as unicode - does this mean I cannot use
tsaerch?!

We have no experience with UTF, so you had better ask the openfts mailing
list and read its archives.

I also need to include many none-standard words in the index such as
urls and message ID's contained in the text.

What's message ID ? Integer ? it's already recognized by parser.

try
select * from token_type();

Also, last version of tsearch2 (for 7.3 grab from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
for 7.4 - available from CVS)
has rather useful function - ts_debug

apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
ts_name | tok_type | description | token | dict_name | tsvector
---------+----------+-------------+----------------+-----------+------------------
simple | host | Host | www.sai.msu.su | {simple} | 'www.sai.msu.su'
simple | lword | Latin word | megera | {simple} | 'megera'
(2 rows)

I get the feeling that building these indexs will by no means be an
easy task so any suggestions will be gratefully recieved!

You may write your own parser, at last. Some info about parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

Parser writing...scary stuff :-)

Thanks!

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83