Tsearch2 custom dictionaries
Part1.
I have created a dictionary called 'webwords' which checks all words
and curtails them to 300 chars (for now)
after running
make
make install
I then copied the lib_webwords.so into my $libdir
I have run
psql mybd < dict_webwords.sql
The tutorial shows how to install the intdict for integer types. How
should I install my custom dictionary?
Part2.
The dictionary I am trying to create is to be used for searching
multilingual text. My aim is to have fast search over all text, but
ignore binary-encoded data which is also present. (I will probably move
to ignoring long words in the text eventually.)
What is the best approach to tackle this problem?
As the text can be multilingual I don't think stemming is possible?
I also need to include many non-standard words in the index, such as
URLs and message IDs contained in the text.
I get the feeling that building these indexes will by no means be an
easy task, so any suggestions will be gratefully received!
Thanks...
--
On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:
Part1.
I have created a dictionary called 'webwords' which checks all words
and curtails them to 300 chars (for now)
after running
make
make install
I then copied the lib_webwords.so into my $libdir
I have run
psql mybd < dict_webwords.sql
The tutorial shows how to install the intdict for integer types. How
should I install my custom dictionary?
Once you did 'psql mybd < dict_webwords.sql' you should be able to use it :)
Test it :
select lexize('webwords','some_web_word');
Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict
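In case it helps, the registration that a gendict-generated dict_webwords.sql normally performs can be sketched as below. This is only an outline of the tsearch2 catalog entry; the C entry points webwords_init and webwords_lexize are hypothetical names standing in for whatever the gendict template actually generated:

```sql
-- Sketch of the pg_ts_dict row a gendict-generated dict_webwords.sql
-- is expected to create; the function names here are hypothetical.
INSERT INTO pg_ts_dict
       (dict_name, dict_init, dict_initoption, dict_lexize, dict_comment)
VALUES ('webwords',
        'webwords_init(internal)'::regprocedure,
        NULL,
        'webwords_lexize(internal,internal,int4)'::regprocedure,
        'curtails words longer than 300 chars');
```

If `select * from pg_ts_dict;` already shows a 'webwords' row, the dictionary is installed and only the configuration mapping remains.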
Part2.
The dictionary I am trying to create is to be used for searching
multilingual text. My aim is to have fast search over all text, but
ignore binary-encoded data which is also present. (I will probably move
to ignoring long words in the text eventually.)
What is the best approach to tackle this problem?
As the text can be multilingual I don't think stemming is possible?
You're right. I'm afraid you need a UTF database, but tsearch2 isn't
UTF-8 compatible :(
I also need to include many non-standard words in the index, such as
URLs and message IDs contained in the text.
What's a message ID? An integer? It's already recognized by the parser.
try
select * from token_type();
Also, the latest version of tsearch2 (for 7.3 grab from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
for 7.4 - available from CVS)
has a rather useful function, ts_debug:
apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
ts_name | tok_type | description | token          | dict_name | tsvector
--------+----------+-------------+----------------+-----------+------------------
simple  | host     | Host        | www.sai.msu.su | {simple}  | 'www.sai.msu.su'
simple  | lword    | Latin word  | megera         | {simple}  | 'megera'
(2 rows)
I get the feeling that building these indexes will by no means be an
easy task, so any suggestions will be gratefully received!
You may write your own parser, as a last resort. Some info about the parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief
Thanks...
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:
Part1.
I have created a dictionary called 'webwords' which checks all words
and curtails them to 300 chars (for now)
after running
make
make install
I then copied the lib_webwords.so into my $libdir
I have run
psql mybd < dict_webwords.sql
Once you did 'psql mybd < dict_webwords.sql' you should be able to use
it :)
Test it :
select lexize('webwords','some_web_word');
I did test it with
select lexize('webwords','some_web_word');
lexize
-------
{some_web_word}
select lexize('webwords','some_400char_web_word');
lexize
--------
{some_shortened_web_word}
so that bit works, but then I tried
SELECT to_tsvector( 'webwords', 'my words' );
Error: No tsearch config
Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict
yeah, I did read it - it's good!
should I run:
update pg_ts_cfgmap set dict_name='{webwords}';
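One thing worth checking (my reading of tsearch2, so treat it as a hint rather than gospel): the first argument of to_tsvector() names a *configuration* from pg_ts_cfg, not a dictionary, so passing 'webwords' there would fail even with the dictionary installed. A sketch of the usual diagnosis, where the config name 'default' and locale 'C' are assumed examples:

```sql
-- List configurations and the locale each is bound to; the
-- one-argument form of to_tsvector() picks a config by server locale.
SELECT ts_name, prs_name, locale FROM pg_ts_cfg;

-- Example only: bind the 'default' config to locale 'C' so that
-- to_tsvector(text) can find a matching configuration.
UPDATE pg_ts_cfg SET locale = 'C' WHERE ts_name = 'default';
```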
Part2.
<snip>
As the text can be multilingual I don't think stemming is possible?
You're right. I'm afraid you need a UTF database, but tsearch2 isn't
UTF-8 compatible :(
My database was created as unicode - does this mean I cannot use
tsearch?!
I also need to include many non-standard words in the index, such as
URLs and message IDs contained in the text.
What's a message ID? An integer? It's already recognized by the parser.
try
select * from token_type();
Also, the latest version of tsearch2 (for 7.3 grab from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
for 7.4 - available from CVS)
has a rather useful function, ts_debug:
apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
ts_name | tok_type | description | token          | dict_name | tsvector
--------+----------+-------------+----------------+-----------+------------------
simple  | host     | Host        | www.sai.msu.su | {simple}  | 'www.sai.msu.su'
simple  | lword    | Latin word  | megera         | {simple}  | 'megera'
(2 rows)
I get the feeling that building these indexes will by no means be an
easy task, so any suggestions will be gratefully received!
You may write your own parser, as a last resort. Some info about the parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief
Parser writing...scary stuff :-)
Thanks!
--
On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:
Part1.
I have created a dictionary called 'webwords' which checks all words
and curtails them to 300 chars (for now)
after running
make
make install
I then copied the lib_webwords.so into my $libdir
I have run
psql mybd < dict_webwords.sql
Once you did 'psql mybd < dict_webwords.sql' you should be able to use
it :)
Test it :
select lexize('webwords','some_web_word');
I did test it with
select lexize('webwords','some_web_word');
lexize
-------
{some_web_word}
select lexize('webwords','some_400char_web_word');
lexize
--------
{some_shortened_web_word}
so that bit works, but then I tried
SELECT to_tsvector( 'webwords', 'my words' );
Error: No tsearch config
from ref.guide:
to_tsvector( [configuration,] document TEXT) RETURNS tsvector
Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict
yeah, I did read it - it's good!
should I run:
update pg_ts_cfgmap set dict_name='{webwords}';
after loading your dictionary to db you should have it registered in
pg_ts_dict, try
select * from pg_ts_dict;
next, you need to read the docs, for example
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
on how to create your configuration and specify the lexeme-type-to-dictionary
mapping.
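A minimal sketch of that step, assuming the pg_ts_cfg/pg_ts_cfgmap layout described in the tsearch2 V2 docs (the config name 'webcfg', the 'default' parser, locale 'C', and the chosen token aliases are all examples, not taken from this thread):

```sql
-- Create a configuration that uses the standard parser...
INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
VALUES ('webcfg', 'default', 'C');

-- ...and map the token types we care about to the webwords dictionary;
-- token types left unmapped are simply dropped from the tsvector.
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
SELECT 'webcfg', alias, '{webwords}'::text[]
FROM token_type()
WHERE alias IN ('lword', 'nlword', 'word', 'url', 'host', 'email');
```

After that, SELECT to_tsvector('webcfg', 'my words'); should run the text through the webwords dictionary.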
Part2.
<snip>
As the text can be multilingual I don't think stemming is possible?
You're right. I'm afraid you need a UTF database, but tsearch2 isn't
UTF-8 compatible :(
My database was created as unicode - does this mean I cannot use
tsearch?!
We don't have any experience with UTF, so you'd better ask on the openfts mailing
list and read its archives.
I also need to include many non-standard words in the index, such as
URLs and message IDs contained in the text.
What's a message ID? An integer? It's already recognized by the parser.
try
select * from token_type();
Also, the latest version of tsearch2 (for 7.3 grab from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/,
for 7.4 - available from CVS)
has a rather useful function, ts_debug:
apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
ts_name | tok_type | description | token          | dict_name | tsvector
--------+----------+-------------+----------------+-----------+------------------
simple  | host     | Host        | www.sai.msu.su | {simple}  | 'www.sai.msu.su'
simple  | lword    | Latin word  | megera         | {simple}  | 'megera'
(2 rows)
I get the feeling that building these indexes will by no means be an
easy task, so any suggestions will be gratefully received!
You may write your own parser, as a last resort. Some info about the parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief
Parser writing...scary stuff :-)
Thanks!
Regards,
Oleg