questions about tsearch2 (for czech language)

Started by Pavel Stehule over 22 years ago, 7 messages, general
#1 Pavel Stehule
pavel.stehule@gmail.com

Hello

I am trying tsearch2 in a Czech environment. It works fine, but I have two
questions.

1. I have the words "se" and "ve" in my Czech stop word list, but I still get
these words in the result. Why? Is there a problem with my configuration?

tsearch2=# select * from ts_debug('jmenuji se Pavel Stěhule a bydlím ve Skalici.');
    ts_name    | tok_type | description |  token  |  dict_name  | tsvector
---------------+----------+-------------+---------+-------------+-----------
 default_czech | lword    | Latin word  | jmenuji | {cz_ispell} | 'jmenuji'
 default_czech | lword    | Latin word  | se      | {cz_ispell} | 'se'
 default_czech | lword    | Latin word  | Pavel   | {cz_ispell} | 'pavel'
 default_czech | word     | Word        | Stěhule | {cz_ispell} |
 default_czech | lword    | Latin word  | a       | {cz_ispell} |
 default_czech | word     | Word        | bydlím  | {cz_ispell} | 'bydlet'
 default_czech | lword    | Latin word  | ve      | {cz_ispell} | 've'
 default_czech | lword    | Latin word  | Skalici | {cz_ispell} | 'skalici'
(8 řádek)

tsearch2=# select * from pg_ts_cfgmap where ts_name='default_czech';
ts_name | tok_alias | dict_name
---------------+--------------+-------------
default_czech | email | {simple}
default_czech | file | {simple}
default_czech | float | {simple}
default_czech | host | {simple}
default_czech | hword | {cz_ispell}
default_czech | int | {simple}
default_czech | lhword | {cz_ispell}
default_czech | lpart_hword | {cz_ispell}
default_czech | lword | {cz_ispell}
default_czech | nlhword | {cz_ispell}
default_czech | nlpart_hword | {cz_ispell}
default_czech | nlword | {cz_ispell}
default_czech | part_hword | {simple}
default_czech | sfloat | {simple}
default_czech | uint | {simple}
default_czech | uri | {simple}
default_czech | url | {simple}
default_czech | version | {simple}
default_czech | word | {cz_ispell}
(19 řádek)

2. I use a small Czech dictionary. I need words that are not in the
dictionary (Stěhule in my sample) to be kept rather than erased. Can I set
this somewhere? I tried adding the simple dictionary to the cfg map, but
without success.

tsearch2=# select * from ts_debug('jmenuji se Pavel Stěhule a bydlím ve Skalici.');
    ts_name    | tok_type | description |  token  |     dict_name      | tsvector
---------------+----------+-------------+---------+--------------------+-----------
 default_czech | word     | Word        | Stěhule | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | a       | {cz_ispell,simple} |
 default_czech | word     | Word        | bydlím  | {cz_ispell,simple} | 'bydlet'

Thank You
Pavel Stehule

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Pavel Stehule (#1)
Re: questions about tsearch2 (for czech language)

On Mon, 22 Dec 2003, Pavel Stehule wrote:

> 1. I have the words "se" and "ve" in my Czech stop word list, but I still get
> these words in the result. Why? Is there a problem with my configuration?

Did you specify the stop words in the dictionary configuration? Check with:

select * from pg_ts_dict;

> 2. I use a small Czech dictionary. I need words that are not in the
> dictionary (Stěhule in my sample) to be kept rather than erased. Can I set
> this somewhere? I tried adding the simple dictionary to the cfg map, but
> without success.

Please give an example. What do you mean by 'erase words'?

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#3 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Oleg Bartunov (#2)
Re: questions about tsearch2 (for czech language)

> Did you specify the stop words in the dictionary configuration?
>
> select * from pg_ts_dict;

tsearch2=# select * from pg_ts_dict where dict_name ='cz_ispell';
-[ RECORD 1 ]---+--------------------------------------------------------------------------------------------------------------------------
dict_name       | cz_ispell
dict_init       | 173405
dict_initoption | DictFile="/usr/lib/ispell/czech",AffFile="/usr/lib/ispell/czech.aff",StopFile="/usr/local/pgsql/share/contrib/czech.stop"
dict_lexize     | 173406
dict_comment    |

[postgres@usop root]$ cat /usr/local/pgsql/share/contrib/czech.stop|grep -e "^[sv]."
se
sem
si
svůj
ve
v�m
v�
viz
vy

> Please give an example. What do you mean by 'erase words'?

If tsearch2 does not find a word in the dictionary, it erases the word from
the result. Is that right? My surname, for example, is not in the dictionary,
but I want to keep this word in the result (tsvector).

I am using:

tsearch2=# select version();
                                                version
-------------------------------------------------------------------------------------------------------
 PostgreSQL 7.4RC2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3 20030715 (Red Hat Linux 3.3-14)

#4 Oleg Bartunov
oleg@sai.msu.su
In reply to: Pavel Stehule (#3)
Re: questions about tsearch2 (for czech language)

Pavel,

did you restart your psql session after modifying the tsearch2 configuration?
By the way, a Czech dictionary is available from
http://lingucomponent.openoffice.org/download_dictionary.html
We have a utility to convert myspell dictionaries to ispell format. It is
included in 7.5 development. A patch for 7.4 can be downloaded from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

Also, historically, we use the openfts mailing list for discussion of
tsearch2.

Oleg

#5 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Oleg Bartunov (#4)
Solved: questions about tsearch2 (for czech language)

Oleg,

You are right. After restarting the postmaster everything works fine.

tsearch2=# select to_tsvector('default_czech','Jmenuji se Pavel Stěhule');
            to_tsvector
------------------------------------
 'pavel':3 'stěhule':4 'jmenovat':1
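
The result above (stop word "se" dropped, unknown surname kept) fits the usual description of a dictionary chain such as {cz_ispell,simple}: each dictionary either returns lexemes, returns an empty list for a stop word, or declines the token so the next dictionary gets a turn. The following is only a rough Python model of that idea, not tsearch2 code. The names `make_ispell_like`, `simple`, `lexize_chain`, and `to_lexemes` and the tiny lemma table are hypothetical, made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# A dictionary maps a token to one of:
#   None            -> token not recognised, let the next dictionary try
#   []              -> token recognised as a stop word, index nothing
#   [lexeme, ...]   -> normalised forms to index
Dictionary = Callable[[str], Optional[List[str]]]


def make_ispell_like(lemmas: Dict[str, str], stop_words: Set[str]) -> Dictionary:
    """Hypothetical stand-in for cz_ispell: a lemma table plus a stop list."""
    def lexize(token: str) -> Optional[List[str]]:
        lemma = lemmas.get(token.lower())
        if lemma is None:
            return None                      # unknown word: decline
        return [] if lemma in stop_words else [lemma]
    return lexize


def simple(token: str) -> Optional[List[str]]:
    """Stand-in for the simple dictionary: accept anything, lowercased."""
    return [token.lower()]


def lexize_chain(token: str, chain: List[Dictionary]) -> List[str]:
    """Ask each dictionary in order; the first one that answers wins."""
    for dictionary in chain:
        result = dictionary(token)
        if result is not None:
            return result
    return []                                # nobody recognised it: dropped


def to_lexemes(text: str, chain: List[Dictionary]) -> List[Tuple[str, int]]:
    """Return (lexeme, position) pairs, positions counted from 1."""
    found = []
    for position, token in enumerate(text.split(), start=1):
        for lexeme in lexize_chain(token, chain):
            found.append((lexeme, position))
    return found


# Toy data, assumed for the example only.
cz_ispell = make_ispell_like(
    lemmas={"jmenuji": "jmenovat", "se": "se", "pavel": "pavel"},
    stop_words={"se"},
)

print(to_lexemes("Jmenuji se Pavel Stěhule", [cz_ispell]))
# [('jmenovat', 1), ('pavel', 3)]
print(to_lexemes("Jmenuji se Pavel Stěhule", [cz_ispell, simple]))
# [('jmenovat', 1), ('pavel', 3), ('stěhule', 4)]
```

With only the ispell-like dictionary in the chain the surname disappears. With simple appended it is kept, while the stop word is still dropped because the first dictionary already answered for it.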

Thank You very much

Pavel Stehule


#6 Teodor Sigaev
teodor@sigaev.ru
In reply to: Pavel Stehule (#5)
Re: Solved: questions about tsearch2 (for czech language)

> You are right. After restarting the postmaster everything works fine.

One comment: you don't need to restart the postmaster. It is enough to
reconnect to PostgreSQL by exiting and starting psql again. Every new
connection creates a new child process of the postmaster.
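
Why a reconnect is enough can be pictured with a small Python model. It assumes, as this thread suggests but does not state outright, that each backend process loads the tsearch2 configuration once and keeps its own copy. The classes `Postmaster` and `Backend` below are hypothetical illustrations, not PostgreSQL APIs.

```python
from typing import FrozenSet, Iterable, Optional, Set


class Postmaster:
    """Hypothetical model: owns the stored configuration and spawns backends."""

    def __init__(self, stop_words: Iterable[str]) -> None:
        # Stands in for the stop word file referenced by the dictionary.
        self.stop_words: Set[str] = set(stop_words)

    def connect(self) -> "Backend":
        # Every new connection gets its own child process (backend).
        return Backend(self)


class Backend:
    """Hypothetical model of one session with a private configuration cache."""

    def __init__(self, postmaster: Postmaster) -> None:
        self._postmaster = postmaster
        self._cached: Optional[FrozenSet[str]] = None

    def is_stop_word(self, word: str) -> bool:
        if self._cached is None:
            # Assumed behaviour: loaded on first use, then never refreshed.
            self._cached = frozenset(self._postmaster.stop_words)
        return word in self._cached


server = Postmaster(stop_words=[])
old_session = server.connect()
before_change = old_session.is_stop_word("se")   # cache filled here

server.stop_words.add("se")                      # configuration is modified

stale = old_session.is_stop_word("se")           # old session still sees old data
fresh = server.connect().is_stop_word("se")      # a new session sees the change

print(before_change, stale, fresh)
# False False True
```

Under that assumption a full server restart works too, but only because it forces every session to reconnect.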

--
Teodor Sigaev E-mail: teodor@sigaev.ru

#7 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Teodor Sigaev (#6)
Re: Solved: questions about tsearch2 (for czech language)

On Tue, 23 Dec 2003, Teodor Sigaev wrote:

> One comment: you don't need to restart the postmaster. It is enough to
> reconnect to PostgreSQL by exiting and starting psql again. Every new
> connection creates a new child process of the postmaster.

True, but I like hard solutions :->
"/etc/init.d/postgresql restart" is my favourite command.

I am the only one working on this database, so I can use force.

Pavel
