Re: TSearch2 / German compound words / UTF-8
Tsearch/ispell is not able to break this word into parts, because
of the "s" in "Produktion/s/intervall". Misspelling the word as
"Produktionintervall" fixes it:

It should be affixes marked as 'affix in middle of compound word'.
The flag is '~'; for an example look in the norsk dictionary:

flag ~\\:
    [^S] > S    #~ advarsel > advarsels-

BTW, we develop and debug compound word support on the norsk
(Norwegian) dictionary, so look for an example there. But we don't
know Norwegian; Norwegians helped us :)
Hello everyone!
I cannot get this to work, neither in a German version nor with the
Norwegian example supplied on the tsearch website.
That means, just like Hannes, I only get compound word support without
the inserted 's', in German and in Norwegian:
"Vertragstrafe" works, but not "Vertragsstrafe", which is the correct
form.
So I tried it the other way around: My dictionary consists of two words:
---
vertrag/zs
strafe/z
---
My affixes file just switches on compounds and allows for s-insertion
as described in the Norwegian tutorial:
---
compoundwords controlled z
suffixes
flag s:
[^S] > S    # does not end in "s": append "s" and include in the
            # compound check ("Recht" > "Rechts-")
---
ts_debug yields:

tstest=# SELECT tsearch2.ts_debug('vertragstrafe strafevertrag vertragsstrafe');
                                      ts_debug
-------------------------------------------------------------------------------------
 (german,lword,"Latin word",vertragstrafe,"{ispell_de,simple}","'strafe' 'vertrag'")
 (german,lword,"Latin word",strafevertrag,"{ispell_de,simple}","'strafe' 'vertrag'")
 (german,lword,"Latin word",vertragsstrafe,"{ispell_de,simple}",'vertragsstrafe')
(3 rows)
I would say the ispell compound support does not honor the s-flag in
compounds.
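What I expect can be sketched as a toy model (Python, for illustration
only; the split loop and the linking-'s' handling are my guesses at the
intended behavior, not tsearch2's actual code; the two-word dictionary
is the one from my test files above):

```python
# Toy model of compound lookup with an optional linking "s".
# A sketch of the expected behavior, not tsearch2's implementation.

words = {"vertrag", "strafe"}   # both words may appear in compounds
s_affix = {"vertrag"}           # words carrying the s-flag ("vertrag/zs")

def split_compound(token):
    """Try to split token into two dictionary words, allowing an
    inserted linking 's' after a word that carries the s-affix."""
    for i in range(1, len(token)):
        head, tail = token[:i], token[i:]
        # plain split: "vertrag" + "strafe"
        if head in words and tail in words:
            return [head, tail]
        # split with linking "s": "vertrag" + "s" + "strafe"
        if head.endswith("s") and head[:-1] in s_affix and tail in words:
            return [head[:-1], tail]
    return None

# "vertragsstrafe" should split just like "vertragstrafe" does:
print(split_compound("vertragstrafe"))   # ['vertrag', 'strafe']
print(split_compound("vertragsstrafe"))  # ['vertrag', 'strafe']
```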
Could it be that this feature was lost in a regression? It must have
worked for Norwegian once. (Take the "overtrekksgrilldresser" example
from the tsearch2 compounds tutorial, which I cannot reproduce.)
Any hints?
Alexander
I should add that, with the minimal dictionary and .aff file,
"vertrags" gets reduced correctly, dropping the trailing 's':

tstest=# SELECT tsearch2.ts_debug('vertrags');
                              ts_debug
---------------------------------------------------------------------
 (german,lword,"Latin word",vertrags,"{ispell_de,simple}",'vertrag')
(1 row)
The affix is just not applied while looking for compound words.
Sincerely yours
Alexander Presber
Alexander,
could you try tsearch2 from CVS HEAD?
tsearch2 in 8.1.X doesn't support UTF-8 and only works by accident
for some people :)
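To illustrate what "by accident" means (plain Python here, just for
demonstration, not tsearch2 code): byte-oriented string handling happens
to work for single-byte encodings, because byte positions equal character
positions, but it can cut a multibyte UTF-8 character in half:

```python
# Single-byte encodings: one byte per character, so byte-oriented
# code happens to produce correct results.
word = "Straße"
assert len(word.encode("latin-1")) == 6   # 6 characters, 6 bytes

# In UTF-8 the same word is 7 bytes, because "ß" takes two bytes.
utf8 = word.encode("utf-8")
assert len(utf8) == 7

# Cutting the byte string at a "character" position can split the
# two-byte "ß", leaving invalid UTF-8 behind:
broken = utf8[:5]  # ends in the middle of "ß"
try:
    broken.decode("utf-8")
except UnicodeDecodeError:
    print("byte slicing broke a multibyte character")
```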
Oleg
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
contrib_regression=# insert into pg_ts_dict values (
'norwegian_ispell',
(select dict_init from pg_ts_dict where dict_name='ispell_template'),
'DictFile="/usr/local/share/ispell/norsk.dict" ,'
'AffFile ="/usr/local/share/ispell/norsk.aff"',
(select dict_lexize from pg_ts_dict where dict_name='ispell_template'),
'Norwegian ISpell dictionary'
);
INSERT 16681 1
contrib_regression=# select lexize('norwegian_ispell','politimester');
lexize
------------------------------------------
{politimester,politi,mester,politi,mest}
(1 row)
contrib_regression=# select lexize('norwegian_ispell','sjokoladefabrikk');
lexize
--------------------------------------
{sjokoladefabrikk,sjokolade,fabrikk}
(1 row)
contrib_regression=# select lexize('norwegian_ispell','overtrekksgrilldresser');
lexize
-------------------------
{overtrekk,grill,dress}
(1 row)
% psql -l
List of databases
Name | Owner | Encoding
--------------------+--------+----------
contrib_regression | teodor | KOI8
postgres | pgsql | KOI8
template0 | pgsql | KOI8
template1 | pgsql | KOI8
(4 rows)
I'm afraid that's a UTF-8 problem. We just committed multibyte support
for tsearch2 to CVS HEAD, so you can try it.
Please notice: the dict, aff and stopword files should be in the server
encoding. Snowball sources for German (and other languages) in UTF-8 can
be found at http://snowball.tartarus.org/dist/libstemmer_c.tgz
To all: maybe we should put all of Snowball's stemmers (for all available
languages and encodings) into the tsearch2 directory?
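To illustrate why the file encodings must match the server encoding (a
Python sketch, not tsearch2 code): a dictionary entry written in
LATIN1/LATIN9 but read as UTF-8 either fails to decode outright or,
with lenient decoding, comes out silently garbled:

```python
# A dictionary entry saved in a single-byte Latin encoding ...
data = "größe".encode("latin-1")   # b'gr\xf6\xdfe'

# ... read back assuming UTF-8 either fails ...
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# ... or, with lenient decoding, silently garbles the entry:
print(data.decode("utf-8", errors="replace"))  # 'gr??e' with U+FFFD marks
```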
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor,
To all: May be, we should put all snowball's stemmers (for all available
languages and encodings) to tsearch2 directory?
Yes, that would be VERY helpful. Up to now I have not dared to use
tsearch2 because of the "get the stemmer here, get the dictionary
there..." hassle.
Harald
--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Reinsburgstraße 202b
70197 Stuttgart
0173/9409607
Hmm, we could provide the Snowball stemmers tsearch2-ready (about
700 kB), but the ispell dictionaries could be very large.
Regards,
Oleg
I would be willing to host them. I have plenty of space and bandwidth
(within reason).
--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
Hello,
Thanks for your efforts; I still can't get it to work.
I have now tried the Norwegian example. My encoding is ISO-8859 (I
never used UTF-8, because I thought it would be slower; the thread
name is a bit misleading).
So I am using a LATIN9 (ISO-8859-15) database:
~/cvs/ssd% psql -l
Name | Owner | Encoding
-----------+------------+-----------
postgres | postgres | LATIN9
tstest | aljoscha | LATIN9
and a Norwegian, ISO-8859 encoded dictionary and aff file:
~% file tsearch/dict/ispell_no/norwegian.dict
tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
~% file tsearch/dict/ispell_no/norwegian.aff
tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text
the aff-file contains the lines:
compoundwords controlled z
...
# to compounds only:
flag ~\\:
[^S] > S
and the dictionary contains:
overtrekk/BCW\z
(meaning: the word can be a compound part; an intermediary "s" is allowed)
My configuration is:
tstest=# SELECT * FROM tsearch2.pg_ts_cfg;
ts_name | prs_name | locale
-----------+----------+------------
simple | default | de_DE@euro
german | default | de_DE@euro
norwegian | default | de_DE@euro
Now the test:
tstest=# SELECT tsearch2.lexize('ispell_no','overtrekksgrill');
lexize
--------
(1 row)
BUT:
tstest=# SELECT tsearch2.lexize('ispell_no','overtrekkgrill');
lexize
------------------------------------
{over,trekk,grill,overtrekk,grill}
(1 row)
It simply doesn't work. No UTF-8 is involved.
Sincerely yours,
Alexander Presber
P.S.: Henning: Sorry for bothering you with the CC, just ignore it,
if you like.
Very strange...
~% file tsearch/dict/ispell_no/norwegian.dict
tsearch/dict/ispell_no/norwegian.dict: ISO-8859 C program text
~% file tsearch/dict/ispell_no/norwegian.aff
tsearch/dict/ispell_no/norwegian.aff: ISO-8859 English text
Can you place those files anywhere where I can download them (or mail
them directly to me)?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
BTW, if you took the Norwegian dictionary from
http://folk.uio.no/runekl/dictionary.html, then try building it from the
OpenOffice sources (http://lingucomponent.openoffice.org/spell_dic.html,
tsearch2/my2ispell).
I found mails in my archive saying that Norwegian people prefer the
OpenOffice one.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Norwegian (Nynorsk and Bokmaal) ispell dictionaries are available from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
I didn't test them.
Oleg
Hmm, I have found a small bug:
when a compound affix has a search pattern of zero length (which should
not happen!), the ispell dictionary ignores all other compound affixes.
The original affix file contains:
flag ~\`:
E > -E,NINGS #~ avskrive > avskrivnings-
Z Y Z Y Z Y > -ZYZYZY,- #- flerezyzyzy > fler-
The ZYZYZY rule disables the other affixes. That's why my2ispell
removes the zyzyzy affix...
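The effect is easy to demonstrate with a toy rule table (Python for
illustration; the rule list and the first-match loop are a simplified
model of the matching, not the actual dictionary code):

```python
import re

# Hypothetical compound-affix rule table: (search pattern, replacement).
# The empty pattern stands in for a zero-length search pattern.
rules = [
    ("", "-"),        # zero-length search pattern: matches everything
    ("e$", "NINGS"),  # the useful rule: avskrive > avskrivnings-
]

def first_matching_rule(word, rules):
    """Return the first rule whose pattern matches the word."""
    for pattern, replacement in rules:
        if re.search(pattern, word):
            return (pattern, replacement)
    return None

# The empty pattern wins for every word, masking the e > NINGS rule:
print(first_matching_rule("avskrive", rules))       # ('', '-')

# With the zero-length rule removed, the real compound affix applies:
print(first_matching_rule("avskrive", rules[1:]))   # ('e$', 'NINGS')
```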
I have fixed it in the dictionary code. Try the attached patch; I'll
apply it to CVS on Monday.
Thanks a lot for your persistence.