Re: [OpenFTS-general] AW: tsearch2, ispell, utf-8 and german special characters
Hi!
-----Original Message-----
From: openfts-general-admin@lists.sourceforge.net
[mailto:openfts-general-admin@lists.sourceforge.net] On
behalf of Markus Wollny
Sent: Wednesday, 21 July 2004 17:04
To: Oleg Bartunov
Cc: pgsql-general@postgresql.org;
openfts-general@lists.sourceforge.net
Subject: [OpenFTS-general] AW: [GENERAL] tsearch2, ispell,
utf-8 and german special characters
The issue with the unrecognized stop word 'ein', which
to_tsvector converts to 'eint', remains, however. Here is
as much detail as I can provide:
Ispell is Version 3.1.20 10/10/95, patch 1.
I've just upgraded Ispell to the latest version (International Ispell Version 3.2.06 08/01/01), but that didn't help. By now I think it might have something to do with a peculiarity of the German language or with something in the German dictionary. In german.med, there is an entry
eint/EGPVWX
So the tsvector output looks like a wrong guess. Doesn't it evaluate the stop-word list first, before doing the lookup in the Ispell dictionary?
Kind regards
Markus Wollny
On Wed, 21 Jul 2004, Markus Wollny wrote:
> The issue with the unrecognized stop word 'ein', which
> to_tsvector converts to 'eint', remains, however. Here is
> as much detail as I can provide:
> Ispell is Version 3.1.20 10/10/95, patch 1.
> I've just upgraded Ispell to the latest version (International Ispell Version 3.2.06 08/01/01), but that didn't help. By now I think it might have something to do with a peculiarity of the German language or with something in the German dictionary. In german.med, there is an entry
Ispell itself is not used in tsearch2, only its dictionary and affix (dict, aff) files!
> eint/EGPVWX
> So the tsvector output looks like a wrong guess. Doesn't it evaluate the stop-word list first, before doing the lookup in the Ispell dictionary?
Yes. There is a very useful function for debugging that I always recommend:
ts_debug. See my notes (http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes)
for examples.
> Kind regards
> Markus Wollny
_______________________________________________
OpenFTS-general mailing list
OpenFTS-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openfts-general
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Hi!
ts2test=# select * from ts_debug('Jeden Tag wird man ein bisschen weiser');
    ts_name     | tok_type | description |  token   |  dict_name  |  tsvector
----------------+----------+-------------+----------+-------------+------------
 default_german | lword    | Latin word  | Jeden    | {de_ispell} |
 default_german | lword    | Latin word  | Tag      | {de_ispell} | 'tag'
 default_german | lword    | Latin word  | wird     | {de_ispell} |
 default_german | lword    | Latin word  | man      | {de_ispell} |
 default_german | lword    | Latin word  | ein      | {de_ispell} | 'eint'
 default_german | lword    | Latin word  | bisschen | {de_ispell} | 'bisschen'
 default_german | lword    | Latin word  | weiser   | {de_ispell} | 'weise'
(7 rows)
cat german.stop|grep ^ein$
ein
'jeden', 'man', 'wird' and 'ein' are all in german.stop. The first three words are correctly recognized as stop words, whereas the last one is converted to 'eint', although 'ein' is a stop word, too. I still don't understand what exactly is happening, or whether I should be concerned by this sort of "wrong guess". So 'ein' is converted to 'eint' every time, whether or not it is in the stop-word file. On the other hand, as this applies to to_tsvector(), to_tsquery() and lexize() alike, the behaviour would be consistent throughout tsearch2, making any search containing 'ein' a little bit fuzzier, but still usable. It's still some sort of cosmetic bug, though, but I guess that's probably due to German being somewhat less IT-friendly than English.
Kind regards
Markus
Markus,
I was not quite correct: different dictionaries handle stop words in
different ways! Stemmers check the stop-word list before normalization, while ispell checks it after normalization.
So, in your case, you need 'eint' listed in the stop-word list.
Oleg
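For illustration only, here is a toy Python model of the two orderings described above. It is not tsearch2 code: the function names, the small stop list and the one-entry normalization table are all made up for this sketch, and the only mapping taken from the thread is 'ein' -> 'eint' as shown in the ts_debug output.

```python
# Toy model of stop-word handling order; NOT tsearch2 code.
# All names and data here are hypothetical, for illustration only.

STOP_WORDS = {"jeden", "wird", "man", "ein"}

# Stand-in for dictionary normalization. The only mapping taken from the
# thread is 'ein' -> 'eint'; every other word passes through lowercased.
NORMALIZE = {"ein": "eint"}


def normalize(token):
    """Lowercase the token and apply the toy normalization table."""
    word = token.lower()
    return NORMALIZE.get(word, word)


def stop_check_before(token, stop_words):
    """Stemmer-style order: test the raw token, then normalize."""
    if token.lower() in stop_words:
        return None  # dropped as a stop word
    return normalize(token)


def stop_check_after(token, stop_words):
    """Ispell-style order: normalize first, then test the result."""
    lexeme = normalize(token)
    if lexeme in stop_words:
        return None  # dropped as a stop word
    return lexeme


if __name__ == "__main__":
    # Checked before normalization, 'ein' is caught by the stop list.
    print(stop_check_before("ein", STOP_WORDS))
    # Checked after normalization, 'eint' is not in the list and survives.
    print(stop_check_after("ein", STOP_WORDS))
    # Adding 'eint' to the stop list makes the second ordering drop it too.
    print(stop_check_after("ein", STOP_WORDS | {"eint"}))
```

In this model, 'jeden', 'wird' and 'man' are dropped under either ordering because they normalize to themselves, which matches the empty tsvector column for those tokens in the ts_debug output.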