fulltext search and hunspell
Hey,
I want to use hunspell as a dictionary for the full text search by
* using PostgreSQL 8.4.7
* installing hunspell-de-de, hunspell-de-med
* creating a dictionary:
CREATE TEXT SEARCH DICTIONARY german_hunspell (
TEMPLATE = ispell,
DictFile = de_de,
AffFile = de_de,
StopWords = german
);
* changing the config
ALTER TEXT SEARCH CONFIGURATION german
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH german_hunspell, german_stem;
* now testing the dictionary with ts_lexize:
SELECT ts_lexize('german_hunspell', 'Schokaladenfarik');
ts_lexize
-----------
(1 row)
Shouldn't it be something like this:
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
{sjokoladefabrikk,sjokolade,fabrikk}
(from the 8.4 documentation of PostgreSQL)
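For reference, the 8.4 documentation describes three possible results of ts_lexize: an array of lexemes for a known word, an empty array for a stop word, and NULL for a word the dictionary does not know. A minimal sketch that separates these cases, assuming the german_hunspell dictionary defined above, and assuming that 'Schokolade' is in the dictionary file and 'und' is in the german stop word list:

```sql
-- Simple word: a non-NULL array here would suggest that the
-- .dict/.affix files load and match at all.
SELECT ts_lexize('german_hunspell', 'Schokolade');

-- Stop word: an empty array {} is expected if the german stop list is applied.
SELECT ts_lexize('german_hunspell', 'und');

-- Compound word: NULL here, with the simple word recognized above,
-- would point at compound splitting rather than at the dictionary as a whole.
SELECT ts_lexize('german_hunspell', 'Schokoladenfabrik');
```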
The dict and affix files in the tsearch_data directory were
automatically generated by pg_updatedicts.
Is this a problem with the compound-word splitting functionality? Should
I use ispell instead of hunspell?
Thanks
Jens,
could you check the affix file for the line
compoundwords controlled z
Also, can you provide a link to the dictionary files, so we can check whether
they are supported, since we have only rudimentary support for hunspell.
BTW, it'd be nice to have the output from ts_debug() to make sure the
dictionaries are actually used.
Oleg
On Mon, 7 Feb 2011, Jens Sauer wrote:
[...]
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Hey,
thanks for your answer.
First I checked the links in the tsearch_data directory:
de_de.affix and de_de.dict are symlinks to the corresponding files in
/var/cache/postgresql/dicts/
Then I recreated them by using pg_updatedicts.
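To double-check from inside the database which files the dictionary points at, something like the following may help. This is only a sketch: it assumes the dictionary is still named german_hunspell as in the first message, and it reads the pg_ts_dict system catalog directly.

```sql
-- dictinitoption holds the options given at CREATE TEXT SEARCH DICTIONARY time
-- (DictFile, AffFile, StopWords).
SELECT dictname, dictinitoption
FROM pg_catalog.pg_ts_dict
WHERE dictname = 'german_hunspell';
```

PostgreSQL appends the .dict, .affix and .stop extensions to those names and looks for the files in the tsearch_data directory.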
This is an extract of the de_de.affix file:
# this is the affix file of the de_DE Hunspell dictionary
# derived from the igerman98 dictionary
#
# Version: 20091006 (build 20100127)
#
# Copyright (C) 1998-2009 Bjoern Jacke <bjoern@j3e.de>
#
# License: GPLv2, GPLv3 or OASIS distribution license agreement
# There should be a copy of both of this licenses included
# with every distribution of this dictionary. Modified
# versions using the GPL may only include the GPL
SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxzäüößáéêàâñESIJANRTOLCDUGMPHBYFVKWQXZÄÜÖÉ-.
PFX U Y 1
PFX U 0 un .
PFX V Y 1
PFX V 0 ver .
SFX F Y 35
[...]
I cannot find "compoundwords controlled z" there, so I manually added it.
[...]
# versions using the GPL may only include the GPL
compoundwords controlled z
SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxzäüößáéêàâñESIJANRTOLCDUGMPHBYFVKWQXZÄÜÖÉ-.
[...]
Then I restarted PostgreSQL.
Now I get an error:
SELECT * FROM ts_debug('Schokoladenfabrik');
FEHLER: falsches Affixdateiformat für Flag
CONTEXT: Zeile 18 in Konfigurationsdatei
»/usr/share/postgresql/8.4/tsearch_data/de_de.affix«: »PFX U Y 1
«
SQL-Funktion »ts_debug« Anweisung 1
SQL-Funktion »ts_debug« Anweisung 1
Which means:
ERROR: wrong affix file format for flag
CONTEXT: line 18 of configuration file ...
If I add
COMPOUNDFLAG Z
ONLYINCOMPOUND L
instead of "compoundwords controlled z"
I don't get an error:
SELECT * FROM ts_debug('Schokoladenfabrik');
   alias   |   description   |       token       |         dictionaries          | dictionary  |      lexemes
-----------+-----------------+-------------------+-------------------------------+-------------+-------------------
 asciiword | Word, all ASCII | Schokoladenfabrik | {german_hunspell,german_stem} | german_stem | {schokoladenfabr}
(1 row)
But it seems that the hunspell dictionary is not working for compound words.
Maybe pg_updatedicts has a bug and generates affix files in the wrong format?
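In the ts_debug output, the dictionary column names the dictionary that actually recognized the token, so german_stem there suggests that german_hunspell returned NULL and the token fell through to the stemmer. A small sketch to confirm this by calling each dictionary directly (dictionary names as in this thread; the explicit 'german' configuration argument is an assumption, since the call above relied on default_text_search_config):

```sql
-- Ask each dictionary on its own; NULL means "unknown word" for that dictionary.
SELECT ts_lexize('german_hunspell', 'Schokoladenfabrik') AS hunspell_result,
       ts_lexize('german_stem',     'Schokoladenfabrik') AS stem_result;

-- Same check through the configuration, naming it explicitly.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('german', 'Schokoladenfabrik');
```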
Jens
2011/2/7 Oleg Bartunov <oleg@sai.msu.su>:
[...]
Jens,
have you tried the German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
Oleg
On Tue, 8 Feb 2011, Jens Sauer wrote:
[...]
Regards,
Oleg
Thanks for this tip,
the German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ works fine.
I think the problem was the rudimentary support of hunspell dictionaries.
Thanks for your help and your great software!
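For anyone finding this thread later, a setup along these lines is one way to use that dictionary. It is a sketch, not necessarily the exact commands used here: german_ispell is a hypothetical dictionary name, the file names german.dict and german.affix are assumptions (use whatever names the downloaded files end up with in tsearch_data), and the files probably need converting to UTF-8 first, since PostgreSQL expects text search configuration files in UTF-8.

```sql
-- Hypothetical names: german_ispell, and german.dict / german.affix in tsearch_data.
CREATE TEXT SEARCH DICTIONARY german_ispell (
    TEMPLATE  = ispell,
    DictFile  = german,
    AffFile   = german,
    StopWords = german
);

-- Try the ispell dictionary first and fall back to the Snowball stemmer.
ALTER TEXT SEARCH CONFIGURATION german
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_ispell, german_stem;

-- If compound splitting works, the result should contain the parts of the compound.
SELECT ts_lexize('german_ispell', 'Schokoladenfabrik');
```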
On 08.02.2011 at 11:34, Oleg Bartunov wrote:
[...]