fulltext search and hunspell

Started by Jens Sauer over 15 years ago | 5 messages | general
#1 Jens Sauer
jsauer65@googlemail.com

Hey,

I want to use hunspell as a dictionary for full text search by

* using PostgreSQL 8.4.7
* installing hunspell-de-de, hunspell-de-med
* creating a dictionary:

CREATE TEXT SEARCH DICTIONARY german_hunspell (
    TEMPLATE = ispell,
    DictFile = de_de,
    AffFile = de_de,
    StopWords = german
);

* changing the config

ALTER TEXT SEARCH CONFIGURATION german
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH german_hunspell, german_stem;

* now testing the lexizer:

SELECT ts_lexize('german_hunspell', 'Schokaladenfarik');
ts_lexize
-----------

(1 row)

Shouldn't it be something like this:
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
{sjokoladefabrikk,sjokolade,fabrikk}
(from the 8.4 documentation of PostgreSQL)

The dict and affix files in the tsearch_data directory were
automatically generated by pg_updatedicts.

Is this a problem with the compound-word splitting functionality? Should
I use ispell instead of hunspell?

Thanks

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Jens Sauer (#1)
Re: fulltext search and hunspell

Jens,

could you check the affix file for
compoundwords controlled z

Also, can you provide a link to the dictionary files, so we can check whether
they are supported? We have only rudimentary support for hunspell.
By the way, it would be nice to see the output of ts_debug() to make sure the
dictionaries are actually used.

Oleg

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
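
The affix-file check requested in this message can be done by eye, but a small script makes it repeatable. The following is a minimal Python sketch, not part of PostgreSQL or pg_updatedicts: the function name compound_directives and the sample text are hypothetical, and the script only looks for the Ispell-style "compoundwords" line and for Hunspell-style keywords containing "COMPOUND" (such as COMPOUNDFLAG and ONLYINCOMPOUND).

```python
def compound_directives(affix_text):
    """Return (line_number, line) pairs for lines that look like compound-word directives."""
    hits = []
    for number, raw in enumerate(affix_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword = line.split()[0]
        # Ispell-style: "compoundwords controlled z"
        # Hunspell-style: COMPOUNDFLAG, ONLYINCOMPOUND, ...
        if keyword.lower() == "compoundwords" or "COMPOUND" in keyword:
            hits.append((number, line))
    return hits


# Hypothetical sample in the style of a Hunspell affix file.
sample = """\
# this is the affix file of the de_DE Hunspell dictionary
SET ISO8859-1

PFX U Y 1
PFX U 0 un .
"""

print(compound_directives(sample))  # no compound directive in this sample
print(compound_directives(sample + "COMPOUNDFLAG Z\nONLYINCOMPOUND L\n"))
```

An empty result means the file declares no compound handling at all, which is worth knowing before testing compound words with ts_lexize.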

#3 Jens Sauer
jsauer65@googlemail.com
In reply to: Oleg Bartunov (#2)
Re: fulltext search and hunspell

Hey,

thanks for your answer.

First I checked the links in the tsearch_data directory:
de_de.affix and de_de.dict are symlinks to the corresponding files in
/var/cache/postgresql/dicts/
Then I recreated them using pg_updatedicts.

This is an extract of the de_de.affix file:

# this is the affix file of the de_DE Hunspell dictionary
# derived from the igerman98 dictionary
#
# Version: 20091006 (build 20100127)
#
# Copyright (C) 1998-2009 Bjoern Jacke <bjoern@j3e.de>
#
# License: GPLv2, GPLv3 or OASIS distribution license agreement
# There should be a copy of both of this licenses included
# with every distribution of this dictionary. Modified
# versions using the GPL may only include the GPL

SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxzäüößáéêàâñESIJANRTOLCDUGMPHBYFVKWQXZÄÜÖÉ-.

PFX U Y 1
PFX U 0 un .

PFX V Y 1
PFX V 0 ver .

SFX F Y 35
[...]

I cannot find "compoundwords controlled z" there, so I manually added it.

[...]
# versions using the GPL may only include the GPL

compoundwords controlled z

SET ISO8859-1
TRY esijanrtolcdugmphbyfvkwqxzäüößáéêàâñESIJANRTOLCDUGMPHBYFVKWQXZÄÜÖÉ-.
[...]

Then I restarted PostgreSQL.

Now I get an error:
SELECT * FROM ts_debug('Schokoladenfabrik');
FEHLER: falsches Affixdateiformat für Flag
CONTEXT: Zeile 18 in Konfigurationsdatei
»/usr/share/postgresql/8.4/tsearch_data/de_de.affix«: »PFX U Y 1
«
SQL-Funktion »ts_debug« Anweisung 1
SQL-Funktion »ts_debug« Anweisung 1

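Which means:
ERROR: wrong affix file format for flag
CONTEXT: line 18 of configuration file
"/usr/share/postgresql/8.4/tsearch_data/de_de.affix": "PFX U Y 1
"
SQL function "ts_debug" statement 1
SQL function "ts_debug" statement 1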
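
A likely reading of this error, offered as an interpretation rather than something confirmed in the thread: "compoundwords controlled z" is Ispell-format syntax, while the PFX and SFX lines are Hunspell/MySpell-format, so the edited file mixes two formats and the parser rejects the PFX line that follows. The Python sketch below flags such a mix. The function name affix_styles and the keyword sets are assumptions chosen for illustration, not an exhaustive list for either format.

```python
# Keyword sets are illustrative assumptions, not a complete grammar.
ISPELL_KEYWORDS = {"compoundwords", "prefixes", "suffixes"}
HUNSPELL_KEYWORDS = {"PFX", "SFX", "COMPOUNDFLAG", "ONLYINCOMPOUND"}


def affix_styles(affix_text):
    """Map each detected style to the first line number where it appears."""
    first_seen = {}
    for number, raw in enumerate(affix_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword = line.split()[0]
        if keyword.lower() in ISPELL_KEYWORDS:
            first_seen.setdefault("ispell", number)
        elif keyword in HUNSPELL_KEYWORDS:
            first_seen.setdefault("hunspell", number)
    return first_seen


# Shortened version of the manually edited file described above.
edited = """\
compoundwords controlled z

SET ISO8859-1

PFX U Y 1
PFX U 0 un .
"""

styles = affix_styles(edited)
if len(styles) > 1:
    print("mixed affix formats:", styles)
```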

If I instead add
COMPOUNDFLAG Z
ONLYINCOMPOUND L

in place of "compoundwords controlled z", I no longer get an error:

SELECT * FROM ts_debug('Schokoladenfabrik');
   alias   |   description   |       token       |         dictionaries          | dictionary  |      lexemes
-----------+-----------------+-------------------+-------------------------------+-------------+-------------------
 asciiword | Word, all ASCII | Schokoladenfabrik | {german_hunspell,german_stem} | german_stem | {schokoladenfabr}
(1 row)

But the lexeme comes from german_stem, so it seems that the hunspell
dictionary is still not working for compound words.

Maybe pg_updatedicts has a bug and generates affix files in the wrong format?

Jens
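
The ts_debug output above is consistent with how a dictionary list is consulted: each dictionary is tried in order, a dictionary that does not recognize the token returns NULL, and the first non-NULL result wins. Since german_stem produced the lexeme, german_hunspell evidently did not recognize the word. The toy Python model below only illustrates that fall-through; toy_hunspell, toy_stem and first_recognizing are hypothetical stand-ins, not PostgreSQL code, and no real stemming or affix handling is performed.

```python
def toy_hunspell(token):
    """Stand-in for a dictionary without compound splitting: known words only."""
    known = {"schokolade", "fabrik"}
    word = token.lower()
    return [word] if word in known else None  # None models SQL NULL


def toy_stem(token):
    """Stand-in for a stemmer: accepts every token (no real stemming here)."""
    return [token.lower()]


def first_recognizing(token, chain):
    """Return (dictionary_name, lexemes) from the first dictionary that answers."""
    for name, lookup in chain:
        lexemes = lookup(token)
        if lexemes is not None:
            return name, lexemes
    return None


chain = [("toy_hunspell", toy_hunspell), ("toy_stem", toy_stem)]
print(first_recognizing("Schokoladenfabrik", chain))  # falls through to toy_stem
print(first_recognizing("Fabrik", chain))             # answered by toy_hunspell
```

This is why a stemmer is usually placed last in the list: it accepts every token, so anything listed after it would never be consulted.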

#4 Oleg Bartunov
oleg@sai.msu.su
In reply to: Jens Sauer (#3)
Re: fulltext search and hunspell

Jens,

have you tried the German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

Oleg

#5 Jens Sauer
jsauer65@googlemail.com
In reply to: Oleg Bartunov (#4)
Re: fulltext search and hunspell

Thanks for this tip,
the German compound dictionary from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ works fine.
I think the problem was the rudimentary support of hunspell dictionaries.

Thanks for your help and your great software!
