Hunspell as filtering dictionary
Hi,
I am trying to create a tsvector from a French text. Here are the
operations that seem logical to perform, in this order:
1. remove stopwords
2. use hunspell to find word roots
3. unaccent
I first tried:
CREATE TEXT SEARCH CONFIGURATION fr_conf (copy='simple');
ALTER TEXT SEARCH CONFIGURATION fr_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH unaccent, french_hunspell;
select * from to_tsvector('fr_conf', E'Pour découvrir et rencontrer
l\'aventure.');
-- 'aventure':5 'aventurer':5 'rencontrer':3
But the verb "découvrir" is missing :(
If I try with french_hunspell only, I get it, but with the accent:
select * from to_tsvector('french_hunspell', E'Pour découvrir et rencontrer
l\'aventure.');
-- 'aventure':6 'aventurer':6 'découvrir':2 'rencontrer':4
I also tried:
CREATE TEXT SEARCH CONFIGURATION fr_conf2 (copy='simple');
ALTER TEXT SEARCH CONFIGURATION fr_conf2
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH french_hunspell, unaccent;
select * from to_tsvector('fr_conf2', E'Pour découvrir et rencontrer
l\'aventure.');
-- 'aventure':5 'aventurer':5 'rencontrer':3
But I guess unaccent is never called.
I believe this is because french_hunspell is not a filtering dictionary,
but I might be wrong. So is there a way to get this result from any FTS
configuration (existing or :
-- 'aventure':6 'aventurer':6 'decouvrir':2 'rencontrer':4
Thanks,
Bertrand
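A minimal diagnostic sketch for the question above, assuming the unaccent extension is installed and that the french_hunspell dictionary and the fr_conf configuration exist as defined in the message. It only uses the built-in ts_lexize() and ts_debug() functions; no output is shown here because it depends on the installed dictionary files.

```sql
-- Check each dictionary in isolation. ts_lexize() returns NULL when the
-- dictionary does not recognize the token, and an empty array for a stopword.
SELECT ts_lexize('unaccent', 'découvrir');
SELECT ts_lexize('french_hunspell', 'découvrir');
SELECT ts_lexize('french_hunspell', 'decouvrir');

-- Show, for every token, the dictionary chain that was consulted, the
-- dictionary that finally recognized the token, and the lexemes it produced.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('fr_conf', 'Pour découvrir et rencontrer l''aventure.');
```

If the third ts_lexize() call returns NULL, that would be consistent with the word disappearing in fr_conf: unaccent is a filtering dictionary and hands 'decouvrir' to the next dictionary, and a hunspell dictionary built from accented word lists may not know the unaccented form.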
On Tue, 5 Nov 2019 at 09:42, Bibi Mansione <golgote@gmail.com> wrote:
Hi,
I am trying to create a ts_vector from a French text. Here are the
operations that seem logical to perform in that order:
1. remove stopwords
2. use hunspell to find words roots
3. unaccent
I can't speak to French, but we use a similar configuration in English,
with unaccent first, then hunspell. We found that there were words that
hunspell didn't recognise, but instead pulled apart (for example,
"contract" became "con" and "tract"), so I wonder if something similar is
happening with "découvrir." To solve this, we put a custom dictionary with
these terms in front of hunspell. Unaccent definitely has to be called
first. We also modified hunspell with a custom stopwords file, to eliminate
select other terms, such as profanities:
-- We use a custom stopwords file, to filter out other terms, such as
-- profanities
ALTER TEXT SEARCH DICTIONARY
hunspell_en_ca (
Stopwords = our_custom_stopwords
);
-- Adding english_stem allows us to recognize words which hunspell
-- doesn't, particularly acronyms such as CGA
ALTER TEXT SEARCH CONFIGURATION
our_configuration
ALTER MAPPING FOR
asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH
unaccent, our_custom_dictionary, hunspell_en_ca, english_stem
;
There was definitely a fair bit of trial and error to determine the correct
order and configuration.
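The statements above assume that our_custom_dictionary and hunspell_en_ca already exist. The message does not say how they were created, so the following is only a sketch under stated assumptions: the synonym template for the custom dictionary and the file names our_custom_terms, en_ca and our_custom_stopwords are hypothetical, and the corresponding files would have to be present in the server's tsearch_data directory.

```sql
-- Hypothetical custom dictionary: a synonym dictionary that maps each
-- protected term to itself, so that it is recognized before hunspell can
-- split it. Assumes tsearch_data/our_custom_terms.syn with lines such as:
--   contract contract
CREATE TEXT SEARCH DICTIONARY our_custom_dictionary (
    TEMPLATE = synonym,
    SYNONYMS = our_custom_terms
);

-- Hunspell files are loaded through the ispell template. Assumes
-- en_ca.dict, en_ca.affix and our_custom_stopwords.stop in tsearch_data.
-- Skip this statement if the dictionary has already been created.
CREATE TEXT SEARCH DICTIONARY hunspell_en_ca (
    TEMPLATE = ispell,
    DictFile = en_ca,
    AffFile = en_ca,
    StopWords = our_custom_stopwords
);

-- Quick check that the custom dictionary recognizes a protected term.
SELECT ts_lexize('our_custom_dictionary', 'contract');
```

Because a token stops at the first non-filtering dictionary that recognizes it, a term listed in the custom dictionary never reaches hunspell_en_ca.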
Thanks. The problem is that the hunspell dictionary doesn't work with
unaccent, so it is effectively useless for languages with accents. Having
to rely on stemming for words with accents is only a partial solution, and
not the right one.
Besides, the results returned by the hunspell implementation in PostgreSQL
are incorrect. As you mentioned, it shouldn't return "con" and "tract" for
"contract". I also noticed many other odd results with other words in
French. There might be a bug in the code.
I ended up using ts_debug() with a simple stopword file in my own
tokenizer, written in pllua, which calls libhunspell directly using LuaJIT
and FFI. I also wrote my own unaccent in Lua using the unaccent extension's
rules. Indexing French text is now twice as fast and gives much better
results. The tokenizer produces a tsvector. Words returned by libhunspell's
stem() function get the lower weight D and keep the same position as the
original word.
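To illustrate the shape of such a tsvector, here is a small sketch using a hand-written literal. The tokenizer itself is not shown in this thread, so the lexemes and the choice of weight A for the original words are assumptions made purely for illustration; only the weight D for stems and the shared position come from the description above.

```sql
-- Hypothetical result: original words at weight A, the stem returned by
-- libhunspell at weight D, sharing position 6 with the original word.
WITH doc AS (
    SELECT 'decouvrir:2A rencontrer:4A aventure:6A aventurer:6D'::tsvector AS vec
)
SELECT vec,
       vec @@ 'aventurer'::tsquery          AS stem_matches,
       ts_rank(vec, 'aventure'::tsquery)    AS rank_original,
       ts_rank(vec, 'aventurer'::tsquery)   AS rank_stem
FROM doc;
```

With the default ts_rank() weights, a match on a D-weighted lexeme contributes less than a match on an A-weighted one, so the stems widen recall without outranking the original words.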
My conclusion is that hunspell in Postgres is useless, for me at least,
because it should be a filtering dictionary and because it produces strange
results that pollute the original text.
I also think that the current implementation of TEXT SEARCH configuration
is not usable for serious purposes. It is too limited. Solr configuration,
while more complex, does a much better job.
On Wed, 6 Nov 2019 at 16:50, Hugh Ranalli <hugh@whtc.ca> wrote: