Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

Started by Mohamed · over 17 years ago · 11 messages · pgsql-general
#1 Mohamed
mohamed5432154321@gmail.com

Hi!
I have just made the switch from MySQL to PostgreSQL to be able to take
advantage of TSearch2, and I just arrived on the mailing list :) I am
creating a website in two languages (English, Arabic) and would like to
have dictionaries for both for my search. I noticed that Arabic isn't
included by default. How can I add it / make it work?

I am currently saving the English and Arabic text in the same relation. So I
guess I could create two indexes over the two dictionaries, or should I split
them?

/ Moe

#2 Sam Mason
sam@samason.me.uk
In reply to: Mohamed (#1)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

On Mon, Jan 05, 2009 at 02:49:51PM +0100, Mohamed wrote:

> I have just made the switch from MySQL to PostgreSQL to be able to take
> advantage of TSearch2, and I just arrived on the mailing list :) I am
> creating a website in two languages (English, Arabic) and would like to
> have dictionaries for both for my search. I noticed that Arabic isn't
> included by default. How can I add it / make it work?

Not sure about adding different dictionaries; but there was a discussion
about multiple languages in the same relation a month ago:

http://archives.postgresql.org/pgsql-general/2008-11/msg00340.php

If you don't get any more pointers, the following page documents PG's
support for full text searching:

http://www.postgresql.org/docs/current/static/textsearch.html

Sam

#3 Oleg Bartunov
oleg@sai.msu.su
In reply to: Mohamed (#1)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

On Mon, 5 Jan 2009, Mohamed wrote:

> Hi!
> I have just made the switch from MySQL to PostgreSQL to be able to take
> advantage of TSearch2, and I just arrived on the mailing list :) I am
> creating a website in two languages (English, Arabic) and would like to
> have dictionaries for both for my search. I noticed that Arabic isn't
> included by default. How can I add it / make it work?

Read the documentation:
http://www.postgresql.org/docs/8.3/static/textsearch.html
What dictionaries do you have for Arabic?

> I am currently saving the English and Arabic text in the same relation. So I
> guess I could create two indexes over the two dictionaries, or should I split
> them?

If Arabic and English characters do not overlap, you can use one index.
For example, we have one index for English/Russian text.
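A minimal sketch of this single-index approach (the table, index, and the 'hunarconfig' configuration names here are only illustrative; 'hunarconfig' is the configuration built later in this thread):

```sql
-- Hypothetical sketch: one expression index covering mixed
-- Arabic/English text, assuming a single text search configuration
-- whose dictionary stack handles both scripts.
CREATE TABLE documents (
    id   serial PRIMARY KEY,
    body text
);

CREATE INDEX documents_body_idx ON documents
    USING gin (to_tsvector('hunarconfig', body));

-- Queries must spell out the same configuration to use the index:
SELECT id
  FROM documents
 WHERE to_tsvector('hunarconfig', body)
       @@ plainto_tsquery('hunarconfig', 'some words');
```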

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#4 Mohamed
mohamed5432154321@gmail.com
In reply to: Mohamed (#1)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

There was an error and the messages I wrote didn't end up here, so I will
repost a couple... :)
-----------------------------------------------------------------------------------------------------------------------------------------------

Thank you, Oleg. I am reading that guide. It's a little too much at one time,
and I am getting a little confused. I don't have any dictionary yet, but I
just found a Hunspell dictionary for Arabic:

http://qa.debian.org/developer.php?login=msameer%40debian.org

Now I see Sam wrote too... :)

I hope you guys will still be around to help me set it up when I have
finished my reading? I have seen earlier posts from you, Oleg, about
dictionaries, so I suppose you know it pretty well. (Not to mention the big
PDF file you have written!)
I am working on my development laptop now, and I'd prefer not to have to
repeat all the work later on. Is there a way of avoiding this? To only have
to do these builds once, or to create some kind of a batch file and run it on
the production server later? (I am afraid I won't remember what I did in
order to repeat it.)

By the way, do you know if it is possible to combine tsearch with Hibernate
(HQL), or will I just have to do it all in SQL?

The more dictionaries I use the better? Or should I just choose and use only
one to build my lexemes and stopwords (etc.)?

Oleg:

> We usually use an {ispell, stemmer} dictionary stack. If you don't have a
> stemmer for Arabic, just use the simple dictionary, so if the ispell dict
> doesn't recognize a word, it will be recognized by the simple dict and that
> word will be indexed.

What do you mean by a simple dictionary? Does that come with Postgres? Is
it possible to do the same with {hunspell, stemmer(simple?)}?
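For reference, the pg_catalog.simple dictionary does ship with PostgreSQL; a stack that falls back to it can be sketched like this (the configuration and dictionary names here are hypothetical):

```sql
-- Hypothetical sketch of an {ispell-style, simple} dictionary stack.
-- Tokens the hunspell/ispell dictionary recognizes are normalized by
-- it; anything it does not recognize falls through to 'simple', which
-- lowercases the token and indexes it as-is.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR word, hword, hword_part
    WITH my_hunspell_dict, simple;
```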

/ Moe

#5 Mohamed
mohamed5432154321@gmail.com
In reply to: Mohamed (#1)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

OK, thank you all for your help. It has been very valuable. I am starting to
get the hang of it and have almost read the whole of chapter 12 + extras, but
I still need a little bit of guidance.

I have now these files:

- An Arabic Hunspell rar file (OpenOffice version) which includes:
  - ar.dic
  - ar.aff
- An Aspell rar file that includes a lot of files
- A MySpell (says simple word list)
- And also Andrew's two files:
  - ar.affix
  - ar.stop

I am thinking that I should go with just one of these, right, and that should
be the Hunspell? There is an ar.aff file there and Andrew's file ends with
.affix; are those perhaps similar? Should I skip Andrew's? Use just the
ar.stop file?

As for the per-row Arabic/English language search approach, I will skip that
and choose the approach suggested by Oleg:

> If Arabic and English characters do not overlap, you can use one index.

The Arabic letters and English letters or words don't overlap, so that should
not be an issue? Will I be able to index and search against both languages
in the same query?

And also

1. What language files should I use?
2. How should my CREATE DICTIONARY statement for the Arabic language look?
Perhaps like this:

CREATE TEXT SEARCH DICTIONARY arabic_dic (
    TEMPLATE  = ?,   -- Not sure what this means
    DictFile  = ar,  -- referring to ar.dic (hunspell)
    AffFile   = ar,  -- referring to ar.aff (hunspell)
    StopWords = ar   -- referring to Andrew's stop file (what about
                     -- Andrew's .affix file?)
    -- Anything more?
);

Thanks again! / Moe

#6 Mohamed
mohamed5432154321@gmail.com
In reply to: Mohamed (#5)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

no one ?

/ Moe


#7 Andrew
archa@pacific.net.au
In reply to: Mohamed (#6)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

Hi Mohamed,

See my answers below, and hopefully they won't lead you too far astray.
Note though, it has been a long time since I have done this and there
are doubtless more knowledgeable people in this forum who will be able
to correct anything I say that may be misleading or incorrect.

Cheers,

Andy

Mohamed wrote:

> Ok, thank you all for your help. It has been very valuable. I am
> starting to get the hang of it and have almost read the whole of
> chapter 12 + extras, but I still need a little bit of guidance.
>
> I have now these files:
>
> - An Arabic Hunspell rar file (OpenOffice version) which includes:
>   - ar.dic
>   - ar.aff
> - An Aspell rar file that includes a lot of files
> - A MySpell (says simple word list)
> - And also Andrew's two files:
>   - ar.affix
>   - ar.stop
>
> I am thinking that I should go with just one of these, right, and
> that should be the Hunspell?

Hunspell is based on MySpell, extending it with support for complex
compound words and Unicode characters; however, PostgreSQL cannot take
advantage of Hunspell's compound word capabilities at present. Aspell
is a GNU dictionary that replaces Ispell and supports UTF-8 characters.
See http://aspell.net/test/ for comparisons between dictionaries, though
be aware this test is hosted by Aspell... I will leave it to others to
argue the merits of Hunspell vs. Aspell, and why you would choose one or
the other.

> There is an ar.aff file there and Andrew's file ends with .affix;
> are those perhaps similar? Should I skip Andrew's?

The ar.aff file that comes with OpenOffice Hunspell dictionary is
essentially the same as the ar.affix I supplied. Just open the two up,
compare them and choose the one that you feel is best. A Hunspell
dictionary will work better with a corresponding affix file.

> Use just the ar.stop file?

The ar.stop file keeps common words from being indexed. You will want a
stop file as well as the dictionary and affix file. Feel free to modify
the stop file to meet your own needs.

> As for the per-row Arabic/English language search approach, I will
> skip that and choose the approach suggested by Oleg:
>
> > If Arabic and English characters do not overlap, you can use one
> > index.
>
> The Arabic letters and English letters or words don't overlap, so
> that should not be an issue? Will I be able to index and search
> against both languages in the same query?

If you want to support multiple language dictionaries for a single
table, with each row associated to its own dictionary, use the
tsvector_update_trigger_column trigger to automatically update your
tsvector indexed column on insert or update. To support this, your
table will need an additional column of type regconfig that contains the
name of the dictionary to use when searching on the tsvector column for
that particular row. See
http://www.postgresql.org/docs/current/static/textsearch-features.html#TEXTSEARCH-UPDATE-TRIGGERS
for more details. This will allow you to search across both languages
in the one query as you were asking.
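A sketch of that per-row setup (the table and column names here are hypothetical; tsvector_update_trigger_column is the built-in trigger function referred to above):

```sql
-- Each row carries its own text search configuration in a regconfig
-- column; the built-in trigger keeps the tsvector column up to date
-- on INSERT and UPDATE using that row's configuration.
CREATE TABLE messages (
    id       serial PRIMARY KEY,
    config   regconfig NOT NULL,  -- e.g. 'english' or an Arabic config
    body     text,
    body_tsv tsvector
);

CREATE TRIGGER messages_tsv_update
    BEFORE INSERT OR UPDATE ON messages
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger_column(body_tsv, config, body);

CREATE INDEX messages_tsv_idx ON messages USING gin (body_tsv);
```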

> And also
>
> 1. What language files should I use?
> 2. How should my CREATE DICTIONARY statement for the Arabic language
>    look? Perhaps like this:
>
> CREATE TEXT SEARCH DICTIONARY arabic_dic (
>     TEMPLATE  = ?,   -- Not sure what this means
>     DictFile  = ar,  -- referring to ar.dic (hunspell)
>     AffFile   = ar,  -- referring to ar.aff (hunspell)
>     StopWords = ar   -- referring to Andrew's stop file (what about
>                      -- Andrew's .affix file?)
>     -- Anything more?
> );

From the psql command line you can find out what templates you have using
the following command:

\dFt

or by looking at the contents of the pg_ts_template catalog.

If choosing a Hunspell or Aspell dictionary, I believe a value of
TEMPLATE = ispell should be okay for you; see
http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY.
The template tells PostgreSQL how to interact with the dictionary. The
rest of the CREATE DICTIONARY statement appears fine to me.
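Putting those pointers together, the filled-in statement might look like this (a sketch only; it assumes the Hunspell files are renamed and copied into PostgreSQL's shared tsearch_data directory, where the server looks for .dict, .affix and .stop extensions):

```sql
-- Sketch: ispell template with the OpenOffice Hunspell files installed
-- as $SHAREDIR/tsearch_data/ar.dict, ar.affix and ar.stop.
CREATE TEXT SEARCH DICTIONARY arabic_dic (
    TEMPLATE  = ispell,
    DictFile  = ar,  -- $SHAREDIR/tsearch_data/ar.dict
    AffFile   = ar,  -- $SHAREDIR/tsearch_data/ar.affix
    StopWords = ar   -- $SHAREDIR/tsearch_data/ar.stop
);
```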


#8 Mohamed
mohamed5432154321@gmail.com
In reply to: Andrew (#7)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

Thank you for your detailed answer. I have learned a lot more about this
stuff now :)
As I see it, according to the results it's between Hunspell and Aspell. My
Aspell version is 0.6, released in 2006. The Hunspell was released in 2008.

When I run the Postgres command \dFt I get the following list :

- ispell
- simple
- snowball
- synonym
- thesaurus

So I'll set up my dictionary with ispell as the template and the
Hunspell/Aspell files. Now I just have one decision to make :)

Just another thing:

> If you want to support multiple language dictionaries for a single table,
> with each row associated to its own dictionary

Not really; since the two languages don't overlap, couldn't I set up two
separate dictionaries and index against both on the whole table? I think
that's what Oleg was referring to. Not sure...

Thanks for all the help / Moe

PS. I can't read Arabic so I can't have a look at the files to decide :O


#9 Mohamed
mohamed5432154321@gmail.com
In reply to: Mohamed (#8)

Thank you for your detailed answer. I have learned a lot more about this
stuff now :)
As I see it, according to the results it's between Hunspell and Aspell. My
Aspell version is 0.6, released in 2006. The Hunspell was released in 2008.

When I run the Postgres command \dFt I get the following list :

- ispell
- simple
- snowball
- synonym
- thesaurus

So I'll set up my dictionary with ispell as the template and the
Hunspell/Aspell files. Now I just have one decision to make :)

Just another thing:

> If you want to support multiple language dictionaries for a single table,
> with each row associated to its own dictionary

Not really; since the two languages don't overlap, couldn't I set up two
separate dictionaries and index against both on the whole table? I think
that's what Oleg was referring to. Not sure...

Thanks for all the help / Moe

PS. I can't read Arabic so I can't have a look at the files to decide :O


#10 Andrew
archa@pacific.net.au
In reply to: Mohamed (#9)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

Mohamed wrote:

> Thank you for your detailed answer. I have learned a lot more about
> this stuff now :)

You're welcome :-)

> As I see it, according to the results it's between Hunspell and
> Aspell. My Aspell version is 0.6, released in 2006. The Hunspell was
> released in 2008.
>
> When I run the Postgres command \dFt I get the following list:
>
> - ispell
> - simple
> - snowball
> - synonym
> - thesaurus
>
> So I'll set up my dictionary with ispell as the template and the
> Hunspell/Aspell files. Now I just have one decision to make :)
>
> Just another thing:
>
> > If you want to support multiple language dictionaries for a single
> > table, with each row associated to its own dictionary
>
> Not really; since the two languages don't overlap, couldn't I set up
> two separate dictionaries and index against both on the whole table?
> I think that's what Oleg was referring to. Not sure...

Neither am I, so when in doubt, try it out. And let us know the results.

> Thanks for all the help / Moe
>
> PS. I can't read Arabic so I can't have a look at the files to decide :O

In which case, assuming you do not have access to a friend who is able
to read Arabic, either choose the file with the most entries (making the
assumption that more is better), take the one that came with the
dictionary (assuming those two will be best matched), or, if you still
can't decide, flip a coin. As you can't read Arabic, you are not in a
position to put both files through their paces against a word list and
pick the one that gives the best results for the type of words your
text is likely to contain.

Cheers,

Andy

#11 Mohamed
mohamed5432154321@gmail.com
In reply to: Andrew (#10)
Re: Adding Arabic dictionary for TSearch2.. to_tsvector('arabic'...) doesn't work..

I finally got around to building a configuration, but the results are not
good at all and a bit odd.

Here is what I did:

I built the configuration with the Hunspell files + an Arabic simple
dictionary (with just the stop words as input), because I noticed that words
that are not recognized still get passed back.

I removed this line from the affix file:

Flag long

CREATE TEXT SEARCH DICTIONARY hunar (
    TEMPLATE  = ispell,
    DictFile  = hunar,
    AffFile   = hunar,
    StopWords = ar
);

CREATE TEXT SEARCH DICTIONARY ar_simple (
    TEMPLATE  = pg_catalog.simple,  -- Not sure what this is or does
    STOPWORDS = ar
);

CREATE TEXT SEARCH CONFIGURATION hunarconfig ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION hunarconfig
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH hunar, ar_simple;

Running :

SELECT * FROM ts_debug('hunarconfig', '
وفي هذا الإطار أجرى رئيس الوزراء القطري الشيخ حمد بن جاسم بن جبر آل ثاني
محادثات في لندن مع نظيره البريطاني غوردون براون تناولت الأوضاع الأمنية في
الشرق الأوسط وتطرقت المباحثات إلى سبل تثبيت وقف إطلاق النار في قطاع غزة
وعملية إعادة إعمار وبناء القطاع بعد الحرب الإسرائيلية الأخيرة.
');

returned odd results (I think). Not many words were recognized by the hunar
dictionary, and some stopwords were recognized by the latter dictionary,
ar_simple, even though the same stopwords file was used in the hunar
dictionary. Should I not expect the stopwords to be recognized by hunar
rather than ar_simple?

Here is a small sample that shows what I mean (with comments):

"وفي"; "{hunar,ar_simple}"; "hunar"; "{}"
-- a recognized stop word, handled by the hunar dictionary

"هذا"; "{hunar,ar_simple}"; "ar_simple"; "{}"
-- a recognized stop word, but handled by ar_simple? WHY?

"أجرى"; "{hunar,ar_simple}"; "ar_simple"; "{أجرى}"
-- not recognized by either dictionary, returned as-is

Is this not strange? Shouldn't the first dictionary (hunar) catch the
recognized stopwords, rather than ar_simple?
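One way to pin down which dictionary is handling a given token is to probe each dictionary in isolation with the built-in ts_lexize function (the dictionary names are the ones created above):

```sql
-- ts_lexize returns NULL when the dictionary does not recognize the
-- token, an empty array when it treats it as a stop word, and an
-- array of lexemes when it normalizes it.
SELECT ts_lexize('hunar', 'وفي');
SELECT ts_lexize('ar_simple', 'وفي');
```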

/ Moe
