Fulltext search configuration
I have run into some problems here.
I am trying to implement Arabic full-text search on three columns.
To create a dictionary I have a hunspell dictionary and an Arabic stop-word
file.
CREATE TEXT SEARCH DICTIONARY hunspell_dic (
TEMPLATE = ispell,
DictFile = hunarabic,
AffFile = hunarabic,
StopWords = arabic
);
1) The problem is that the hunspell package contains a .dic and a .aff file, but
the configuration requires a .dict and a .affix file. I have tried changing the
file extensions, but with no success.
2) ts_lexize('hunspell_dic', 'ARABIC WORD') returns nothing.
3) How can I convert my .dic and .aff files to valid .dict and .affix files?
4) I have read that when using dictionaries, if a word is not recognized by
any dictionary it will not be indexed. I find that troublesome. I would like
everything but the stop words to be indexed. I guess this might be a step
that I am not ready for yet, but just wanted to put it out there.
I would also like to know what the whole full-text search implementation
looks like, from configuration to search:
create a dictionary, then a text search configuration, add the dictionary to
the configuration, index the columns with GIN or GiST ...
What does a search look like? Does it match against the GIN/GiST index? Has
that index been built using the dictionary/configuration, or is the
dictionary only applied to search phrases?
/ Moe
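For readers following along, the steps listed in the message above can be sketched roughly as below. This is an illustration, not something posted in the thread: the configuration name arabic_hunspell, the table ads and its columns are hypothetical, and the dictionary definition is simply the one from the message. As far as the last question goes, to_tsvector() applies the configuration (and so the dictionary) when the index is built, and plainto_tsquery() applies the same configuration to the search phrase, so both sides are normalized the same way.

```sql
-- Sketch only. Hypothetical names: arabic_hunspell (configuration),
-- ads (table), title/summary/body (the three text columns).

-- 1. Dictionary, as defined in the message above. The files are looked up
--    as hunarabic.dict, hunarabic.affix and arabic.stop under
--    $SHAREDIR/tsearch_data.
CREATE TEXT SEARCH DICTIONARY hunspell_dic (
    TEMPLATE  = ispell,
    DictFile  = hunarabic,
    AffFile   = hunarabic,
    StopWords = arabic
);

-- 2. Configuration, started as a copy of the built-in "simple" one.
CREATE TEXT SEARCH CONFIGURATION arabic_hunspell (COPY = simple);

-- 3. Attach the dictionary to the word-like token types.
--    With only this dictionary listed, words it does not recognize
--    are dropped rather than indexed.
ALTER TEXT SEARCH CONFIGURATION arabic_hunspell
    ALTER MAPPING FOR word, hword, hword_part
    WITH hunspell_dic;

-- 4. Table and GIN index over the three columns. The index stores the
--    lexemes produced by the configuration, not the raw words.
CREATE TABLE ads (
    id      serial PRIMARY KEY,
    title   text,
    summary text,
    body    text
);

CREATE INDEX ads_fts_idx ON ads USING gin (
    to_tsvector('arabic_hunspell',
                coalesce(title, '')   || ' ' ||
                coalesce(summary, '') || ' ' ||
                coalesce(body, ''))
);

-- 5. Search. The expression must match the indexed expression exactly
--    for the planner to be able to use the index.
SELECT id, title
FROM ads
WHERE to_tsvector('arabic_hunspell',
                  coalesce(title, '')   || ' ' ||
                  coalesce(summary, '') || ' ' ||
                  coalesce(body, ''))
      @@ plainto_tsquery('arabic_hunspell', 'search words here');
```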
Hi Mohamed.
I don't know where you got the dictionary. I tried the OpenOffice one (the
Ayaspell one) myself, without success, and I had no Arabic stopwords file.
Renaming the files is supposed to be enough (I did it successfully for a
Thai dictionary): the ".aff" file becomes the ".affix" one.
When I tried to create the dictionary:
CREATE TEXT SEARCH DICTIONARY ar_ispell (
TEMPLATE = ispell,
DictFile = ar_utf8,
AffFile = ar_utf8,
StopWords = english
);
I had an error:
ERREUR: mauvais format de fichier affixe pour le drapeau
CONTEXTE : ligne 42 du fichier de configuration «
/usr/share/pgsql/tsearch_data/ar_utf8.affix » : « PFX Aa Y 40
(which means: bad format of affix file for flag, at line 42 of the
configuration file)
Do you have an error when creating your dictionary?
Daniel
No, I don't. But ts_lexize doesn't return anything, so I figured there must
be an error somewhere.
I think we are using the same dictionary, except that I am using the stopwords
file and a different affix file, because using the hunspell (ayaspell) .aff
gives me this error:
ERROR: wrong affix file format for flag
CONTEXT: line 42 of configuration file "C:/Program
Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40
/ Moe
Mohamed,
We are looking into the problem.
Oleg
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Ok, thank you Oleg.
I have another dictionary package, which is also a conversion to hunspell:
http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
(Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08
Running that gives me this error (again from the affix file):
ERROR: wrong affix file format for flag
CONTEXT: line 560 of configuration file "C:/Program
Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix": "PFX 1013
Y 6
"
/ Moe
Oleg, as I mentioned earlier, I have a different .affix file that I got
from Andrew along with the stop file. I get no errors creating the dictionary
with that one, but I get nothing out of ts_lexize.
The size of that one is: 406,219 bytes
And the size of the hunspell one (the first): 406,229 bytes
A little too close, don't you think?
It might be that the arabic hunspell (ayaspell) affix file is damaged on
some lines and I got the fixed one from Andrew.
Just wanted to let you know.
/ Moe
Mohamed,
comment out the FLAG line in ar.affix, so that it reads
#FLAG long
and creation of the ispell dictionary will work.
This is a temporary solution.
Teodor is working on fixing automatic recognition of the affix format.
I can't say anything about testing, since somebody has to provide
a first test case. I don't know how to type Arabic :)
Oleg
Hehe, ok..
I don't know either, but I took some lines from Al-Jazeera:
http://aljazeera.net/portal
I just made the change you suggested, created the dictionary successfully, and
tried this:
select ts_lexize('ayaspell', 'استشهد فلسطيني وأصيب ثلاثة في غارة إسرائيلية
جديدة')
but I got nothing... :(
Is there a way of making sure that words that are not recognized also get
indexed/searched for? (Not that I think this is the problem.)
/ Moe
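One thing worth checking with a call like the one above: ts_lexize() expects a single token, not a whole text, so a full sentence is treated as one unknown word. The sketch below is not from the thread; it assumes the ayaspell dictionary already exists and uses the built-in simple configuration only to split the sentence into tokens, which are then passed to the dictionary one at a time.

```sql
-- Sketch only: assumes the "ayaspell" ispell dictionary from this thread
-- has already been created.
-- ts_debug() with the built-in "simple" configuration is used here purely
-- as a tokenizer; each non-blank token is then looked up on its own.
SELECT token,
       ts_lexize('ayaspell', token) AS lexemes
FROM ts_debug('simple',
              'استشهد فلسطيني وأصيب ثلاثة في غارة إسرائيلية جديدة')
WHERE alias <> 'blank';
```

In the result, a NULL in the lexemes column means the dictionary does not know the word, and an empty array means the word was treated as a stop word.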
On Mon, 2 Feb 2009, Mohamed wrote:
"I just made the change you said and created it successfully and tried this:
select ts_lexize('ayaspell', 'استشهد فلسطيني وأصيب ثلاثة في غارة إسرائيلية جديدة')
but I got nothing... :("
Mohamed, what did you expect from ts_lexize? Please give us useful
information, otherwise we can't help you.
"Is there a way of making sure that words not recognized also get
indexed/searched for? (Not that I think this is the problem)"
Yes.
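One common way to get that behaviour, sketched here as an illustration rather than taken from the thread, is to list the built-in simple dictionary after the ispell dictionary in the configuration mapping. The simple dictionary accepts any token, so words the ispell dictionary does not recognize are still indexed, while words the ispell dictionary reports as stop words are discarded before the fallback is consulted. The configuration name arabic_cfg is hypothetical, and the ayaspell dictionary is assumed to exist already.

```sql
-- Sketch only. Assumes the "ayaspell" dictionary from this thread exists;
-- "arabic_cfg" is a hypothetical configuration name.
CREATE TEXT SEARCH CONFIGURATION arabic_cfg (COPY = simple);

-- Dictionaries are consulted left to right. "simple" is listed last
-- because it recognizes every token and would hide anything after it.
ALTER TEXT SEARCH CONFIGURATION arabic_cfg
    ALTER MAPPING FOR word, hword, hword_part
    WITH ayaspell, simple;

-- Check which dictionary ended up handling each token.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('arabic_cfg', 'استشهد فلسطيني وأصيب ثلاثة');
```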
On Mon, Feb 2, 2009 at 3:50 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
Mohamed,
comment line in ar.affix
#FLAG long
and creation of ispell dictionary will work. This is temp, solution. Teodor
is working on fixing affix autorecognizing.I can't say anything about testing, since somebody should provide
first test case. I don't know how to type arabic :)Oleg
On Mon, 2 Feb 2009, Mohamed wrote:
Oleg, like I mentioned earlier. I have a different .affix file that I got
from Andrew with the stop file and I get no errors creating the dictionary
using that one but I get nothing out from ts_lexize.
The size on that one is : 406,219 bytes
And the size on the hunspell one (first) : 406,229 bytesLittle to close, don't you think ?
It might be that the arabic hunspell (ayaspell) affix file is damaged on
some lines and I got the fixed one from Andrew.Just wanted to let you know.
/ Moe
On Mon, Feb 2, 2009 at 3:25 PM, Mohamed <mohamed5432154321@gmail.com>
wrote:Ok, thank you Oleg.
I have another dictionary package which is a conversion to hunspell
aswell:http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
(Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08And running that gives me this error : (again the affix file)
ERROR: wrong affix file format for flag
CONTEXT: line 560 of configuration file "C:/Program
Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix": "PFX
1013
Y 6
"/ Moe
On Mon, Feb 2, 2009 at 2:41 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
Mohamed,
We are looking on the problem.
Oleg
On Mon, 2 Feb 2009, Mohamed wrote:
No, I don't. But the ts_lexize don't return anything so I figured there
must
be an error somehow.
I think we are using the same dictionary + that I am using the
stopwords
file and a different affix file, because using the hunspell (ayaspell)
.aff
gives me this error :ERROR: wrong affix file format for flag
CONTEXT: line 42 of configuration file "C:/Program
Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40/ Moe
On Mon, Feb 2, 2009 at 12:13 PM, Daniel Chiaramello <
daniel.chiaramello@golog.net> wrote:Hi Mohamed.
I don't know where you get the dictionary - I unsuccessfully tried the
OpenOffice one by myself (the Ayaspell one), and I had no arabic
stopwords
file.Renaming the file is supposed to be enough (I did it successfully for
Thailandese dictionary) - the ".aff'" file becoming the ".affix" one.
When I tried to create the dictionary:CREATE TEXT SEARCH DICTIONARY ar_ispell (
TEMPLATE = ispell,
DictFile = ar_utf8,
AffFile = ar_utf8,
StopWords = english
);I had an error:
ERREUR: mauvais format de fichier affixe pour le drapeau
CONTEXTE : ligne 42 du fichier de configuration ?
/usr/share/pgsql/tsearch_data/ar_utf8.affix ? : ? PFX Aa Y
40(which means Bad format of Affix file for flag, line 42 of
configuration
file)Do you have an error when creating your dictionary?
Daniel
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Mon, 2 Feb 2009, Oleg Bartunov wrote:
On Mon, 2 Feb 2009, Mohamed wrote:
Hehe, ok..
I don't know either but I took some lines from Al-Jazeera:
http://aljazeera.net/portal
Just made the change you said, created it successfully, and tried this:
select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ????????? ?????')
but I got nothing... :(
Mohamed, what did you expect from ts_lexize? Please provide us with useful
information, otherwise we can't help you.
Is there a way of making sure that words not recognized also get
indexed/searched for? (Not that I think this is the problem)
yes
Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
"A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token type that
the parser can return, a separate list of dictionaries is specified by the
configuration. When a token of that type is found by the parser, each
dictionary in the list is consulted in turn, until some dictionary recognizes
it as a known word. If it is identified as a stop word, or if no dictionary
recognizes the token, it will be discarded and not indexed or searched for.
The general rule for configuring a list of dictionaries is to place first
the most narrow, most specific dictionary, then the more general dictionaries,
finishing with a very general dictionary, like a Snowball stemmer or simple,
which recognizes everything."
quick example:
CREATE TEXT SEARCH CONFIGURATION arabic (
COPY = english
);
=# \dF+ arabic
Text search configuration "public.arabic"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | english_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | english_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | english_stem
Then you can alter this configuration.
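One way such an alteration might look, as a minimal sketch only. It assumes the ar_ispell dictionary from earlier in the thread was created successfully, and it assumes that the default parser reports Arabic words under the non-ASCII token types (word, hword, hword_part), which can depend on the database encoding and locale. The built-in simple dictionary is placed last as the catch-all that the quoted documentation recommends.

```sql
-- Assumption: the ar_ispell dictionary from earlier in the thread exists.
-- Route non-ASCII word tokens through ar_ispell first; any token it does
-- not recognize falls through to simple, which accepts every token
-- (lower-cased), so unknown words are still indexed.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR word, hword, hword_part
    WITH ar_ispell, simple;

-- To inspect the result in psql, rerun the meta-command shown above:
-- \dF+ arabic
```

The ASCII token types (asciiword and friends) are left on english_stem here, so English words in the same column would still be stemmed.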
/ Moe
On Mon, Feb 2, 2009 at 3:50 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
Mohamed,
comment out the line in ar.affix, like this:
#FLAG long
and creation of the ispell dictionary will work. This is a temporary solution;
Teodor is working on fixing affix auto-recognition.
I can't say anything about testing, since somebody should provide a
test case first. I don't know how to type Arabic :)
Oleg
On Mon, 2 Feb 2009, Mohamed wrote:
Oleg, like I mentioned earlier, I have a different .affix file that I got
from Andrew along with the stop file, and I get no errors creating the
dictionary using that one, but I get nothing out of ts_lexize.
The size of that one is: 406,219 bytes
And the size of the hunspell one (first): 406,229 bytes
A little too close, don't you think?
It might be that the Arabic hunspell (ayaspell) affix file is damaged on
some lines and I got the fixed one from Andrew.
Just wanted to let you know.
/ Moe
On Mon, Feb 2, 2009 at 3:25 PM, Mohamed <mohamed5432154321@gmail.com> wrote:
Ok, thank you Oleg.
I have another dictionary package which is a conversion to hunspell as well:
http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
(Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08
And running that gives me this error (again the affix file):
ERROR: wrong affix file format for flag
CONTEXT: line 560 of configuration file
"C:/Program Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix": "PFX 1013 Y 6"
/ Moe
On Mon, Feb 2, 2009 at 2:41 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
Mohamed,
We are looking into the problem.
Oleg
On Mon, 2 Feb 2009, Mohamed wrote:
No, I don't. But ts_lexize doesn't return anything, so I figured there must
be an error somehow.
I think we are using the same dictionary, except that I am using the stopwords
file and a different affix file, because using the hunspell (ayaspell) .aff
gives me this error:
ERROR: wrong affix file format for flag
CONTEXT: line 42 of configuration file
"C:/Program Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40"
/ Moe
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Mon, Feb 2, 2009 at 4:34 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
Mohamed, what did you expect from ts_lexize? Please provide us with useful
information, otherwise we can't help you.
What I expected was something to be returned. After all, they are valid words
taken from an article (perhaps you don't see the words, but only ???...).
Am I wrong to expect something? Should I set up the configuration
completely first?
SELECT ts_lexize('norwegian_ispell',
'overbuljongterningpakkmesterassistent');
{over,buljong,terning,pakk,mester,assistent}
Check out this article if you need a sample.
http://www.aljazeera.net/NR/exeres/103CFC06-0195-47FD-A29F-2C84B5A15DD0.htm
Is there a way of making sure that words not recognized also get
indexed/searched for? (Not that I think this is the problem)
yes
Read
http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
Ok, but I don't have a thesaurus or a Snowball stemmer to fall back on. So when
a real word is for some reason not recognized, "it will be discarded
and not indexed or searched for", which I consider a problem since I don't
trust my configuration to cover everything.
Is this not a valid concern?
quick example:
CREATE TEXT SEARCH CONFIGURATION arabic (
COPY = english
);
Then you can alter this configuration.
Yes, I figured that's the next step, but thought I should get ts_lexize to
work first? What do you think?
Just a thought, say I have this:
ALTER TEXT SEARCH CONFIGURATION pg
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH pga_ardict, ar_ispell, ar_stem;
Is it possible to keep adding dictionaries, to get both Arabic and English
matches on the same column (Arabic speakers tend to mix), like this:
ALTER TEXT SEARCH CONFIGURATION pg
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH pga_ardict, ar_ispell, ar_stem, pg_english_dict, english_ispell,
english_stem;
Will something like that work ?
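A hedged sketch of one way to handle mixed Arabic and English text. A dictionary list stops at the first dictionary that recognizes a token, so anything listed after a catch-all dictionary is never consulted. Splitting the mapping by token type avoids that. The configuration name pg and the dictionary ar_ispell are taken from the messages above and are assumed to exist already; english_stem and simple are built in. The split also assumes that the default parser files Arabic words under the non-ASCII token types.

```sql
-- Assumption: configuration pg and dictionary ar_ispell already exist.

-- Tokens made only of ASCII letters: English stemmer.
ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH english_stem;

-- Tokens containing non-ASCII letters (Arabic among them): the Arabic
-- ispell dictionary first, then simple as a catch-all so that
-- unrecognized words are still indexed.
ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR word, hword, hword_part
    WITH ar_ispell, simple;
```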
/ Moe
Mohamed,
please, try to read docs and think a bit first.
On Mon, 2 Feb 2009, Mohamed wrote:
On Mon, Feb 2, 2009 at 4:34 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
On Mon, 2 Feb 2009, Oleg Bartunov wrote:
On Mon, 2 Feb 2009, Mohamed wrote:
Hehe, ok..
I don't know either but I took some lines from Al-Jazeera:
http://aljazeera.net/portal
Just made the change you said, created it successfully, and tried this:
select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ????????? ?????')
but I got nothing... :(
You did it wrong! ts_lexize expects a single word, not a phrase!
Mohamed, what did you expect from ts_lexize? Please provide us with useful
information, otherwise we can't help you.
What I expected was something to be returned. After all, they are valid words
taken from an article (perhaps you don't see the words, but only ???...).
Am I wrong to expect something? Should I set up the configuration
completely first?
You should definitely read documentation
http://www.postgresql.org/docs/8.3/static/textsearch-debugging.html#TEXTSEARCH-DICTIONARY-TESTING
Period.
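To illustrate the difference between a single token and a phrase, here is a small sketch that uses only the built-in english_stem dictionary and english configuration, so it should run on a stock installation. The sample words and sentence are placeholders, not taken from the thread.

```sql
-- ts_lexize tests one dictionary against one token.
SELECT ts_lexize('english_stem', 'stars');

-- For a phrase, let the parser split the text into tokens first.
-- ts_debug reports, per token, its type, the dictionaries consulted,
-- the dictionary that recognized it, and the lexemes produced.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'a fat cat sat on a mat');

-- to_tsvector runs the same pipeline and returns only the final lexemes.
SELECT to_tsvector('english', 'a fat cat sat on a mat');
```

The same ts_debug call with an Arabic configuration in place of english would show which token type each Arabic word gets and whether any dictionary recognizes it.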
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
A little harsh, are we? I have read the WHOLE documentation; it's a bit long,
so confusion might arise, and I am not familiar with PostgreSQL AT ALL, so the
confusion grows.
Perhaps I am an idiot and you don't like helping idiots or perhaps it's
something else? Which one is it?
If you don't want to help me, then DON'T ! Period.
The mailing list is not yours.
.
.
.
I have tried ts_lexize with words, lots of them, and I have yet to get
anything out of it!
/ Moe
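When ts_lexize appears to return nothing, it may help to tell apart the two cases the documentation describes: NULL means the dictionary does not know the word, while an empty array means the word is known but is a stop word. psql prints NULL as an empty string by default, so the two can look alike. A minimal sketch, assuming the ar_ispell dictionary from earlier in the thread; 'word' is a placeholder for a single Arabic token.

```sql
-- Assumption: ar_ispell exists; replace 'word' with one Arabic token.
SELECT ts_lexize('ar_ispell', 'word') IS NULL       AS unknown_to_dictionary,
       ts_lexize('ar_ispell', 'word') = '{}'::text[] AS is_stop_word;
```

In psql, \pset null '(null)' makes NULL results visible in ordinary output as well.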