Using a german affix file for compound words

Started by Wolfgang Winkler over 10 years ago | 6 messages | general
#1 Wolfgang Winkler
wolfgang.winkler@digital-concepts.com

Hi!

We have a problem importing a compound dictionary file for German.

I downloaded the files here:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz

and converted them to UTF-8 with iconv. The affix file looks fine when
opened in an editor.
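
An editor may guess the encoding on its own, so a file that displays correctly is not necessarily valid UTF-8. A minimal sketch of a stricter check follows. The helper name is_valid_utf8 and the two throwaway sample files are assumptions made for illustration, not something from the downloaded archive.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# It round-trips the file through iconv and discards the output.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Demo on two throwaway files containing the rule from the error message.
tmpdir=$(mktemp -d)
printf 'ABE > -ABE,\303\244BIN\n' > "$tmpdir/utf8.affix"    # umlaut as UTF-8 bytes C3 A4
printf 'ABE > -ABE,\344BIN\n'     > "$tmpdir/latin1.affix"  # umlaut as ISO-8859-1 byte E4

for f in "$tmpdir/utf8.affix" "$tmpdir/latin1.affix"; do
    if is_valid_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): NOT valid UTF-8"
    fi
done
rm -rf "$tmpdir"
```

If the real german.affix passes such a check, the file encoding itself is probably not what the parser is complaining about.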

When I try to create or alter a dictionary to use this affix file, I get
the following error:

alter TEXT SEARCH DICTIONARY german_ispell (
DictFile = german,
AffFile = german,
StopWords = german
);
ERROR: syntax error
CONTEXT: line 224 of configuration file
"/usr/local/pgsql/share/tsearch_data/german.affix": " ABE > -ABE,äBIN
"

This is the first occurrence of an umlaut character in the file. I've
found a few postings where the same file is used, e.g.:

/messages/by-id/556C1411.4010608@tbz-pariv.de

This user was able to import the file. Am I missing something obvious?

ww

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2 Oleg Bartunov
oleg@sai.msu.su
In reply to: Wolfgang Winkler (#1)
Re: Using a german affix file for compound words

On Thu, Jan 28, 2016 at 6:04 PM, Wolfgang Winkler <wolfgang.winkler@digital-concepts.com> wrote:

This user was able to import the file. Am I missing something
obvious?

Arthur Zakirov could help you.

#3 Arthur Zakirov
a.zakirov@postgrespro.ru
In reply to: Oleg Bartunov (#2)
Re: Using a german affix file for compound words

On 28.01.2016 18:57, Oleg Bartunov wrote:

On Thu, Jan 28, 2016 at 6:04 PM, Wolfgang Winkler <wolfgang.winkler@digital-concepts.com> wrote:

This user was able to import the file. Am I missing something
obvious?

What version of PostgreSQL do you use?

I tested this dictionary on PostgreSQL 9.4.5. I downloaded the files
from the link and ran these commands:

iconv -f ISO-8859-1 -t UTF-8 german.aff -o german2.affix
iconv -f ISO-8859-1 -t UTF-8 german.dict -o german2.dict

I renamed them to german.affix and german.dict and moved them to the
tsearch_data directory. The following commands ran without errors:

-> create text search dictionary german_ispell (
Template = ispell,
DictFile = german,
AffFile = german,
Stopwords = german
);
CREATE TEXT SEARCH DICTIONARY

-> select ts_lexize('german_ispell', 'test');
ts_lexize
-----------
{test}
(1 row)
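
The convert, rename, and move steps described in this message can be sketched as a single shell function. The function name convert_ispell_files and its argument order are assumptions made for illustration. Plain output redirection is used instead of the -o option, on the assumption that it is the more portable form across iconv implementations.

```shell
#!/bin/sh
# convert_ispell_files SRC_AFF SRC_DICT DEST_DIR NAME   (hypothetical helper)
# Converts both files from ISO-8859-1 to UTF-8 and writes them as
# DEST_DIR/NAME.affix and DEST_DIR/NAME.dict, the names that
# AffFile = NAME and DictFile = NAME refer to.
convert_ispell_files() {
    src_aff=$1
    src_dict=$2
    dest_dir=$3
    name=$4
    iconv -f ISO-8859-1 -t UTF-8 "$src_aff"  > "$dest_dir/$name.affix" &&
    iconv -f ISO-8859-1 -t UTF-8 "$src_dict" > "$dest_dir/$name.dict"
}

# Example call, left as a comment because it writes into the server's
# share directory (pg_config --sharedir prints that directory):
#   convert_ispell_files german.aff german.dict \
#       "$(pg_config --sharedir)/tsearch_data" german
```

Writing directly under the target names avoids the intermediate german2.* files and the separate rename step.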

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#4 Wolfgang Winkler
wolfgang.winkler@digital-concepts.com
In reply to: Arthur Zakirov (#3)
Re: Using a german affix file for compound words

I'm using 9.4.5 as well, and I used exactly the same iconv lines as you
posted below.

Are there any encoding options that have to be set right? The database
encoding is set to UTF8.

ww

On 2016-01-28 at 17:34, Artur Zakirov wrote:

What version of PostgreSQL do you use?

I tested this dictionary on PostgreSQL 9.4.5. I downloaded the files
from the link and ran these commands:

iconv -f ISO-8859-1 -t UTF-8 german.aff -o german2.affix
iconv -f ISO-8859-1 -t UTF-8 german.dict -o german2.dict

--

*Wolfgang Winkler*
Management
wolfgang.winkler@digital-concepts.com
mobile +43.699.19971172

dc:*büro*
digital concepts Novak Winkler OG
Software & Design
Landstraße 68, 5. Stock, 4020 Linz
www.digital-concepts.com <http://www.digital-concepts.com>
tel +43.732.997117.72
tel +43.699.1997117.2

Company register number: 192003h
Company register court: Landesgericht Linz

#5 Arthur Zakirov
a.zakirov@postgrespro.ru
In reply to: Wolfgang Winkler (#4)
Re: Using a german affix file for compound words

On 28.01.2016 20:36, Wolfgang Winkler wrote:

I'm using 9.4.5 as well, and I used exactly the same iconv lines as you
posted below.

Are there any encoding options that have to be set right? The database
encoding is set to UTF8.

ww

What output does this command show?

-> SHOW LC_CTYPE;

Did you try the dictionary from
http://extensions.openoffice.org/en/project/german-de-de-frami-dictionaries
?
You need to extract the de_DE_frami.aff and de_DE_frami.dic files from the
downloaded archive, rename them, and convert them to UTF-8.

#6 Wolfgang Winkler
wolfgang.winkler@digital-concepts.com
In reply to: Arthur Zakirov (#5)
Re: Using a german affix file for compound words

On 2016-01-29 at 10:21, Artur Zakirov wrote:

What output does this command show?

-> SHOW LC_CTYPE;

Did you try the dictionary from
http://extensions.openoffice.org/en/project/german-de-de-frami-dictionaries
?
You need to extract the de_DE_frami.aff and de_DE_frami.dic files from the
downloaded archive, rename them, and convert them to UTF-8.

I have now tried a fresh install of PostgreSQL 9.4.5 from the Debian
repositories, and everything worked fine.

test=# select to_tsvector('german_ispell','warenkorb');
to_tsvector
---------------------------------
'korb':1 'ware':1 'warenkorb':1
(1 row)

LC_CTYPE and LC_COLLATE are both set to de_AT.UTF-8. I guess setting
these values will fix the problem.
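
To see which databases on an instance might be affected, the LC_CTYPE of each one can be listed and compared. The sketch below assumes that the failing instance used the C or POSIX locale, which the thread does not confirm. The helper name flag_c_ctype and the sample database names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads "datname|datctype" lines on stdin and prints
# the databases whose LC_CTYPE is C or POSIX.
flag_c_ctype() {
    while IFS='|' read -r name ctype; do
        case "$ctype" in
            C|POSIX) echo "$name: LC_CTYPE=$ctype" ;;
        esac
    done
}

# Intended use against a running server (not executed here); psql -At
# prints unaligned, tuples-only rows separated by "|":
#   psql -At -c "SELECT datname, datctype FROM pg_database" | flag_c_ctype

# Demo with made-up rows:
printf 'olddb|C\nnewdb|de_AT.UTF-8\n' | flag_c_ctype
# prints: olddb: LC_CTYPE=C
```

LC_CTYPE is fixed when a database is created, so a database that needs a different value has to be created anew and the data imported into it.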

I'm going to import the databases into the new instance and then I'll
try again.

Thanks for your help,

ww
