unaccent

Started by Nonameover 23 years ago9 messages
#1Noname
nngodinh@tiscali.it
1 attachment(s)

Greetings,

As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL
function that uses libunac to unaccent TEXT fileds.

The resulting text is in UTF-8, but you can modify it in the sources with
an appropriate value (using iconv charset names).

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

Link it in postgresql:

CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;

What about integrating unaccent libraries directly in tsearch? It is useful
for french search engines (for instance).

Bye.

Nhan NGO DINH

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/

Attachments:

pg_unac-1.0.tar.gzapplication/x-gzip-compressedDownload
#2Oleg Bartunov
oleg@sai.msu.su
In reply to: Noname (#1)
Re: unaccent

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

Greetings,

As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL
function that uses libunac to unaccent TEXT fileds.

The resulting text is in UTF-8, but you can modify it in the sources with
an appropriate value (using iconv charset names).

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

Link it in postgresql:

CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;

What about integrating unaccent libraries directly in tsearch? It is useful
for french search engines (for instance).

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

btw, use palloc instead of malloc in postgresql functions .

Bye.

Nhan NGO DINH

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione nО©╫ di ricarica!
http://ricaricasaonline.tiscali.it/

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#3Karel Zak
zakkr@zf.jcu.cz
In reply to: Oleg Bartunov (#2)
Re: unaccent

On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

I think about --with-unaccent for PostgreSQL and to_ascii() in
main tree. Comment?

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

#4Noname
nngodinh@tiscali.it
In reply to: Karel Zak (#3)
Re: unaccent

The best way to use it is quite simple. If you want to index the table "titles"
and "title" is the field containing the text to be indexed, you can create
another unaccented field, for instance "utitle".

UPDATE titles SET utitle = unac(title);

Of course you can set it up as a trigger function. Then you can use utitle
with txt2txtidx and tsearch.

Another solution is to generate the txtidx field (i.e. titleidx) directly
using unac:

UPDATE titles SET titleidx = txt2txtidx(unac(title));

But the problem is that I've not succeeded using it with tsearch because
(of course) it doesn't allow functions as parameters. So my first idea was
to integrate unac in tsearch.

Bye.

-- Messaggio Originale --
Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
To: nngodinh@tiscali.it
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccent

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

Greetings,

As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implemented

a

PostgreSQL

function that uses libunac to unaccent TEXT fileds.

The resulting text is in UTF-8, but you can modify it in the sources

with

an appropriate value (using iconv charset names).

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

Link it in postgresql:

CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;

What about integrating unaccent libraries directly in tsearch? It is

useful

for french search engines (for instance).

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

btw, use palloc instead of malloc in postgresql functions .

Bye.

Nhan NGO DINH

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione n? di ricarica!
http://ricaricasaonline.tiscali.it/

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/

#5Noname
nngodinh@tiscali.it
In reply to: Karel Zak (#3)
Re: unaccent

Not "to_ascii", since there are so many extended UNICODE characters that
doesn't have any accent and should not be converted to an ASCII character.

-- Messaggio Originale --
Date: Wed, 18 Sep 2002 14:24:26 +0200
From: Karel Zak <zakkr@zf.jcu.cz>
To: Oleg Bartunov <oleg@sai.msu.su>
Cc: nngodinh@tiscali.it, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccent

On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

I think about --with-unaccent for PostgreSQL and to_ascii() in
main tree. Comment?

Karel

--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/

C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/

#6Oleg Bartunov
oleg@sai.msu.su
In reply to: Karel Zak (#3)
Re: unaccent

On Wed, 18 Sep 2002, Karel Zak wrote:

On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

I think about --with-unaccent for PostgreSQL and to_ascii() in
main tree. Comment?

Hmm, it'd require linking yet another library. contrib module is
a standard way to test/develope possible future feature.

Karel

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#7Oleg Bartunov
oleg@sai.msu.su
In reply to: Noname (#4)
Re: unaccent

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

The best way to use it is quite simple. If you want to index the table "titles"
and "title" is the field containing the text to be indexed, you can create
another unaccented field, for instance "utitle".

UPDATE titles SET utitle = unac(title);

Of course you can set it up as a trigger function. Then you can use utitle
with txt2txtidx and tsearch.

Another solution is to generate the txtidx field (i.e. titleidx) directly
using unac:

UPDATE titles SET titleidx = txt2txtidx(unac(title));

But the problem is that I've not succeeded using it with tsearch because
(of course) it doesn't allow functions as parameters. So my first idea was
to integrate unac in tsearch.

what's exactly a problem ?
UPDATE titles SET titleidx = txt2txtidx(unac(title));
works fine. Perhaps, you have a problem with query ?

Bye.

-- Messaggio Originale --
Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
To: nngodinh@tiscali.it
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccent

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

Greetings,

As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implemented

a

PostgreSQL

function that uses libunac to unaccent TEXT fileds.

The resulting text is in UTF-8, but you can modify it in the sources

with

an appropriate value (using iconv charset names).

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

Link it in postgresql:

CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;

What about integrating unaccent libraries directly in tsearch? It is

useful

for french search engines (for instance).

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

btw, use palloc instead of malloc in postgresql functions .

Bye.

Nhan NGO DINH

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione n? di ricarica!
http://ricaricasaonline.tiscali.it/

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione nО©╫ di ricarica!
http://ricaricasaonline.tiscali.it/

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#8Noname
nngodinh@tiscali.it
In reply to: Oleg Bartunov (#7)
Re: unaccent

The txt2txtidx function works fine with unac. The problem is with the trigger:

create trigger txtidxupdate before update or insert on titles for each row
execute procedure tsearch(titleidx, title);

As you know tsearch(titleidx, unac(title)) doesn't work.

-- Messaggio Originale --
Date: Wed, 18 Sep 2002 17:04:56 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
To: nngodinh@tiscali.it
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccent

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

The best way to use it is quite simple. If you want to index the table

"titles"

and "title" is the field containing the text to be indexed, you can create
another unaccented field, for instance "utitle".

UPDATE titles SET utitle = unac(title);

Of course you can set it up as a trigger function. Then you can use utitle
with txt2txtidx and tsearch.

Another solution is to generate the txtidx field (i.e. titleidx) directly
using unac:

UPDATE titles SET titleidx = txt2txtidx(unac(title));

But the problem is that I've not succeeded using it with tsearch because
(of course) it doesn't allow functions as parameters. So my first idea

was

to integrate unac in tsearch.

what's exactly a problem ?
UPDATE titles SET titleidx = txt2txtidx(unac(title));
works fine. Perhaps, you have a problem with query ?

Bye.

-- Messaggio Originale --
Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
To: nngodinh@tiscali.it
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccent

On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:

Greetings,

As far as I use the txtidx data structure in conjunction with gist

indexing

to make a word indexing of a very large UNICODE db, I've implemented

a

PostgreSQL

function that uses libunac to unaccent TEXT fileds.

The resulting text is in UTF-8, but you can modify it in the sources

with

an appropriate value (using iconv charset names).

Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)

Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.

Link it in postgresql:

CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;

What about integrating unaccent libraries directly in tsearch? It

is

useful

for french search engines (for instance).

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

btw, use palloc instead of malloc in postgresql functions .

Bye.

Nhan NGO DINH

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione n? di ricarica!
http://ricaricasaonline.tiscali.it/

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione n? di ricarica!
http://ricaricasaonline.tiscali.it/

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/

#9Peter Eisentraut
peter_e@gmx.net
In reply to: Noname (#5)
Re: unaccent

nngodinh@tiscali.it writes:

Not "to_ascii", since there are so many extended UNICODE characters that
doesn't have any accent and should not be converted to an ASCII character.

Really, the accent conversion should be part of the character set
conversion routines. At least my local iconv does that.

In general, the determination of what is an accent and how to convert it
is both dependent on locale and the intended usage. It's not clear how
that should be handled.

--
Peter Eisentraut peter_e@gmx.net