unaccent
Greetings,
As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL
function that uses libunac to unaccent TEXT fileds.
The resulting text is in UTF-8, but you can modify it in the sources with
an appropriate value (using iconv charset names).
Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.
Link it in postgresql:
CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;
What about integrating unaccent libraries directly in tsearch? It is useful
for french search engines (for instance).
Bye.
Nhan NGO DINH
__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/
Attachments:
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
Greetings,
As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL
function that uses libunac to unaccent TEXT fileds.The resulting text is in UTF-8, but you can modify it in the sources with
an appropriate value (using iconv charset names).Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.Link it in postgresql:
CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;What about integrating unaccent libraries directly in tsearch? It is useful
for french search engines (for instance).
I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.
btw, use palloc instead of malloc in postgresql functions .
Bye.
Nhan NGO DINH
__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione nО©╫ di ricarica!
http://ricaricasaonline.tiscali.it/
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.
I think about --with-unaccent for PostgreSQL and to_ascii() in
main tree. Comment?
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
The best way to use it is quite simple. If you want to index the table "titles"
and "title" is the field containing the text to be indexed, you can create
another unaccented field, for instance "utitle".
UPDATE titles SET utitle = unac(title);
Of course you can set it up as a trigger function. Then you can use utitle
with txt2txtidx and tsearch.
Another solution is to generate the txtidx field (i.e. titleidx) directly
using unac:
UPDATE titles SET titleidx = txt2txtidx(unac(title));
But the problem is that I've not succeeded using it with tsearch because
(of course) it doesn't allow functions as parameters. So my first idea was
to integrate unac in tsearch.
Bye.
-- Messaggio Originale --
Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
To: nngodinh@tiscali.it
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccentOn Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
Greetings,
As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implemented
a
PostgreSQL
function that uses libunac to unaccent TEXT fileds.
The resulting text is in UTF-8, but you can modify it in the sources
with
an appropriate value (using iconv charset names).
Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.Link it in postgresql:
CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;What about integrating unaccent libraries directly in tsearch? It is
useful
for french search engines (for instance).
I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.btw, use palloc instead of malloc in postgresql functions .
Bye.
Nhan NGO DINH
__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione n? di ricarica!
http://ricaricasaonline.tiscali.it/Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/
Import Notes
Resolved by subject fallback
Not "to_ascii", since there are so many extended UNICODE characters that
doesn't have any accent and should not be converted to an ASCII character.
-- Messaggio Originale --
Date: Wed, 18 Sep 2002 14:24:26 +0200
From: Karel Zak <zakkr@zf.jcu.cz>
To: Oleg Bartunov <oleg@sai.msu.su>
Cc: nngodinh@tiscali.it, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccentOn Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.I think about --with-unaccent for PostgreSQL and to_ascii() in
main tree. Comment?Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/
On Wed, 18 Sep 2002, Karel Zak wrote:
On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.I think about --with-unaccent for PostgreSQL and to_ascii() in
main tree. Comment?
Hmm, it'd require linking yet another library. contrib module is
a standard way to test/develope possible future feature.
Karel
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
On Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
The best way to use it is quite simple. If you want to index the table "titles"
and "title" is the field containing the text to be indexed, you can create
another unaccented field, for instance "utitle".UPDATE titles SET utitle = unac(title);
Of course you can set it up as a trigger function. Then you can use utitle
with txt2txtidx and tsearch.Another solution is to generate the txtidx field (i.e. titleidx) directly
using unac:UPDATE titles SET titleidx = txt2txtidx(unac(title));
But the problem is that I've not succeeded using it with tsearch because
(of course) it doesn't allow functions as parameters. So my first idea was
to integrate unac in tsearch.
what's exactly a problem ?
UPDATE titles SET titleidx = txt2txtidx(unac(title));
works fine. Perhaps, you have a problem with query ?
Bye.
-- Messaggio Originale --
Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
To: nngodinh@tiscali.it
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccentOn Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
Greetings,
As far as I use the txtidx data structure in conjunction with gist indexing
to make a word indexing of a very large UNICODE db, I've implementeda
PostgreSQL
function that uses libunac to unaccent TEXT fileds.
The resulting text is in UTF-8, but you can modify it in the sources
with
an appropriate value (using iconv charset names).
Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.Link it in postgresql:
CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;What about integrating unaccent libraries directly in tsearch? It is
useful
for french search engines (for instance).
I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.btw, use palloc instead of malloc in postgresql functions .
Bye.
Nhan NGO DINH
__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione n? di ricarica!
http://ricaricasaonline.tiscali.it/Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione nО©╫ di ricarica!
http://ricaricasaonline.tiscali.it/---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
The txt2txtidx function works fine with unac. The problem is with the trigger:
create trigger txtidxupdate before update or insert on titles for each row
execute procedure tsearch(titleidx, title);
As you know tsearch(titleidx, unac(title)) doesn't work.
-- Messaggio Originale --
Date: Wed, 18 Sep 2002 17:04:56 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
To: nngodinh@tiscali.it
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccentOn Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
The best way to use it is quite simple. If you want to index the table
"titles"
and "title" is the field containing the text to be indexed, you can create
another unaccented field, for instance "utitle".UPDATE titles SET utitle = unac(title);
Of course you can set it up as a trigger function. Then you can use utitle
with txt2txtidx and tsearch.Another solution is to generate the txtidx field (i.e. titleidx) directly
using unac:UPDATE titles SET titleidx = txt2txtidx(unac(title));
But the problem is that I've not succeeded using it with tsearch because
(of course) it doesn't allow functions as parameters. So my first ideawas
to integrate unac in tsearch.
what's exactly a problem ?
UPDATE titles SET titleidx = txt2txtidx(unac(title));
works fine. Perhaps, you have a problem with query ?Bye.
-- Messaggio Originale --
Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
From: Oleg Bartunov <oleg@sai.msu.su>
To: nngodinh@tiscali.it
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] unaccentOn Wed, 18 Sep 2002 nngodinh@tiscali.it wrote:
Greetings,
As far as I use the txtidx data structure in conjunction with gist
indexing
to make a word indexing of a very large UNICODE db, I've implemented
a
PostgreSQL
function that uses libunac to unaccent TEXT fileds.
The resulting text is in UTF-8, but you can modify it in the sources
with
an appropriate value (using iconv charset names).
Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
Extract the archive, compile it (make). Move pg_unac.so to your postgresql
shared libraries dir.Link it in postgresql:
CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
C;What about integrating unaccent libraries directly in tsearch? It
is
useful
for french search engines (for instance).
I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.btw, use palloc instead of malloc in postgresql functions .
Bye.
Nhan NGO DINH
__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione n? di ricarica!
http://ricaricasaonline.tiscali.it/Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione n? di ricarica!
http://ricaricasaonline.tiscali.it/---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmasterRegards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
__________________________________________________________________
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/
nngodinh@tiscali.it writes:
Not "to_ascii", since there are so many extended UNICODE characters that
doesn't have any accent and should not be converted to an ASCII character.
Really, the accent conversion should be part of the character set
conversion routines. At least my local iconv does that.
In general, the determination of what is an accent and how to convert it
is both dependent on locale and the intended usage. It's not clear how
that should be handled.
--
Peter Eisentraut peter_e@gmx.net