Full text index without accents
Hi again:
I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.
Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?
Regars,
Mario Barcala
You can preprocess text (replace accent by nothing) before
to_tsvector or to_tsquery
Oleg
On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:
Hi again:
I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?Regars,
Mario Barcala
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
And which are the types of argument and returning values of a pl/sql
function which preprocess de text?
I have been searching that, for example, something like this works fine:
CREATE INDEX textindex ON document USING
gin(to_tsvector('english',upper(text)));
where text is the text column of document. But I have tried to do
something like:
CREATE INDEX textindex ON document USING
gin(to_tsvector('english',myfunction(text)));
where myfunction is a PL/SQL function which call upper one, but I didn't
find which are the types of the myfunction argument and returning value.
I am a PL/SQL novice and I didn't find how to do it yet. Of course, then
I will have to change upper experiment to my objective: to index without
accents. I don't know if PL/SQL is the better option to build such
function.
Thanks,
Mario Barcala
Show quoted text
You can preprocess text (replace accent by nothing) before
to_tsvector or to_tsqueryOleg
On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:Hi again:
I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?Regars,
Mario Barcala
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Here is an example
CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;
arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
to_tsvector
-------------------------
'oleg':1 'sai.msu.su':2
On Tue, 22 Jul 2008, Fco. Mario Barcala Rodr?guez wrote:
And which are the types of argument and returning values of a pl/sql
function which preprocess de text?I have been searching that, for example, something like this works fine:
CREATE INDEX textindex ON document USING
gin(to_tsvector('english',upper(text)));where text is the text column of document. But I have tried to do
something like:CREATE INDEX textindex ON document USING
gin(to_tsvector('english',myfunction(text)));where myfunction is a PL/SQL function which call upper one, but I didn't
find which are the types of the myfunction argument and returning value.I am a PL/SQL novice and I didn't find how to do it yet. Of course, then
I will have to change upper experiment to my objective: to index without
accents. I don't know if PL/SQL is the better option to build such
function.Thanks,
Mario Barcala
You can preprocess text (replace accent by nothing) before
to_tsvector or to_tsqueryOleg
On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:Hi again:
I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?Regars,
Mario Barcala
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Finally I create a function like:
CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$
DECLARE
var1 varchar;
BEGIN
var1=replace($1, 'á', 'a');
var1=replace(var1, 'é', 'e');
var1=replace(var1, 'í', 'i');
var1=replace(var1, 'ó', 'o');
var1=replace(var1, 'ú', 'u');
var1=replace(var1, 'Á', 'A');
var1=replace(var1, 'É', 'E');
var1=replace(var1, 'Í', 'I');
var1=replace(var1, 'Ó', 'O');
var1=replace(var1, 'Ú', 'U');
return var1;
END
$$LANGUAGE plpgsql immutable;
Then, create text indexes, one for sensible queries and other for
unsensible ones:
CREATE INDEX textindex ON document USING
gin(to_tsvector('spanish',text));
CREATE INDEX textindexn ON document USING
gin(to_tsvector('spanish',nonsensible(text)));
And then make a query sensible or unsensible to accents doing:
SELECT id FROM document WHERE to_tsvector('spanish',text) @@
to_tsquery('spanish','word_with_accent');
or:
SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text))
@@ to_tsquery('spanish',nonsensible('word_with_accent'));
respectively.
I think postgreSQL uses both indexes as necessary. I believe to remember
reading something about it in the documentation.
Thank you very much,
Mario Barcala
Show quoted text
Here is an example
CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
to_tsvector
-------------------------
'oleg':1 'sai.msu.su':2
This would probably help:
CREATE OR REPLACE FUNCTION norm_text_latin(character varying)
RETURNS character varying AS
$BODY$
declare
p_str alias for $1;
v_str varchar;
begin
select translate(p_str, 'ÀÁÂÃÄÅ', 'AAAAAA') into v_str;
select translate(v_str, 'ÉÈËÊ', 'EEEE') into v_str;
select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str;
select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str;
select translate(v_str, 'ÒÓÔÕÖ', 'OOOOO') into v_str;
select translate(v_str, 'ÙÚÛÜ', 'UUUU') into v_str;
select translate(v_str, 'àáâãäå', 'aaaaaa') into v_str;
select translate(v_str, 'èéêë', 'eeee') into v_str;
select translate(v_str, 'ìíîï', 'iiii') into v_str;
select translate(v_str, 'òóôõö', 'ooooo') into v_str;
select translate(v_str, 'ùúûü', 'uuuu') into v_str;
select translate(v_str, 'Çç', 'Cc') into v_str;
return v_str;
end;$BODY$
LANGUAGE 'plpgsql' VOLATILE;
There's also o useful functions here:
http://www.project-open.org/doc/intranet-search-pg/intranet-search-pg-create.sql
-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Fco. Mario Barcala Rodríguez
Sent: July 24, 2008 4:47 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Full text index without accents
Finally I create a function like:
CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$
DECLARE
var1 varchar;
BEGIN
var1=replace($1, 'á', 'a');
var1=replace(var1, 'é', 'e');
var1=replace(var1, 'í', 'i');
var1=replace(var1, 'ó', 'o');
var1=replace(var1, 'ú', 'u');
var1=replace(var1, 'Á', 'A');
var1=replace(var1, 'É', 'E');
var1=replace(var1, 'Í', 'I');
var1=replace(var1, 'Ó', 'O');
var1=replace(var1, 'Ú', 'U');
return var1;
END
$$LANGUAGE plpgsql immutable;
Then, create text indexes, one for sensible queries and other for
unsensible ones:
CREATE INDEX textindex ON document USING
gin(to_tsvector('spanish',text));
CREATE INDEX textindexn ON document USING
gin(to_tsvector('spanish',nonsensible(text)));
And then make a query sensible or unsensible to accents doing:
SELECT id FROM document WHERE to_tsvector('spanish',text) @@
to_tsquery('spanish','word_with_accent');
or:
SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text))
@@ to_tsquery('spanish',nonsensible('word_with_accent'));
respectively.
I think postgreSQL uses both indexes as necessary. I believe to remember
reading something about it in the documentation.
Thank you very much,
Mario Barcala
Here is an example
CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
to_tsvector
-------------------------
'oleg':1 'sai.msu.su':2
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Jonathan Bond-Caron wrote:
This would probably help:
CREATE OR REPLACE FUNCTION norm_text_latin(character varying)
RETURNS character varying AS
$BODY$
declare
p_str alias for $1;
v_str varchar;
begin
select translate(p_str, '������', 'AAAAAA') into v_str;
Hmm, why not simply use to_ascii() ?
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Hmm, why not simply use to_ascii() ?
The big problem with to_ascii is its inadequate set of supported
encodings. Somebody *really* needs to give it some love on that
front.
regards, tom lane
Ya the function name norm_text_latin() was probably misleading, it takes
latin1-ish characters *encoded in UTF8* and brings them to ascii.
Definitely, the following would be much simpler:
SELECT to_ascii('ÀÁÂÃÄÅÒÓÔÕÖ', 'UTF8')
As of 8.3, you have to do some magic with to_ascii() and utf8 characters
SELECT to_ascii(convert_to_latin('ÀÁÂÃÄÅÒÓÔÕÖ'), 'LATIN1')
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: July 25, 2008 10:42 AM
To: Alvaro Herrera
Cc: Jonathan Bond-Caron; 'Fco. Mario Barcala Rodríguez';
pgsql-general@postgresql.org
Subject: Re: [GENERAL] Full text index without accents
Alvaro Herrera <alvherre@commandprompt.com> writes:
Hmm, why not simply use to_ascii() ?
The big problem with to_ascii is its inadequate set of supported
encodings. Somebody *really* needs to give it some love on that
front.
regards, tom lane
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general