Full text index without accents

Started by Fco. Mario Barcalaalmost 18 years ago9 messagesgeneral
Jump to latest
#1Fco. Mario Barcala
lbarcala@freeresearch.org

Hi again:

I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.

Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?

Regars,

Mario Barcala

#2Oleg Bartunov
oleg@sai.msu.su
In reply to: Fco. Mario Barcala (#1)
Re: Full text index without accents

You can preprocess text (replace accent by nothing) before
to_tsvector or to_tsquery

Oleg
On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:

Hi again:

I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.

Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?

Regars,

Mario Barcala

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#3Fco. Mario Barcala
lbarcala@freeresearch.org
In reply to: Oleg Bartunov (#2)
Re: Full text index without accents

And which are the types of argument and returning values of a pl/sql
function which preprocess de text?

I have been searching that, for example, something like this works fine:

CREATE INDEX textindex ON document USING
gin(to_tsvector('english',upper(text)));

where text is the text column of document. But I have tried to do
something like:

CREATE INDEX textindex ON document USING
gin(to_tsvector('english',myfunction(text)));

where myfunction is a PL/SQL function which call upper one, but I didn't
find which are the types of the myfunction argument and returning value.

I am a PL/SQL novice and I didn't find how to do it yet. Of course, then
I will have to change upper experiment to my objective: to index without
accents. I don't know if PL/SQL is the better option to build such
function.

Thanks,

Mario Barcala

Show quoted text

You can preprocess text (replace accent by nothing) before
to_tsvector or to_tsquery

Oleg
On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:

Hi again:

I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.

Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?

Regars,

Mario Barcala

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#4Oleg Bartunov
oleg@sai.msu.su
In reply to: Fco. Mario Barcala (#3)
Re: Full text index without accents

Here is an example

CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;

arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
to_tsvector
-------------------------
'oleg':1 'sai.msu.su':2

On Tue, 22 Jul 2008, Fco. Mario Barcala Rodr?guez wrote:

And which are the types of argument and returning values of a pl/sql
function which preprocess de text?

I have been searching that, for example, something like this works fine:

CREATE INDEX textindex ON document USING
gin(to_tsvector('english',upper(text)));

where text is the text column of document. But I have tried to do
something like:

CREATE INDEX textindex ON document USING
gin(to_tsvector('english',myfunction(text)));

where myfunction is a PL/SQL function which call upper one, but I didn't
find which are the types of the myfunction argument and returning value.

I am a PL/SQL novice and I didn't find how to do it yet. Of course, then
I will have to change upper experiment to my objective: to index without
accents. I don't know if PL/SQL is the better option to build such
function.

Thanks,

Mario Barcala

You can preprocess text (replace accent by nothing) before
to_tsvector or to_tsquery

Oleg
On Thu, 3 Jul 2008, lbarcala@freeresearch.org wrote:

Hi again:

I am trying to create a full text configuration to ignore word accents in
my searches. My approach is similar to simple dicionary one, but i want to
remove accents after converting to lower.

Is it the only way to do it to develop another .c and write my own
dict_noaccent.c, and then compile and install it into the system?

Regars,

Mario Barcala

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#5Fco. Mario Barcala
lbarcala@freeresearch.org
In reply to: Oleg Bartunov (#4)
Re: Full text index without accents

Finally I create a function like:

CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$
DECLARE
var1 varchar;
BEGIN
var1=replace($1, 'á', 'a');
var1=replace(var1, 'é', 'e');
var1=replace(var1, 'í', 'i');
var1=replace(var1, 'ó', 'o');
var1=replace(var1, 'ú', 'u');
var1=replace(var1, 'Á', 'A');
var1=replace(var1, 'É', 'E');
var1=replace(var1, 'Í', 'I');
var1=replace(var1, 'Ó', 'O');
var1=replace(var1, 'Ú', 'U');
return var1;
END
$$LANGUAGE plpgsql immutable;

Then, create text indexes, one for sensible queries and other for
unsensible ones:

CREATE INDEX textindex ON document USING
gin(to_tsvector('spanish',text));

CREATE INDEX textindexn ON document USING
gin(to_tsvector('spanish',nonsensible(text)));

And then make a query sensible or unsensible to accents doing:

SELECT id FROM document WHERE to_tsvector('spanish',text) @@
to_tsquery('spanish','word_with_accent');

or:

SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text))
@@ to_tsquery('spanish',nonsensible('word_with_accent'));
respectively.

I think postgreSQL uses both indexes as necessary. I believe to remember
reading something about it in the documentation.

Thank you very much,

Mario Barcala

Show quoted text

Here is an example

CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;

arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
to_tsvector
-------------------------
'oleg':1 'sai.msu.su':2

#6Jonathan Bond-Caron
jbondc@openmv.com
In reply to: Fco. Mario Barcala (#5)
Re: Full text index without accents

This would probably help:

CREATE OR REPLACE FUNCTION norm_text_latin(character varying)
RETURNS character varying AS
$BODY$
declare
p_str alias for $1;
v_str varchar;
begin
select translate(p_str, 'ÀÁÂÃÄÅ', 'AAAAAA') into v_str;
select translate(v_str, 'ÉÈËÊ', 'EEEE') into v_str;
select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str;
select translate(v_str, 'ÌÍÎÏ', 'IIII') into v_str;
select translate(v_str, 'ÒÓÔÕÖ', 'OOOOO') into v_str;
select translate(v_str, 'ÙÚÛÜ', 'UUUU') into v_str;
select translate(v_str, 'àáâãäå', 'aaaaaa') into v_str;
select translate(v_str, 'èéêë', 'eeee') into v_str;
select translate(v_str, 'ìíîï', 'iiii') into v_str;
select translate(v_str, 'òóôõö', 'ooooo') into v_str;
select translate(v_str, 'ùúûü', 'uuuu') into v_str;
select translate(v_str, 'Çç', 'Cc') into v_str;
return v_str;
end;$BODY$
LANGUAGE 'plpgsql' VOLATILE;

There's also o useful functions here:
http://www.project-open.org/doc/intranet-search-pg/intranet-search-pg-create.sql

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Fco. Mario Barcala Rodríguez
Sent: July 24, 2008 4:47 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Full text index without accents

Finally I create a function like:

CREATE OR REPLACE FUNCTION nonsensible (text) RETURNS text AS $$
DECLARE
var1 varchar;
BEGIN
var1=replace($1, 'á', 'a');
var1=replace(var1, 'é', 'e');
var1=replace(var1, 'í', 'i');
var1=replace(var1, 'ó', 'o');
var1=replace(var1, 'ú', 'u');
var1=replace(var1, 'Á', 'A');
var1=replace(var1, 'É', 'E');
var1=replace(var1, 'Í', 'I');
var1=replace(var1, 'Ó', 'O');
var1=replace(var1, 'Ú', 'U');
return var1;
END
$$LANGUAGE plpgsql immutable;

Then, create text indexes, one for sensible queries and other for
unsensible ones:

CREATE INDEX textindex ON document USING
gin(to_tsvector('spanish',text));

CREATE INDEX textindexn ON document USING
gin(to_tsvector('spanish',nonsensible(text)));

And then make a query sensible or unsensible to accents doing:

SELECT id FROM document WHERE to_tsvector('spanish',text) @@
to_tsquery('spanish','word_with_accent');

or:

SELECT id FROM document WHERE to_tsvector('spanish',nonsensible(text))
@@ to_tsquery('spanish',nonsensible('word_with_accent'));
respectively.

I think postgreSQL uses both indexes as necessary. I believe to remember
reading something about it in the documentation.

Thank you very much,

Mario Barcala

Here is an example

CREATE FUNCTION dropatsymbol(text) RETURNS text
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;

arxiv=# select to_tsvector('english',dropatsymbol('oleg@sai.msu.su'));
to_tsvector
-------------------------
'oleg':1 'sai.msu.su':2

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Jonathan Bond-Caron (#6)
Re: Full text index without accents

Jonathan Bond-Caron wrote:

This would probably help:

CREATE OR REPLACE FUNCTION norm_text_latin(character varying)
RETURNS character varying AS
$BODY$
declare
p_str alias for $1;
v_str varchar;
begin
select translate(p_str, '������', 'AAAAAA') into v_str;

Hmm, why not simply use to_ascii() ?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#7)
Re: Full text index without accents

Alvaro Herrera <alvherre@commandprompt.com> writes:

Hmm, why not simply use to_ascii() ?

The big problem with to_ascii is its inadequate set of supported
encodings. Somebody *really* needs to give it some love on that
front.

regards, tom lane

#9Jonathan Bond-Caron
jbondc@gmail.com
In reply to: Tom Lane (#8)
Re: Full text index without accents

Ya the function name norm_text_latin() was probably misleading, it takes
latin1-ish characters *encoded in UTF8* and brings them to ascii.

Definitely, the following would be much simpler:
SELECT to_ascii('ÀÁÂÃÄÅÒÓÔÕÖ', 'UTF8')

As of 8.3, you have to do some magic with to_ascii() and utf8 characters

SELECT to_ascii(convert_to_latin('ÀÁÂÃÄÅÒÓÔÕÖ'), 'LATIN1')

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
Sent: July 25, 2008 10:42 AM
To: Alvaro Herrera
Cc: Jonathan Bond-Caron; 'Fco. Mario Barcala Rodríguez';
pgsql-general@postgresql.org
Subject: Re: [GENERAL] Full text index without accents

Alvaro Herrera <alvherre@commandprompt.com> writes:

Hmm, why not simply use to_ascii() ?

The big problem with to_ascii is its inadequate set of supported
encodings. Somebody *really* needs to give it some love on that
front.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general