Unicode and unaccent()

Started by Mark Borinsabout 21 years ago4 messagesgeneral

mark.borins@rigadev.com

about 21 years ago

I am trying to write an unaccent function because I need to do some queries
comparing data that has accents and data that does not.

The encoding on my DB is Unicode, so far I have found an unaccent() function
by looking in the mail archives it looks like the following:

CREATE FUNCTION unaccent(text) RETURNS text AS $$
BEGIN
RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou')
; END; $$ LANGUAGE plpgsql IMMUTABLE STRICT;

My problem is that the values like \342 are for LATIN1 type encoding. I
have tried and failed to get this working using the what I think is the
Unicode escaping method \u0032 for example.

Even if someone could help me with the Unicode escaping method that would be
useful. For example if I wanted to find a Unicode character 0x00E2 with a
select statement how would I?

Something like select * from table where field like '%\u00e2%';

Doesn't seem to work.

Does anyone have a good method for unaccenting Unicode dbs/characters?

I am using PG7.4 on FC2

Thank you

Peter Eisentraut

peter_e@gmx.net

about 21 years ago

In reply to: Mark Borins (#1)

Re: Unicode and unaccent()

Mark Borins wrote:

My problem is that the values like \342 are for LATIN1 type encoding.
I have tried and failed to get this working using the what I think
is the Unicode escaping method \u0032 for example.

There is no Unicode escaping method. You need to encode the characters
into UTF-8 yourself and write out the individual bytes using the octal
escape sequences.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Daniel Verite

daniel@manitou-mail.org

about 21 years ago

In reply to: Mark Borins (#1)

Re: Unicode and unaccent()

Mark Borins wrote:

The encoding on my DB is Unicode, so far I have found an unaccent() function
by looking in the mail archives it looks like the following:

CREATE FUNCTION unaccent(text) RETURNS text AS $$
BEGIN
RETURN translate($1, '\342\347\350\351\352\364\373', 'aceeeou')
; END; $$ LANGUAGE plpgsql IMMUTABLE STRICT;

My problem is that the values like \342 are for LATIN1 type encoding. I

Why wouldn't this:
RETURN translate($1, 'éçàêè...', 'ecaee...') ;
work just fine? It's even portable across encodings.

--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

Mark Borins

mark.borins@rigadev.com

about 21 years ago

In reply to: Peter Eisentraut (#2)

Re: Unicode and unaccent()

I am not sure how I could encode the characters into UTF-8.

For example, I went to Unicode.org and looked up in the specs for lets say
an â is 00E2. If I wanted to do search for all names with an â in them how
would I do that?

00E2 into Octal is: 342

So would I do:

Select * from table where name like '%\342%'

This leads to a greater question.
I am trying to convert a Unicode DB to Latin1 because I realized we have
absolutely no reason to be using Unicode.

When I try to restore the back of a Unicode database into Latin1 I am
getting some conversion errors as there are characters in Unicode that
cannot be converted automatically into Latin1.

These are erroneous characters and I would like to find them, I am give the
hex value of the offending character. For example, 0x00E2, how would I
search for this character?

Thanks in advance for any help.

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Peter Eisentraut
Sent: May 6, 2005 2:12 AM
To: Mark Borins
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Unicode and unaccent()

Mark Borins wrote:

My problem is that the values like \342 are for LATIN1 type encoding.
I have tried and failed to get this working using the what I think
is the Unicode escaping method \u0032 for example.

There is no Unicode escaping method. You need to encode the characters
into UTF-8 yourself and write out the individual bytes using the octal
escape sequences.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend