SOLVED: Re: UTF-8 and stripping accents

Started by Christopher Murtagh, almost 20 years ago · 2 messages · general
#1 Christopher Murtagh
christopher.murtagh@gmail.com

Hey, I solved my own problem! I'm posting here because while I was
looking for solutions, I found tons of folks tackling the same
problem; most either didn't find a solution or had to resort to
cumbersome translate() calls to get what they wanted.

The difference between my 7.4.6 and 8.1.4 DBs was that 7.4.6 had
UNICODE as its encoding, whereas the 8.1.4 was UTF8. So, the 7.4.6
function needs the decode() call and the 8.1.4 one doesn't.

Also, I had to escape the '\' in the regex, since the function body
is a single-quoted string: \p{Mn} has to be written as \\p{Mn}.

So, for the record, to strip out all accents from UTF8 encoded text:

CREATE OR REPLACE FUNCTION strip_accents(text) RETURNS text
AS '
use Unicode::Normalize;
use Encode;

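# Decompose so each accent becomes a separate combining mark.
my $string = NFD($_[0]);
# Remove every nonspacing mark (Unicode category Mn).
$string =~ s/\\p{Mn}//ogsm;
# Recompose whatever is left.
return NFC($string);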
'
LANGUAGE plperlu;
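
For anyone who wants to try the normalize-and-strip steps outside the database first, here is a minimal standalone sketch. The sub name strip_accents_demo and the sample string are made up for illustration, and it assumes the input is already a Perl character string rather than raw bytes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Hypothetical standalone copy of the function body, for trying the
# idea outside PostgreSQL. Expects a character string, not raw bytes.
sub strip_accents_demo {
    my ($text) = @_;
    my $decomposed = NFD($text);    # e.g. U+00E9 becomes "e" + U+0301
    $decomposed =~ s/\p{Mn}//g;     # drop nonspacing marks (the accents)
    return NFC($decomposed);        # recompose whatever is left
}

# Accented letters are written as \x{...} escapes so the script does
# not depend on the encoding of the source file.
print strip_accents_demo("Cr\x{e8}me Br\x{fb}l\x{e9}e"), "\n";   # prints "Creme Brulee"
```

Note that there is only a single backslash in \p{Mn} here, because the code is not wrapped in a single-quoted SQL string.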

For the 7.4.6 DB whose encoding was UNICODE, a slight difference:

CREATE OR REPLACE FUNCTION strip_accents(text) RETURNS text
AS '
use Unicode::Normalize;
use Encode;

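# Decode the argument as UTF-8 first, then decompose it.
my $string = NFD(decode( utf8 => $_[0]));
# Remove every nonspacing mark (Unicode category Mn).
$string =~ s/\\p{Mn}//ogsm;
# Recompose whatever is left.
return NFC($string);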
'
LANGUAGE plperlu;
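
A rough sketch of why the decode() call can make a difference, run outside the database. It assumes (and this is only a guess about what the older setup did) that the function received raw UTF-8 bytes instead of a Perl character string. The sub name strip_marks and the sample input are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);
use Encode qw(decode);

# Same normalize-and-strip steps as in the function body.
sub strip_marks {
    my ($s) = @_;
    $s = NFD($s);
    $s =~ s/\p{Mn}//g;
    return NFC($s);
}

# "cafe" with an acute accent on the e, as raw UTF-8 bytes:
# the accented letter is the two bytes C3 A9.
my $bytes = "caf\xc3\xa9";

# Without decoding, each byte is treated as its own character, so the
# accented letter is never seen as one letter plus one accent.
my $undecoded = strip_marks($bytes);

# After decoding, the two bytes become the single character U+00E9.
my $decoded = strip_marks(decode('UTF-8', $bytes));

print "decoded: $decoded\n";    # prints "decoded: cafe"
print "undecoded gives cafe: ", ($undecoded eq 'cafe' ? "yes" : "no"), "\n";   # prints "no"
```

If the string already arrives as characters, as seems to be the case on the 8.1.4 setup, the extra decode() is not needed.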

I hope this is of some use to other folks here. Thanks to Mike
Rylander for the initial code.

Cheers,

Chris

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Murtagh (#1)
Re: SOLVED: Re: UTF-8 and stripping accents

"Christopher Murtagh" <christopher.murtagh@gmail.com> writes:

> The difference between my 7.4.6 and 8.1.4 DBs was that 7.4.6 had
> UNICODE as its encoding, whereas the 8.1.4 was UTF8.

I don't think so, because those are the same thing. We just adopted a
more technically-correct name for the encoding.

My recollection is that we tweaked some things associated with passing
UTF8 strings into and out of plperl functions somewhere along the line.
The difference in behavior might be due to that.

regards, tom lane