UTF-8 and Regular expression

Started by Håvard Wahl Kongsgård almost 15 years ago · 2 messages · general
#1 Håvard Wahl Kongsgård
haavard.kongsgaard@gmail.com

Hi, in 8.4 how do the regular expression functions in PostgreSQL handle
special UTF-8 characters?

for example:
SELECT name,substring(name from E'\\w+\\s(\\w+)$') from nodes;
fails to select characters like ü ø æ å

--
Håvard Wahl Kongsgård

http://havard.security-review.net/

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Håvard Wahl Kongsgård (#1)
Re: UTF-8 and Regular expression

Håvard Wahl Kongsgård <haavard.kongsgaard@gmail.com> writes:

Hi, in 8.4 how do the regular expression functions in PostgreSQL handle
special UTF-8 characters?

Badly :-(

for example:
SELECT name,substring(name from E'\\w+\\s(\\w+)$') from nodes;
fails to select characters like ü ø æ å

Should work in 9.0, but no chance in earlier releases. You need to use
a single-byte encoding such as LATIN1 if you need to do this in older
releases. In any release, make sure you're using an LC_COLLATE setting
that's appropriate for the language and encoding.

regards, tom lane
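
The advice above can be checked with a short sketch like the one below. It is only an illustration under stated assumptions: the inline VALUES rows are hypothetical stand-ins for the real nodes table, and the client encoding is assumed to match the text being sent. The first query reads the encoding and locale of the current database from pg_database; the second runs the regular expression from the original question.

```sql
-- Show the encoding and locale settings of the current database.
-- datcollate and datctype are per-database columns since 8.4.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();

-- Hypothetical sample names standing in for the real "nodes" table.
-- The pattern is the one from the question: capture the last word.
SELECT name,
       substring(name from E'\\w+\\s(\\w+)$') AS last_word
FROM (VALUES ('Jørgen Møller'),
             ('Björn Müller'),
             ('John Smith')) AS nodes(name);
```

If the locale columns report C or POSIX, non-ASCII letters are unlikely to be classified as word characters, which would be consistent with the behavior described in the question. On a UTF-8 database running 9.0 or later with a language-appropriate locale, the second query should be able to capture last words that contain characters such as ø or ü.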