BUG #4200: Regexp character classes not UTF8-compliant

Started by Jean-Baptiste Quenot, almost 18 years ago, 2 messages, bugs
#1 Jean-Baptiste Quenot
jbq@caraldi.com

The following bug has been logged online:

Bug reference: 4200
Logged by: Jean-Baptiste Quenot
Email address: jbq@caraldi.com
PostgreSQL version: 8.3.1
Operating system: Linux Ubuntu Hardy
Description: Regexp character classes not UTF8-compliant
Details:

PostgreSQL documentation at
http://www.postgresql.org/docs/8.3/static/functions-matching.html describes
the various character classes, and they can be used to match or replace
strings with regexp support. However, the [:alnum:] and [:alpha:] character
classes are not UTF8-compliant, as shown in the examples below:

dockee=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)

dockee=# show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
(1 row)

dockee=# select regexp_replace('bébéàu', '[[:alnum:]]', '', 'g');
regexp_replace
----------------
ééà
(1 row)

ovhdev=# select regexp_replace('bébéàu', '[[:alpha:]]', '', 'g');
regexp_replace
----------------
ééà
(1 row)

dockee=# select regexp_replace('bébéàu', $$\w$$, '', 'g');
regexp_replace
----------------
ééà
(1 row)

Only characters in the ASCII range were detected as belonging to the
[:alnum:] character class, although the accented letters é and à are
alphanumeric too and should have been matched as well.
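
For comparison, here is a minimal sketch of the difference between ASCII-only and Unicode-aware character classes. It uses Python's `re` module purely as an illustration and says nothing about how PostgreSQL implements its classes; the variable names are made up for the example. With ASCII-only matching, `\w` leaves the accented letters behind, which is the same result as the queries above. With Unicode-aware matching, every letter is removed.

```python
import re

# "bébéàu" written with escapes so the accented letters are
# unambiguously the precomposed code points U+00E9 and U+00E0.
text = "b\u00e9b\u00e9\u00e0u"

# ASCII-only classification: \w only matches [a-zA-Z0-9_],
# so the accented letters survive the replacement.
ascii_only = re.sub(r"\w", "", text, flags=re.ASCII)

# Unicode-aware classification (the default for str patterns):
# accented letters count as word characters and are removed too.
unicode_aware = re.sub(r"\w", "", text)

print(repr(ascii_only))
print(repr(unicode_aware))
```

The first result still contains é, é and à, while the second is an empty string, which is what the report expects from [:alnum:], [:alpha:] and \w under a UTF-8 locale.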

#2 Bruce Momjian
bruce@momjian.us
In reply to: Jean-Baptiste Quenot (#1)
Re: BUG #4200: Regexp character classes not UTF8-compliant

I am not sure how to help you except to say that UTF8 is a character set
encoding, while en_US.UTF-8 is an encoding combined with a locale. My
guess is that if you use a *.UTF-8 locale that specifies the proper
localization language for your text, it would work.

http://www.postgresql.org/docs/8.2/static/locale.html


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +