[9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters

Started by David G. Johnston almost 15 years ago · 3 messages · general
#1David G. Johnston
david.g.johnston@gmail.com

PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit (EnterpriseDB
Install Executable)

CREATE DATABASE betatest
TEMPLATE template0
ENCODING 'UTF8'
LC_COLLATE 'C'
LC_CTYPE 'C';

[connect to database]

CREATE DOMAIN idcode AS text
NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$')
;

SELECT 'AAAAAéaaaaa'::idcode; // -> SQL Error: ERROR: value for domain
idcode violates check constraint "idcode_check" (note the accented “e”
between all the “A”s)

This is running just fine against a 9.0 install on the same machine. [\w]
is Unicode aware and server encoding is set (and confirmed via SHOW) to be
“UTF8”.

David J.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#1)
Re: [9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters

"David Johnston" <polobo@yahoo.com> writes:

PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit (EnterpriseDB
Install Executable)

CREATE DATABASE betatest
TEMPLATE template0
ENCODING 'UTF8'
LC_COLLATE 'C'
LC_CTYPE 'C';

CREATE DOMAIN idcode AS text
NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$')
;

SELECT 'AAAAAéaaaaa'::idcode; // -> SQL Error: ERROR: value for domain
idcode violates check constraint "idcode_check" (note the accented “e”
between all the “A”s)

AFAICS that's correct behavior. C locale should not think that é is
a letter.

This is running just fine against a 9.0 install on the same machine.

We made some strides towards getting locale-sensitive stuff to work as
it "should" in 9.1. In particular, platform-specific creative
interpretations of what C locale means shouldn't happen anymore ...

regards, tom lane
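
[Editor's sketch] The behavior described above can be checked directly from psql. The queries below are a minimal sketch, not part of the original message: the column aliases are made up for illustration, and the literals assume standard_conforming_strings is on (the 9.1 default), so that '\w' reaches the regex engine unchanged. Per the explanation above, a 9.1 database created with LC_CTYPE 'C' should report false for both tests.

```sql
-- Show the character-classification locale of the current database.
SHOW lc_ctype;

-- Does the regex engine treat é as a word character in this database?
SELECT 'é' ~ '^\w$' AS e_acute_is_word_char;

-- The same pattern used by the idcode domain, applied to the failing value.
SELECT 'AAAAAéaaaaa' ~* '^\w[-:\w]*$' AS passes_idcode_pattern;
```

Running the pattern as a plain SELECT separates the regex question from the domain's CHECK constraint, which makes it easier to compare two installs side by side.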

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#2)
Re: [9.1beta1] UTF-8/Regex Word-Character Definition excluding accented letters

Got it. Changing LC_CTYPE to "English_United States.1252" restores the
correct behavior.

Thanks.

David J.
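
[Editor's sketch] The change described in this reply might look roughly like the statements below. This is an assumption-laden sketch: the database name betatest2 is hypothetical, and keeping LC_COLLATE 'C' alongside the new LC_CTYPE is a guess, since the message does not say whether the collation was changed as well. The locale name is the Windows-style one quoted above; on other platforms the available locale names differ.

```sql
-- Hypothetical database name; only LC_CTYPE differs from the original test.
CREATE DATABASE betatest2
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'English_United States.1252';

-- After connecting to betatest2, recreate the domain and retry the cast.
CREATE DOMAIN idcode AS text
    NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$');

SELECT 'AAAAAéaaaaa'::idcode;
```

LC_CTYPE is fixed when a database is created, so the sketch creates a new database rather than altering the existing one.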

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, May 30, 2011 10:40 PM
To: David Johnston
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] [9.1beta1] UTF-8/Regex Word-Character Definition
excluding accented letters

"David Johnston" <polobo@yahoo.com> writes:

PostgreSQL 9.1beta1, compiled by Visual C++ build 1500, 64-bit
(EnterpriseDB Install Executable)

CREATE DATABASE betatest
TEMPLATE template0
ENCODING 'UTF8'
LC_COLLATE 'C'
LC_CTYPE 'C';

CREATE DOMAIN idcode AS text
NOT NULL CHECK (VALUE ~* '^\w[-:\w]*$') ;

SELECT 'AAAAAéaaaaa'::idcode; // -> SQL Error: ERROR: value for
domain idcode violates check constraint "idcode_check" (note the
accented “e” between all the “A”s)

AFAICS that's correct behavior. C locale should not think that é is a
letter.

This is running just fine against a 9.0 install on the same machine.

We made some strides towards getting locale-sensitive stuff to work as it
"should" in 9.1. In particular, platform-specific creative
interpretations of what C locale means shouldn't happen anymore ...

regards, tom lane