UTF8 regexp and char classes still does not work

Started by Sergey Burladyan over 15 years ago, 3 messages
#1 Sergey Burladyan
eshkinkot@gmail.com

I see this in the 9.0 release notes:
- Support locale-specific regular expression processing with UTF-8
server encoding (Tom Lane)
Locale-specific regular expression functionality includes
case-insensitive matching and locale-specific character classes.

But character classes still do not work. Example (git REL9_0_STABLE c767c3bd):
select version();
version
------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Debian 4.4.4-8) 4.4.5 20100728 (prerelease), 64-bit

--- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column? 
----------+----------+----------
 t        | f        | t

All three should be true, as they are in a KOI8-R database:

create database koi8 template template0 encoding 'koi8r' lc_collate 'ru_RU.KOI8-R' lc_ctype 'ru_RU.KOI8-R';
\c koi8
set client_encoding TO utf8;
select E'\326' ~* E'\366', E'\326' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column?
----------+----------+----------
 t        | t        | t

As far as I can see, Tom's patch 0d323425 changed only functions like pg_wc_isalpha, but
pg_wc_isalpha is called from this function:

    static struct cvec *
    cclass(struct vars * v,      /* context */
           const chr *startp,    /* where the name starts */
           const chr *endp,      /* just past the end of the name */
           int cases)            /* case-independent? */

The function has the comment "For the moment, assume that only char codes < 256 can be in these classes", and it calls pg_wc_isalpha like this:

    for (i = 0; i <= UCHAR_MAX; i++)
    {
        if (pg_wc_isalpha((chr) i))
            addchr(cv, (chr) i);
    }

UCHAR_MAX is 255

I do not fully understand the regular expression algorithm, but I think the cclass function also needs a fix.
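
To make the arithmetic behind the failing case concrete, here is a minimal sketch in C. It is not code from the PostgreSQL tree: decode_utf8_2byte and visited_by_uchar_loop are hypothetical helpers, and the second one only mirrors the bound of the loop quoted above.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one 2-byte UTF-8 sequence
 * (110xxxxx 10xxxxxx) into a Unicode code point.
 * Only the 2-byte form is handled, which is enough for Cyrillic.
 */
uint32_t
decode_utf8_2byte(const unsigned char *s)
{
    assert((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80);
    return ((uint32_t) (s[0] & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
}

/*
 * Hypothetical helper: would a loop of the form
 * "for (i = 0; i <= UCHAR_MAX; i++)" ever reach this code point?
 */
int
visited_by_uchar_loop(uint32_t cp)
{
    return cp <= UCHAR_MAX;
}
```

The bytes \320\266 (0xD0 0xB6) decode to U+0436, which is greater than 255, so a loop bounded by UCHAR_MAX never hands that character to pg_wc_isalpha. The single KOI8-R byte \326 (0xD6) is within the bound, which is consistent with the KOI8-R query returning true.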

--
Sergey Burladyan

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sergey Burladyan (#1)
Re: UTF8 regexp and char classes still does not work

Sergey Burladyan <eshkinkot@gmail.com> writes:

As far as I can see, Tom's patch 0d323425 changed only functions like pg_wc_isalpha, but
pg_wc_isalpha is called from this function:

    static struct cvec *
    cclass(struct vars * v,      /* context */
           const chr *startp,    /* where the name starts */
           const chr *endp,      /* just past the end of the name */
           int cases)            /* case-independent? */

The function has the comment "For the moment, assume that only char codes < 256 can be in these classes", and it calls pg_wc_isalpha like this:

    for (i = 0; i <= UCHAR_MAX; i++)
    {
        if (pg_wc_isalpha((chr) i))
            addchr(cv, (chr) i);
    }

UCHAR_MAX is 255

Hmm, you're right. I only tested that on Latin1 characters, for which
it does work because those have Unicode points below 256. I'm not
sure of a reasonable solution for the general case --- we certainly
don't want this function iterating up to 2^21 or thereabouts.

Your test case seems to be using KOI8 encoding, though, which doesn't
have anything to do with UTF8 behavior.

regards, tom lane

#3 Sergey Burladyan
eshkinkot@gmail.com
In reply to: Tom Lane (#2)
Re: UTF8 regexp and char classes still does not work

Tom Lane <tgl@sss.pgh.pa.us> writes:

Hmm, you're right. I only tested that on Latin1 characters, for which
it does work because those have Unicode points below 256. I'm not
sure of a reasonable solution for the general case --- we certainly
don't want this function iterating up to 2^21 or thereabouts.

Yes, I understand this problem. How does Perl do this? Maybe this Unicode table can
be precomputed, or linked into the postgres binary from an external source?
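
As a rough illustration of the precomputed-table idea, here is a minimal sketch in C. It is an assumption about what such a table could look like, not how PostgreSQL or Perl actually does it: alpha_ranges is a tiny hand-written excerpt (ASCII letters plus U+0410..U+044F), and cp_range and sketch_is_alpha are hypothetical names. A sorted range table is searched in logarithmic time, so nothing has to iterate over the whole code space.

```c
#include <stddef.h>
#include <stdint.h>

/* One inclusive range of code points. */
typedef struct
{
    uint32_t first;
    uint32_t last;
} cp_range;

/*
 * Illustrative excerpt only: sorted, non-overlapping ranges.
 * A real table would be generated from Unicode data and be far larger.
 */
static const cp_range alpha_ranges[] = {
    {0x0041, 0x005A},           /* A-Z */
    {0x0061, 0x007A},           /* a-z */
    {0x0410, 0x044F},           /* Cyrillic capital A through small YA */
};

/* Binary search over the range table; returns 1 if cp is in some range. */
int
sketch_is_alpha(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof(alpha_ranges) / sizeof(alpha_ranges[0]);

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < alpha_ranges[mid].first)
            hi = mid;
        else if (cp > alpha_ranges[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

With this excerpt, sketch_is_alpha(0x0436) (CYRILLIC SMALL LETTER ZHE) and sketch_is_alpha('g') both return 1, while a digit such as '1' returns 0. Whether the table is compiled in or loaded from an external source is left open here, as it is in the question above.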

Your test case seems to be using KOI8 encoding, though, which doesn't
have anything to do with UTF8 behavior.

It's just an example of the expected result. See the first test: it is UTF8, two bytes per character:

--- CYRILLIC SMALL LETTER ZHE ~* CYRILLIC CAPITAL LETTER ZHE
select E'\320\266' ~* E'\320\226', E'\320\266' ~ '[[:alpha:]]+', 'g' ~ '[[:alpha:]]+';
 ?column? | ?column? | ?column?
----------+----------+----------
 t        | f        | t

--
Sergey Burladyan