regular expressions stranges

Started by Teodor Sigaev almost 19 years ago, 4 messages
#1 Teodor Sigaev
teodor@sigaev.ru
1 attachment(s)

Regexps behave differently on non-ASCII characters depending on the server
encoding (bug.sql contains a non-ASCII character):

% initdb -E KOI8-R --locale ru_RU.KOI8-R
% psql postgres < bug.sql
true
------
t
(1 row)

true | true
------+------
t | t
(1 row)
% initdb -E UTF8 --locale ru_RU.UTF-8
% psql postgres < bug.sql
true
------
f
(1 row)

true | true
------+------
f | t
(1 row)

As far as I can see, this is because isalpha (and the other is* functions),
tolower and toupper are used instead of the isw* and tow* functions. Is there
any reason to use them? If not, I can modify regc_locale.c along the lines of
the tsearch2 locale code.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Attachments:

bug.sql (text/plain)
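
The difference between the two runs can be illustrated with a small sketch. It assumes a classification helper that only consults isalpha() for values up to UCHAR_MAX, and it assumes the non-ASCII character is the Cyrillic letter "ж" (the contents of bug.sql are not shown in the thread). The name narrow_only_isalpha is hypothetical and is not taken from regc_locale.c.

```c
#include <ctype.h>
#include <limits.h>

/*
 * Hypothetical helper, not PostgreSQL source: classify a character value
 * using only the single-byte <ctype.h> test.
 */
static int
narrow_only_isalpha(unsigned int c)
{
	/* Values that fit in one byte are passed to isalpha(). */
	if (c <= UCHAR_MAX)
		return isalpha((unsigned char) c) != 0;

	/* Anything wider is never reported as a letter. */
	return 0;
}
```

In KOI8-R, "ж" is the single byte 0xD6, so it reaches isalpha() and the answer depends on the locale. As a Unicode code point it is U+0436, which is larger than UCHAR_MAX, so a helper of this shape returns 0 for it regardless of the locale.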
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#1)
Re: regular expressions stranges

Teodor Sigaev <teodor@sigaev.ru> writes:

> As far as I can see, this is because isalpha (and the other is* functions),
> tolower and toupper are used instead of the isw* and tow* functions. Is there
> any reason to use them? If not, I can modify regc_locale.c along the lines of
> the tsearch2 locale code.

The regex code is working with pg_wchar strings, which aren't
necessarily the same representation that the OS' wide-char functions
expect. If we could guarantee compatibility then the above plan
would make sense ...

regards, tom lane

#3 Teodor Sigaev
teodor@sigaev.ru
In reply to: Tom Lane (#2)
Re: regular expressions stranges

> The regex code is working with pg_wchar strings, which aren't
> necessarily the same representation that the OS' wide-char functions
> expect. If we could guarantee compatibility then the above plan
> would make sense ...

It seems to me that this is possible for the UTF8 encoding. So the isalpha()
wrapper could be defined as:

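static int
pg_wc_isalpha(pg_wchar c)
{
	/* Single-byte range: the plain <ctype.h> test applies. */
	if (c <= (pg_wchar) UCHAR_MAX)
		return isalpha((unsigned char) c);
#ifdef HAVE_WCSTOMBS
	/* Assumes a pg_wchar value means the same thing to iswalpha(). */
	if (GetDatabaseEncoding() == PG_UTF8)
		return iswalpha((wint_t) c);
#endif
	return 0;
}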

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Teodor Sigaev (#3)
Re: regular expressions stranges

Teodor Sigaev <teodor@sigaev.ru> writes:

>> The regex code is working with pg_wchar strings, which aren't
>> necessarily the same representation that the OS' wide-char functions
>> expect. If we could guarantee compatibility then the above plan
>> would make sense ...

> It seems to me that this is possible for the UTF8 encoding.

Why? The one thing that a wchar certainly is not is UTF8.
It might be that the <wctype.h> functions are expecting UTF16 or UTF32,
but we don't know which, and really we can hardly even be sure they're
expecting Unicode at all.

regards, tom lane
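
The distinction between UTF8 text and a wide-character value can be sketched as follows. This is only an illustration with hypothetical helper names (utf8_decode2, wchar_is_iso10646); it handles two-byte sequences only and does not validate its input. The one standard facility it relies on is the C99 macro __STDC_ISO_10646__, which an implementation defines when its wchar_t values are ISO/IEC 10646 (Unicode) code points.

```c
#include <stdint.h>

/*
 * Hypothetical helper: decode one two-byte UTF-8 sequence into its code
 * point. The caller must pass a valid two-byte sequence.
 */
static uint32_t
utf8_decode2(const unsigned char *s)
{
	/* 110xxxxx 10yyyyyy -> xxxxxyyyyyy */
	return ((uint32_t) (s[0] & 0x1F) << 6) | (uint32_t) (s[1] & 0x3F);
}

/*
 * Hypothetical helper: report whether this C implementation promises that
 * wchar_t values are ISO/IEC 10646 code points.
 */
static int
wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
	return 1;
#else
	return 0;
#endif
}
```

For the bytes 0xD0 0xB6 the decoder yields 0x0436, which is a different value from either byte. Whether that number is what iswalpha() expects depends on the platform, and where the macro is not defined the sketch cannot tell.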