Regexps vs. locale
This came up on IRC:
postgres=# show lc_ctype;
lc_ctype
-------------
fr_FR.UTF-8
postgres=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)
postgres=# select E'\303\201' ILIKE E'\303\241';
?column?
----------
t
(1 row)
postgres=# select E'\303\201' ~* E'\303\241';
?column?
----------
f
(1 row)
Obviously, this happens because the locale support functions in
backend/regex/regc_locale.c are (presumably intentionally) crippled so
as not to support non-ascii chars, despite all the code there using
wide chars for everything otherwise.
Why is this? It does not appear to be a documented restriction.
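
To make the two results concrete: the octal escapes in the queries above are the UTF-8 encodings of Á (U+00C1) and á (U+00E1). The following Python sketch is only an illustration, not PostgreSQL code, and the helper names are hypothetical. It contrasts ASCII-only case folding, which is roughly the behavior described for the regex code, with Unicode-aware folding.

```python
# Illustrative sketch only; helper names are hypothetical, not PostgreSQL APIs.

# The same byte sequences as in the psql session (octal escapes).
UPPER = b"\303\201".decode("utf-8")  # U+00C1, LATIN CAPITAL LETTER A WITH ACUTE
LOWER = b"\303\241".decode("utf-8")  # U+00E1, LATIN SMALL LETTER A WITH ACUTE


def ascii_only_lower(ch: str) -> str:
    """Lowercase A-Z only; leave every other character untouched."""
    return chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch


def ascii_fold_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison that only understands ASCII letters."""
    return [ascii_only_lower(c) for c in a] == [ascii_only_lower(c) for c in b]


def unicode_fold_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison using Unicode-aware lowercasing."""
    return a.lower() == b.lower()


print(ascii_fold_equal(UPPER, LOWER))    # ASCII-only folding, analogous to ~*
print(unicode_fold_equal(UPPER, LOWER))  # Unicode-aware folding, analogous to ILIKE
```

The ASCII-only comparison cannot relate the two characters, while the Unicode-aware one can, which mirrors the difference between the ~* and ILIKE results.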
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.
It's not so much intentional as that no one has gotten around to making
it work. The difficulty is that the wide-char codes we are using might
not match what the <wctype.h> functions expect, and it's unclear what
we could do to fix that.
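
One way to picture the mismatch: suppose, purely as an assumption for illustration, that the server encoding is the single-byte KOI8-R and that the internal wide-char code is just the raw byte value. A <wctype.h> implementation whose wchar_t holds Unicode code points (common, but platform-dependent) would then interpret that code as a different character. The Python sketch below imitates this; it is not PostgreSQL code.

```python
# Illustrative sketch; the "wide-char code = raw byte" scheme is an assumption.

raw = 0xE1  # one byte of KOI8-R text

# What the byte really means in KOI8-R: CYRILLIC CAPITAL LETTER A (U+0410).
actual = bytes([raw]).decode("koi8_r")

# What a Unicode-based case function sees if handed the raw code 0xE1:
# LATIN SMALL LETTER A WITH ACUTE (U+00E1), which is already lowercase.
misread = chr(raw)

# For each interpretation, report the code point and whether lowercasing changes it.
print(hex(ord(actual)), actual.lower() != actual)
print(hex(ord(misread)), misread.lower() != misread)
```

Under these assumptions the capital letter would be treated as if it were already lowercase, so a case-insensitive match would silently fail.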
regards, tom lane
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>> Obviously, this happens because the locale support functions in
>> backend/regex/regc_locale.c are (presumably intentionally)
>> crippled so as not to support non-ascii chars, despite all the
>> code there using wide chars for everything otherwise.
Tom> It's not so much intentional as that no one has gotten around to
Tom> making it work. The difficulty is that the wide-char codes we
Tom> are using might not match what the <wctype.h> functions expect,
Tom> and it's unclear what we could do to fix that.
Couldn't we follow the example of lower(), and convert the string to
wchar_t using mbstowcs (rather than pg_wchar_t and pg_mb2wchar)?
This obviously requires that we have a matching lc_ctype for the
encoding, but we insist on that now anyway, no?
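
The idea can be imitated outside the backend. In the Python sketch below, decoding the bytes with the database encoding plays the role that mbstowcs() would play in C under a matching lc_ctype. The function names are hypothetical, and Python's re module stands in for the regex engine; this is an analogy, not the proposed patch.

```python
import re


def ci_match_bytes(pattern: bytes, subject: bytes) -> bool:
    """Case-insensitive match on raw bytes: only ASCII letters are folded."""
    return re.search(pattern, subject, re.IGNORECASE) is not None


def ci_match_decoded(pattern: bytes, subject: bytes, encoding: str = "utf-8") -> bool:
    """Decode to characters first, then match with Unicode-aware case folding."""
    return (
        re.search(pattern.decode(encoding), subject.decode(encoding), re.IGNORECASE)
        is not None
    )


subject = b"\303\201"  # the subject from the psql session
pattern = b"\303\241"  # the pattern from the psql session

print(ci_match_bytes(pattern, subject))    # byte-level matching
print(ci_match_decoded(pattern, subject))  # decode first, then match
```

Matching on decoded characters only works if the bytes really are in the stated encoding, which corresponds to the requirement that lc_ctype match the database encoding.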
--
Andrew.
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> It's not so much intentional as that no one has gotten around to
> Tom> making it work. The difficulty is that the wide-char codes we
> Tom> are using might not match what the <wctype.h> functions expect,
> Tom> and it's unclear what we could do to fix that.
> Couldn't we follow the example of lower(), and convert the string to
> wchar_t using mbstowcs (rather than pg_wchar_t and pg_mb2wchar)?
Possibly. I think we did not have the char2wchar() infrastructure
when the regexp stuff was last gone over, so it might be more practical
to do that now.
regards, tom lane
Added to TODO:
Add ability to use case-insensitive regular expressions on multi-byte
characters
ILIKE already works with multi-byte characters
* http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +