Regexps vs. locale

Started by Andrew Gierth, about 17 years ago, 5 messages

Andrew Gierth

andrew@tao11.riddles.org.uk

about 17 years ago

This came up on irc:

postgres=# show lc_ctype;
lc_ctype
-------------
fr_FR.UTF-8

postgres=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)

postgres=# select E'\303\201' ILIKE E'\303\241';
?column?
----------
t
(1 row)

postgres=# select E'\303\201' ~* E'\303\241';
?column?
----------
f
(1 row)

Obviously, this happens because the locale support functions in
backend/regex/regc_locale.c are (presumably intentionally) crippled so
as not to support non-ascii chars, despite all the code there using
wide chars for everything otherwise.

Why is this? It does not appear to be a documented restriction.

--
Andrew (irc:RhodiumToad)
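
For readers following the transcript: the two literals are the UTF-8 encodings of U+00C1 (Á) and U+00E1 (á), which is why a case-insensitive match would be expected to succeed. A minimal sketch of the decoding, using a hypothetical helper `utf8_decode_one` that is not part of PostgreSQL; it checks lead and continuation bytes only and does not reject overlong forms or surrogates:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): decode one UTF-8 sequence of
 * 1 to 4 bytes starting at s. On success, stores the code point in *cp
 * and returns the number of bytes consumed. Returns 0 on truncated or
 * malformed input.
 */
static size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);

    if (len == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t need;
    uint32_t v;

    if (b0 < 0x80)                { need = 1; v = b0; }
    else if ((b0 & 0xE0) == 0xC0) { need = 2; v = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; v = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; v = b0 & 0x07; }
    else
        return 0;               /* stray continuation byte or invalid lead */

    if (len < need)
        return 0;               /* sequence is cut short */

    for (size_t i = 1; i < need; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        v = (v << 6) | (uint32_t) (s[i] & 0x3F);
    }

    *cp = v;
    return need;
}
```

Applied to the bytes `\303\201` this yields code point 0xC1, and `\303\241` yields 0xE1.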

Tom Lane

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Andrew Gierth (#1)

Re: Regexps vs. locale

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

> Obviously, this happens because the locale support functions in
> backend/regex/regc_locale.c are (presumably intentionally) crippled so
> as not to support non-ascii chars, despite all the code there using
> wide chars for everything otherwise.

It's not so much intentional as that no one has gotten around to making
it work. The difficulty is that the wide-char codes we are using might
not match what the <wctype.h> functions expect, and it's unclear what
we could do to fix that.

regards, tom lane
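
The restriction under discussion can be pictured roughly as follows. This is a sketch, not the code in regc_locale.c: the type `sketch_wchar` and both function names are hypothetical, and limiting the fold to values below 0x80 is an assumption chosen to make the effect easy to see. Wide-char values outside ASCII pass through unchanged, so Á (0xC1) and á (0xE1) never compare equal.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical stand-in for the backend's wide-char type (assumption). */
typedef unsigned int sketch_wchar;

/*
 * Fold to lower case only when the value is plain ASCII, where the
 * <ctype.h> function is known to be safe. Anything else is returned
 * as-is, because its numeric value may not mean the same thing to the
 * C library's classification functions.
 */
static sketch_wchar sketch_wc_tolower(sketch_wchar c)
{
    if (c < 0x80)
        return (sketch_wchar) tolower((int) c);
    return c;
}

/* Case-insensitive comparison of two wide chars built on the fold above. */
static int sketch_wc_equal_icase(sketch_wchar a, sketch_wchar b)
{
    return sketch_wc_tolower(a) == sketch_wc_tolower(b);
}
```

With this fold, 'A' and 'a' compare equal while 0xC1 and 0xE1 do not, which matches the `~*` result shown in the original report.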

Andrew Gierth

andrew@tao11.riddles.org.uk

about 17 years ago

In reply to: Tom Lane (#2)

Re: Regexps vs. locale

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

> Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>
>> Obviously, this happens because the locale support functions in
>> backend/regex/regc_locale.c are (presumably intentionally)
>> crippled so as not to support non-ascii chars, despite all the
>> code there using wide chars for everything otherwise.

Tom> It's not so much intentional as that no one has gotten around to
Tom> making it work. The difficulty is that the wide-char codes we
Tom> are using might not match what the <wctype.h> functions expect,
Tom> and it's unclear what we could do to fix that.

Couldn't we follow the example of lower(), and convert the string to
wchar_t using mbstowcs (rather than pg_wchar_t and pg_mb2wchar)?

This obviously requires that we have a matching lc_ctype for the
encoding, but we insist on that now anyway, no?

--
Andrew.
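
A rough sketch of the approach suggested here, using only standard C library calls (`mbstowcs`, `towlower`, `wmemcmp`). The helper names `sketch_lower_wide` and `sketch_equal_icase` and the fixed 64-element buffers are assumptions for illustration; this is not how lower() or the regex code is implemented. Results for non-ASCII input depend on the process's LC_CTYPE setting matching the string's encoding.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: convert a multibyte string to wchar_t under the
 * current LC_CTYPE locale and lower-case each character with towlower().
 * Returns the number of wide chars written (excluding the terminator),
 * or (size_t) -1 if the input is invalid in this locale or does not fit.
 */
static size_t sketch_lower_wide(const char *mb, wchar_t *out, size_t outlen)
{
    assert(mb != NULL && out != NULL);

    if (outlen == 0)
        return (size_t) -1;

    size_t n = mbstowcs(out, mb, outlen);

    /* n == outlen would mean the result was not null-terminated. */
    if (n == (size_t) -1 || n >= outlen)
        return (size_t) -1;

    for (size_t i = 0; i < n; i++)
        out[i] = (wchar_t) towlower((wint_t) out[i]);
    out[n] = L'\0';
    return n;
}

/*
 * Returns 1 if the two strings are equal ignoring case, 0 if not,
 * and -1 if either string could not be converted.
 */
static int sketch_equal_icase(const char *a, const char *b)
{
    wchar_t wa[64];
    wchar_t wb[64];
    size_t na = sketch_lower_wide(a, wa, 64);
    size_t nb = sketch_lower_wide(b, wb, 64);

    if (na == (size_t) -1 || nb == (size_t) -1)
        return -1;
    return na == nb && wmemcmp(wa, wb, na) == 0;
}
```

After a successful `setlocale(LC_CTYPE, "fr_FR.UTF-8")`, calling `sketch_equal_icase("\303\201", "\303\241")` would be expected to report a match on platforms whose locale data pairs Á with á; in the default "C" locale the conversion of those bytes may fail instead.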

Tom Lane

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Andrew Gierth (#3)

Re: Regexps vs. locale

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
> Tom> It's not so much intentional as that no one has gotten around to
> Tom> making it work. The difficulty is that the wide-char codes we
> Tom> are using might not match what the <wctype.h> functions expect,
> Tom> and it's unclear what we could do to fix that.
>
> Couldn't we follow the example of lower(), and convert the string to
> wchar_t using mbstowcs (rather than pg_wchar_t and pg_mb2wchar)?

Possibly. I think we did not have the char2wchar() infrastructure
when the regexp stuff was last gone over, so it might be more practical
to do that now.

regards, tom lane

Bruce Momjian

bruce@momjian.us

about 17 years ago

In reply to: Andrew Gierth (#1)

Re: Regexps vs. locale

Added to TODO:

Add ability to use case-insensitive regular expressions on multi-byte
characters

ILIKE already works with multi-byte characters

* http://archives.postgresql.org/pgsql-hackers/2008-12/msg00433.php

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +