BUG #3433: regexp \m and \M don't work for cyrillic

Started by Andriy Rysin, almost 19 years ago, 3 messages, bugs

arysin@gmail.com

almost 19 years ago

The following bug has been logged online:

Bug reference: 3433
Logged by: Andriy Rysin
Email address: arysin@gmail.com
PostgreSQL version: 8.2.4
Operating system: Linux
Description: regexp \m and \M don't work for cyrillic
Details:

psql krym
krym=> \encoding
UTF8
krym=> create table test (txt varchar);
CREATE TABLE
krym=> insert into test values ('latin');
INSERT 0 1
krym=> insert into test values ('кирилиця');
INSERT 0 1
krym=> select * from test;
txt
----------
latin
кирилиця
(2 rows)

krym=> select * from test where txt ~* E'\\mla';
txt
-------
latin
(1 row)

krym=> select * from test where txt ~* E'\\mкир';
txt
-----
(0 rows)

The regular-expression escapes \m and \M (beginning of word and end of word)
work for Latin characters but not for Cyrillic.

Tom Lane

tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: Andriy Rysin (#1)

Re: BUG #3433: regexp \m and \M don't work for cyrillic

"Andriy Rysin" <arysin@gmail.com> writes:

The regular-expression escapes \m and \M (beginning of word and end of word)
work for Latin characters but not for Cyrillic.

Sorry, the locale-specific regex features only work on single-byte
characters at the moment. In any case you'd need to be using a Russian
locale (maybe you are, but you didn't say). I'd expect this feature
to work with Cyrillic letters in ru_RU locale + KOI8 encoding, but not
elsewhere.

regards, tom lane
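
To illustrate the single-byte point in the reply above: the Cyrillic letters in the test row take two bytes each in UTF-8 but one byte each in KOI8-R. The following is only a minimal sketch, written in Python so it runs without a database; it is not part of the original thread, and `byte_lengths` is a hypothetical helper name.

```python
def byte_lengths(text):
    """Return the encoded size of `text` in UTF-8 and in KOI8-R."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "koi8-r": len(text.encode("koi8_r")),
    }


# The two rows inserted into the test table in the bug report.
latin = byte_lengths("latin")
cyrillic = byte_lengths("кирилиця")

print(latin)
print(cyrillic)
```

The Latin row has the same size in both encodings, while the Cyrillic row doubles in size under UTF-8, which is the multi-byte case the reply says the locale-specific regex features do not handle.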

Andriy Rysin

arysin@gmail.com

almost 19 years ago

In reply to: Tom Lane (#2)

Re: BUG #3433: regexp \m and \M don't work for cyrillic

2007/7/7, Tom Lane <tgl@sss.pgh.pa.us>:

"Andriy Rysin" <arysin@gmail.com> writes:

The regular-expression escapes \m and \M (beginning of word and end of word)
work for Latin characters but not for Cyrillic.

Sorry, the locale-specific regex features only work on single-byte
characters at the moment. In any case you'd need to be using a Russian
locale (maybe you are, but you didn't say). I'd expect this feature
to work with Cyrillic letters in ru_RU locale + KOI8 encoding, but not
elsewhere.

Hi Tom,

I was using the en_US.UTF-8 locale, but you're right: even if I create my
cluster with uk_UA.UTF-8, \m still does not work for Cyrillic, while it
continues to work for Latin characters. I can't work with single-byte
encodings, as I have some Unicode symbols in my project and everything else
is in Unicode, so converting data back and forth would be quite a drag.

So currently my only workaround for \m is to use (^|[^[:alpha:]]). However,
[:alpha:] matches only Latin characters even in uk_UA.UTF-8, so I have to
list the letters explicitly, e.g. (^|[^а-яієїґ]). That may be good enough if
I don't need to separate Russian and Ukrainian, but if I do, I'd have to be
even more specific for pure Ukrainian: (^|[^а-ьюяієїґ]) (assuming I remember
about the case-sensitivity of my regexp and assuming I know the UTF-8 codes).
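
The logic of this workaround can be sketched outside the database. The example below is a hedged illustration in Python, not PostgreSQL code: `starts_word` and `WORD_CHARS` are hypothetical names, the character class is the lowercase one quoted above, and Python's `re` module is already Unicode-aware, so this only demonstrates the pattern itself and says nothing about how PostgreSQL 8.2.4 evaluates it.

```python
import re

# Letters treated as word characters. This is an assumption taken from the
# lowercase-only character class quoted in the message above.
WORD_CHARS = "а-яієїґ"


def starts_word(prefix, text):
    """Return True if `prefix` occurs in `text` at the start of a word."""
    # (^|[^...]) stands in for \m: either the start of the string or a
    # character that is not one of the listed letters.
    pattern = "(^|[^" + WORD_CHARS + "])" + re.escape(prefix.lower())
    # Lowercase the input so the lowercase-only class is sufficient.
    return re.search(pattern, text.lower()) is not None


print(starts_word("кир", "кирилиця"))
print(starts_word("кир", "це кирилиця"))
print(starts_word("кир", "накир"))
```

Because the class lists only Cyrillic letters, any other character (including a Latin letter) counts as a word separator here, which is the same limitation the explicit class has in the message above.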

Though I agree I missed the fact that \m is locale-specific (as it has to
know the proper word and non-word characters for the locale) and thus can't
work for all locales even when using Unicode, and that my original test in
the en_US locale was not valid, it would still be nice to have two things:
1) multibyte support for locale-specific regexp features like \m and [:alpha:]
2) the ability to tell the regexp which LC_CTYPE to use for a specific
invocation, at least at the SQL-statement level. This would be extremely
useful for multilingual projects, e.g. dictionaries (which is the type of my
project, BTW); hopefully regexps are not too tightly connected to the
LC_CTYPE of the cluster.
I understand, though, that these two are not just bug fixes and will require
some effort to implement.

Thanks,
Andriy