BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)

Started by Albert Cieszkowski over 14 years ago · 6 messages · bugs

albert.cieszkowski@cc.com.pl

over 14 years ago

The following bug has been logged on the website:

Bug reference: 6457
Logged by: Albert Cieszkowski
Email address: albert.cieszkowski@cc.com.pl
PostgreSQL version: 9.0.6
Operating system: CentOS 5.x
Description:

OS, base and client encoding UTF-8:

peimp=> select 'Świnoujście' ~* '\mŚwinoujście\M';
?column?
----------
f
(1 row)

peimp=> select 'Świnoujście' ~* '\AŚwinoujście\Z';
?column?
----------
t
(1 row)

but:

peimp=> select 'Mróz' ~* '\mmróZ\M';
?column?
----------
t
(1 row)

peimp=> select 'Mróz' ~* '\AmróZ\Z';
?column?
----------
t
(1 row)

I believe it is connected with bugs #5766 and #3433.

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Albert Cieszkowski (#1)

Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)

albert.cieszkowski@cc.com.pl writes:

> OS, base and client encoding UTF-8:

What are your lc_collate/lc_ctype settings?

regards, tom lane

Albert Cieszkowski

albert.cieszkowski@cc.com.pl

over 14 years ago

In reply to: Tom Lane (#2)

Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)

Tom Lane

tgl@sss.pgh.pa.us

over 14 years ago

In reply to: Albert Cieszkowski (#1)

Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)

albert.cieszkowski@cc.com.pl writes:

> peimp=> select 'Świnoujście' ~* '\mŚwinoujście\M';
> ?column?
> ----------
> f
> (1 row)

Oh, I see the reason for this: the code in cclass() in regc_locale.c
doesn't go further up than U+00FF, so no codes above that will be
thought to be letters (or members of any other character class).
Clearly we need to go further when we are dealing with UTF8.
I'm not sure what a sane limit would be though.
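
To make the effect of that limit concrete, here is a minimal sketch in Python. It is not PostgreSQL's cclass() code; the function names and the `max_code` parameter are hypothetical, and Python's own Unicode tables stand in for the locale's character classification. It only models a word-character test that refuses to look above a given code point, and checks whether the first and last characters of each test string would still count as word characters.

```python
# Hypothetical model of a word-character test capped at a maximum code point.
# This is NOT PostgreSQL's cclass(); it only mimics the limit described above.

def is_word_char(ch: str, max_code: int = 0xFF) -> bool:
    # A character counts as a word character only if it is at or below
    # max_code and Python classifies it as alphanumeric (or underscore).
    return ord(ch) <= max_code and (ch.isalnum() or ch == "_")


def starts_and_ends_on_word_chars(text: str, max_code: int = 0xFF) -> bool:
    # Rough stand-in for the \m ... \M anchors matching the whole string:
    # both the first and the last character must be word characters.
    return is_word_char(text[0], max_code) and is_word_char(text[-1], max_code)


for word in ("Świnoujście", "Mróz"):
    # List the characters that fall above U+00FF, with their code points.
    above = [f"{c} U+{ord(c):04X}" for c in word if ord(c) > 0xFF]
    print(word, above, starts_and_ends_on_word_chars(word))
```

Under these assumptions, 'Świnoujście' begins with Ś (U+015A), which lies above the cap, while every character of 'Mróz' (including ó, U+00F3) lies at or below U+00FF. That is consistent with the two results in the report.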

(It would be nice if there were a more efficient way to get this
information than laboriously iterating through all the possible
character codes. It doesn't look like we're even trying to cache
the results, ick.)
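
As an illustration of the caching idea only, the following Python sketch builds each character class once by iterating over code points and then reuses the result. The function name `char_class`, the class names, and the default `limit` are assumptions made for this example; none of this reflects how regc_locale.c is or was written.

```python
from functools import lru_cache

# Hypothetical class names mapped to Python's built-in character tests.
_TESTS = {"alpha": str.isalpha, "digit": str.isdigit, "space": str.isspace}


@lru_cache(maxsize=None)
def char_class(name: str, limit: int = 0xFFFF) -> frozenset:
    # The expensive part: walk every code point up to limit exactly once
    # per (name, limit) pair. Later calls return the cached frozenset.
    test = _TESTS[name]
    return frozenset(cp for cp in range(limit + 1) if test(chr(cp)))


# First call scans the range; the second is served from the cache.
letters = char_class("alpha")
letters_again = char_class("alpha")
print(letters is letters_again, ord("Ś") in letters)
```

The trade-off in this sketch is memory for time: the scan cost is paid once per class, at the price of keeping the full membership set around.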

regards, tom lane

Duncan Rance

duncan@dunquino.com

over 14 years ago

In reply to: Tom Lane (#4)

Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)

On 14 Feb 2012, at 18:28, Tom Lane wrote:

> Oh, I see the reason for this: the code in cclass() in regc_locale.c
> doesn't go further up than U+00FF, so no codes above that will be
> thought to be letters (or members of any other character class).
> Clearly we need to go further when we are dealing with UTF8.
> I'm not sure what a sane limit would be though.

The Basic Multilingual Plane goes up to U+FFFF:

https://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Planes
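
For a rough sense of what different limits would cover, the sketch below counts the code points Python classifies as alphabetic up to a few candidate caps. The helper `count_alpha` is hypothetical, and the counts depend on the Unicode tables shipped with the Python version in use, not on any libc locale, so treat the numbers as indicative only. The middle cap, U+07FF, is included as the last code point that UTF-8 encodes in two bytes.

```python
def count_alpha(limit: int) -> int:
    # Count code points in [0, limit] that Python treats as alphabetic.
    # Iterating the whole range is exactly the "laborious" approach
    # discussed in the thread; the BMP is 65,536 code points.
    return sum(1 for cp in range(limit + 1) if chr(cp).isalpha())


for limit in (0xFF, 0x7FF, 0xFFFF):
    print(f"up to U+{limit:04X}: {count_alpha(limit)} alphabetic code points")
```

Raising the cap to the end of the BMP brings in the Polish letters from the report along with most other scripts in common use, at the cost of a much longer scan per character class.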

Duncan Rance

postgres@dunquino.com

over 14 years ago

In reply to: Tom Lane (#4)

Re: BUG #6457: Regexp not processing word (with special characters on ends) correctly (UTF-8)

On 14 Feb 2012, at 18:28, Tom Lane wrote:

> Oh, I see the reason for this: the code in cclass() in regc_locale.c
> doesn't go further up than U+00FF, so no codes above that will be
> thought to be letters (or members of any other character class).
> Clearly we need to go further when we are dealing with UTF8.
> I'm not sure what a sane limit would be though.

The Basic Multilingual Plane goes up to U+FFFF:

https://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Planes