russian case-insensitive regexp search not working

Started by alexander lunyovalmost 19 years ago13 messageshackersgeneral
Jump to latest
#1alexander lunyov
lan@startatom.ru
hackersgeneral

Hello, friends.

OS FreeBSD 6.2, Postgresql 8.2.4

Postgresql does not search case-insensitive russian regexp unicode
patterns. Postgres is working under user pgsql with login class (in
/etc/login.conf):

postgres:\
:lang=ru_RU.UTF-8:\
:setenv=LC_COLLATE=C:\
:tc=default:

In .profile of postgres user:

LANG=ru_RU.UTF-8
export LANG
CHARSET=UTF-8
export CHARSET

Then, database:

db=> \encoding
UTF8

Case insensitive search for low-cased pattern show nothing:

db=> select street from people where street ~* 'зелен';
street
--------
(0 rows)

While there are records, but they are with first capital character:

db=> select street from people where street ~* 'Зелен';
street
----------------
Зеленая
Зеленоградская
(2 rows)

Search for english values work fine, russian values not. Why could it be?

--
alexander lunyov
lan@startatom.ru

#2alexander lunyov
lan@startatom.ru
In reply to: alexander lunyov (#1)
hackersgeneral
Re: russian case-insensitive regexp search not working

No, ILIKE also does case-sensitive search.

I found this bug report:
http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php

Is it about this issue? And will it be fixed someday?

Sergey Levchenko wrote:

Just use: select street from people where street ILIKE 'зелен%';

select with case-insensitive regexp does no work right now!

On 09/07/07, alexander lunyov <lan@startatom.ru> wrote:

Hello, friends.

OS FreeBSD 6.2, Postgresql 8.2.4

Postgresql does not search case-insensitive russian regexp unicode
patterns. Postgres is working under user pgsql with login class (in
/etc/login.conf):

postgres:\
:lang=ru_RU.UTF-8:\
:setenv=LC_COLLATE=C:\
:tc=default:

In .profile of postgres user:

LANG=ru_RU.UTF-8
export LANG
CHARSET=UTF-8
export CHARSET

Then, database:

db=> \encoding
UTF8

Case insensitive search for low-cased pattern show nothing:

db=> select street from people where street ~* 'зелен';
street
--------
(0 rows)

While there are records, but they are with first capital character:

db=> select street from people where street ~* 'Зелен';
street
----------------
Зеленая
Зеленоградская
(2 rows)

Search for english values work fine, russian values not. Why could it be?

--
alexander lunyov
lan@startatom.ru

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

--
alexander lunyov
lan@startatom.ru

#3Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: alexander lunyov (#2)
hackersgeneral
Re: russian case-insensitive regexp search not working

On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote:

I found this bug report:
http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php

Is it about this issue?

Yes.

And will it be fixed someday?

Likely. In the meantime lower() can come to the rescue.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Karsten Hilbert (#3)
hackersgeneral
Re: russian case-insensitive regexp search not working

Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:

On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote:

I found this bug report:
http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php

Is it about this issue?

Yes.

And will it be fixed someday?

Likely. In the meantime lower() can come to the rescue.

It's not nearly as easy to fix as the reporter of that bug imagines,
because we don't know the relationship between pg_wchar representation
and the wchar_t representation assumed by the <wctype.h> functions.

In the short term it should work if you can use a single-byte encoding
(koi8r I suppose) and a locale setting compatible with that.

regards, tom lane

#5Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Tom Lane (#4)
hackersgeneral
Re: russian case-insensitive regexp search not working

On Mon, Jul 09, 2007 at 09:50:42AM -0400, Tom Lane wrote:

On Mon, Jul 09, 2007 at 04:00:01PM +0400, alexander lunyov wrote:

I found this bug report:
http://archives.postgresql.org/pgsql-bugs/2006-09/msg00065.php

Is it about this issue?

Yes.

And will it be fixed someday?

Likely. In the meantime lower() can come to the rescue.

It's not nearly as easy to fix as the reporter of that bug imagines,
because we don't know the relationship between pg_wchar representation
and the wchar_t representation assumed by the <wctype.h> functions.

Just to clarify: lower() on both sides of a comparison
should still work as expected on multibyte encodings ? It's
been suggested here before.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#6alexander lunyov
lan@startatom.ru
In reply to: Karsten Hilbert (#5)
hackersgeneral
Re: russian case-insensitive regexp search not working

Karsten Hilbert wrote:

Just to clarify: lower() on both sides of a comparison
should still work as expected on multibyte encodings ? It's
been suggested here before.

lower() on both sides also does not working in my case, it still search
for case-sensitive data. String in this example have first char
capitalized, and result is the same. Seems that lower() can't lower
multibyte character.

db=> select lower('Зелен');
lower
-------
Зелен
(1 запись)

--
alexander lunyov
lan@startatom.ru

#7Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: alexander lunyov (#6)
hackersgeneral
Re: russian case-insensitive regexp search not working

On Tue, Jul 10, 2007 at 08:40:24AM +0400, alexander lunyov wrote:

Just to clarify: lower() on both sides of a comparison
should still work as expected on multibyte encodings ? It's
been suggested here before.

lower() on both sides also does not working in my case, it still search for
case-sensitive data. String in this example have first char capitalized,
and result is the same. Seems that lower() can't lower multibyte character.

db=> select lower('Зелен');

Well, no,

select my_string where lower(my_string) ~ lower(search_fragment);

Does that help ?

(~ does work for eg. German in my experience)

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#8alexander lunyov
lan@startatom.ru
In reply to: Karsten Hilbert (#7)
hackersgeneral
Re: russian case-insensitive regexp search not working

Karsten Hilbert wrote:

Just to clarify: lower() on both sides of a comparison
should still work as expected on multibyte encodings ? It's
been suggested here before.

lower() on both sides also does not working in my case, it still search for
case-sensitive data. String in this example have first char capitalized,
and result is the same. Seems that lower() can't lower multibyte character.

db=> select lower('Зелен');

Well, no,

With this i just wanted to say that lower() doesn't work at all on
russian unicode characters, even in "select lower('String')" 'String'
don't become lowercase, and further it does not work in more complex
select statement.

select my_string where lower(my_string) ~ lower(search_fragment);

Does that help ?

(~ does work for eg. German in my experience)

No, for russian unicode strings it is not working.
I searched pgsql-patches@ list and found there this thread:
http://archives.postgresql.org/pgsql-patches/2007-06/msg00021.php
I wrote Andrew (he didn't answer yet) about whether this patch can
help with my problem.

P.S.: if this issue is a known bug (as we talked earlier), then how long
will it take to fix it? I know little about postgresql development
process, maybe you know it little better?

--
alexander lunyov
lan@startatom.ru

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: alexander lunyov (#8)
hackersgeneral
Re: russian case-insensitive regexp search not working

alexander lunyov <lan@startatom.ru> writes:

With this i just wanted to say that lower() doesn't work at all on
russian unicode characters,

In that case you're using the wrong locale (ie, not russian unicode).
Check "show lc_ctype".

Or [ checks back in thread... ] maybe you're using the wrong operating
system. Not so long ago FreeBSD didn't have Unicode locale support at
all; I'm not sure if 6.2 has that problem but it is worth checking.
Does it work for you to do case-insensitive russian comparisons in
"grep", for instance?

regards, tom lane

#10alexander lunyov
lan@startatom.ru
In reply to: Tom Lane (#9)
hackersgeneral
Re: russian case-insensitive regexp search not working

Tom Lane wrote:

alexander lunyov <lan@startatom.ru> writes:

With this i just wanted to say that lower() doesn't work at all on
russian unicode characters,

In that case you're using the wrong locale (ie, not russian unicode).
Check "show lc_ctype".

db=> SHOW LC_CTYPE;
lc_ctype
----------
C
(1 запись)

db=> SHOW LC_COLLATE;
lc_collate
------------
C
(1 запись)

Where can i change this? Trying to SET this parameters gives error
"parameter "lc_collate" cannot be changed"

Or [ checks back in thread... ] maybe you're using the wrong operating
system. Not so long ago FreeBSD didn't have Unicode locale support at
all; I'm not sure if 6.2 has that problem but it is worth checking.
Does it work for you to do case-insensitive russian comparisons in
"grep", for instance?

I put to textfile 3 russian strings with different case of first char
and grep'ed them all:

# cat > textfile
Зеленая
Зеленодольская
зеленая
# grep -i зелен *
textfile:Зеленая
textfile:Зеленодольская
textfile:зеленая

So i think system is fine about unicode.

--
alexander lunyov
lan@startatom.ru

#11Oleg Bartunov
oleg@sai.msu.su
In reply to: alexander lunyov (#10)
hackersgeneral
Re: russian case-insensitive regexp search not working

alexander,

lc_ctype and lc_collate can be changed only at initdb !
You need to read localization chapter
http://www.postgresql.org/docs/current/static/charset.html

Oleg
On Thu, 12 Jul 2007, alexander lunyov wrote:

Tom Lane wrote:

alexander lunyov <lan@startatom.ru> writes:

With this i just wanted to say that lower() doesn't work at all on
russian unicode characters,

In that case you're using the wrong locale (ie, not russian unicode).
Check "show lc_ctype".

db=> SHOW LC_CTYPE;
lc_ctype
----------
C
(1 запись)

db=> SHOW LC_COLLATE;
lc_collate
------------
C
(1 запись)

Where can i change this? Trying to SET this parameters gives error "parameter
"lc_collate" cannot be changed"

Or [ checks back in thread... ] maybe you're using the wrong operating
system. Not so long ago FreeBSD didn't have Unicode locale support at
all; I'm not sure if 6.2 has that problem but it is worth checking.
Does it work for you to do case-insensitive russian comparisons in
"grep", for instance?

I put to textfile 3 russian strings with different case of first char and
grep'ed them all:

# cat > textfile
Зеленая
Зеленодольская
зеленая
# grep -i зелен *
textfile:Зеленая
textfile:Зеленодольская
textfile:зеленая

So i think system is fine about unicode.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#12alexander lunyov
lan@startatom.ru
In reply to: Oleg Bartunov (#11)
hackersgeneral
Re: russian case-insensitive regexp search not working

Oleg Bartunov wrote:

alexander,

lc_ctype and lc_collate can be changed only at initdb !
You need to read localization chapter
http://www.postgresql.org/docs/current/static/charset.html

Yes, i knew about this, but i thought maybe somehow it can be changed
onthefly.

... (10 minutes later)

Yes, now when initdb done with --locale=ru_RU.UTF-8,
lower('RussianString') gives me 'russianstring', though, case-insensiive
regexp still not working. I guess i'll stick with lower() ~ lower()
construction.

And thanks everybody who replied!

Oleg
On Thu, 12 Jul 2007, alexander lunyov wrote:

Tom Lane wrote:

alexander lunyov <lan@startatom.ru> writes:

With this i just wanted to say that lower() doesn't work at all on
russian unicode characters,

In that case you're using the wrong locale (ie, not russian unicode).
Check "show lc_ctype".

db=> SHOW LC_CTYPE;
lc_ctype
----------
C
(1 запись)

db=> SHOW LC_COLLATE;
lc_collate
------------
C
(1 запись)

Where can i change this? Trying to SET this parameters gives error
"parameter "lc_collate" cannot be changed"

Or [ checks back in thread... ] maybe you're using the wrong operating
system. Not so long ago FreeBSD didn't have Unicode locale support at
all; I'm not sure if 6.2 has that problem but it is worth checking.
Does it work for you to do case-insensitive russian comparisons in
"grep", for instance?

I put to textfile 3 russian strings with different case of first char
and grep'ed them all:

# cat > textfile
Зеленая
Зеленодольская
зеленая
# grep -i зелен *
textfile:Зеленая
textfile:Зеленодольская
textfile:зеленая

So i think system is fine about unicode.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
alexander lunyov
lan@startatom.ru

#13Oleg Bartunov
oleg@sai.msu.su
In reply to: alexander lunyov (#1)
hackersgeneral
Re: [GENERAL] russian case-insensitive regexp search not working

On Thu, 12 Jul 2007, alexander lunyov wrote:

Oleg Bartunov wrote:

alexander,

lc_ctype and lc_collate can be changed only at initdb !
You need to read localization chapter
http://www.postgresql.org/docs/current/static/charset.html

Yes, i knew about this, but i thought maybe somehow it can be changed
onthefly.

... (10 minutes later)

Yes, now when initdb done with --locale=ru_RU.UTF-8, lower('RussianString')
gives me 'russianstring', though, case-insensiive regexp still not working. I

confirmed, checked with --locale=ru_RU.UTF-8 and 8.2.4,CVS HEAD.
No problem with --locale ru_RU.KOI8-R

guess i'll stick with lower() ~ lower() construction.

And thanks everybody who replied!

Oleg
On Thu, 12 Jul 2007, alexander lunyov wrote:

Tom Lane wrote:

alexander lunyov <lan@startatom.ru> writes:

With this i just wanted to say that lower() doesn't work at all on
russian unicode characters,

In that case you're using the wrong locale (ie, not russian unicode).
Check "show lc_ctype".

db=> SHOW LC_CTYPE;
lc_ctype
----------
C
(1 запись)

db=> SHOW LC_COLLATE;
lc_collate
------------
C
(1 запись)

Where can i change this? Trying to SET this parameters gives error
"parameter "lc_collate" cannot be changed"

Or [ checks back in thread... ] maybe you're using the wrong operating
system. Not so long ago FreeBSD didn't have Unicode locale support at
all; I'm not sure if 6.2 has that problem but it is worth checking.
Does it work for you to do case-insensitive russian comparisons in
"grep", for instance?

I put to textfile 3 russian strings with different case of first char and
grep'ed them all:

# cat > textfile
Зеленая
Зеленодольская
зеленая
# grep -i зелен *
textfile:Зеленая
textfile:Зеленодольская
textfile:зеленая

So i think system is fine about unicode.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83