BUG #4319: lower()/upper() does not know about UNICODE case mapping

Started by Valentine Gogichashvilialmost 18 years ago10 messagesbugs

valgog@gmail.com

almost 18 years ago

The following bug has been logged online:

Bug reference: 4319
Logged by: Valentine Gogichashvili
Email address: valgog@gmail.com
PostgreSQL version: 8.3.1
Operating system: SuSE Linux (kernel 2.6.13-15.11-default)
Description: lower()/upper() does not know about UNICODE case mapping
Details:

Hi,

I understand, that it is more a feature, but it does not help me anyways...

On the UNICODE databases lower and upper functions are using system locale
settings (that cannot be changed after initializing DB?) and does not know
anything about UNICODE case mapping.

The problem really becomes 'a problem' on multilingual systems. I have to
store data for German, Russian and Romanian languages together.

On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually
corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to
check if how it works on the 8.3 as I do not have any db instance with the
LC_CTYPE set to en_EN.

I can understand, that LC_COLLATE is something that is not clear from the
context of the character. But case pair is always defined in the UNICODE
standard and should be not dependent of the LC_CTYPE. Or am I wrong?

Regards,

-- Valentine Gogichashvili

Alvaro Herrera

alvherre@2ndquadrant.com

almost 18 years ago

In reply to: Valentine Gogichashvili (#1)

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

Valentine Gogichashvili escribiï¿½:

On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually
corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to
check if how it works on the 8.3 as I do not have any db instance with the
LC_CTYPE set to en_EN.

I don't know about Unicode case mapping, but en_EN is not a UTF8 locale
which is why it's corrupting the data. Try en_EN.utf8 instead.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Valentine Gogichashvili

valgog@gmail.com

almost 18 years ago

In reply to: Valentine Gogichashvili (#1)

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

On Jul 21, 4:32 pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote:

Valentine Gogichashvili escribió:

On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually
corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to
check if how it works on the 8.3 as I do not have any db instance with the
LC_CTYPE set to en_EN.

I don't know about Unicode case mapping, but en_EN is not a UTF8 locale
which is why it's corrupting the data. Try en_EN.utf8 instead.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

--
Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
To make changes to your subscription:http://www.postgresql.org/mailpref/pgsql-bugs

Will en_EN.utf8 know about German, Russian, Romanian and Dutch?

Valentine Gogichashvili

valgog@gmail.com

almost 18 years ago

In reply to: Valentine Gogichashvili (#1)

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

On Jul 21, 4:32 pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote:

Valentine Gogichashvili escribió:

On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually
corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to
check if how it works on the 8.3 as I do not have any db instance with the
LC_CTYPE set to en_EN.

I don't know about Unicode case mapping, but en_EN is not a UTF8 locale
which is why it's corrupting the data. Try en_EN.utf8 instead.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

--
Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
To make changes to your subscription:http://www.postgresql.org/mailpref/pgsql-bugs

Why Postgres allows creating UNICODE database with the locale, that
can possibly corrupt my data?

Peter Eisentraut

peter_e@gmx.net

almost 18 years ago

In reply to: Valentine Gogichashvili (#4)

Re: Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

Am Tuesday, 22. July 2008 schrieb valgog:

Why Postgres allows creating UNICODE database with the locale, that
can possibly corrupt my data?

It doesn't allow it, as of 8.3. In 8.2 it does, but we have fixed that, for
the reasons that are becoming obvious to you now.

Perhaps part of the problem is that en_EN isn't actually a valid locale, as
far as I can tell, unless SUSE has invented a new country. :) Try locale -a
and pick one from that list.

Valentine Gogichashvili

valgog@gmail.com

almost 18 years ago

In reply to: Valentine Gogichashvili (#1)

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

On Jul 22, 11:53 am, pete...@gmx.net (Peter Eisentraut) wrote:

Am Tuesday, 22. July 2008 schrieb valgog:

Why Postgres allows creating UNICODE database with the locale, that
can possibly corrupt my data?

It doesn't allow it, as of 8.3. In 8.2 it does, but we have fixed that, for
the reasons that are becoming obvious to you now.

Perhaps part of the problem is that en_EN isn't actually a valid locale, as
far as I can tell, unless SUSE has invented a new country. :) Try locale -a
and pick one from that list.

--
Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
To make changes to your subscription:http://www.postgresql.org/mailpref/pgsql-bugs

Ok, I have checked in databases with locale set to one of the utf8
locales convert data correctly. But I still do not understand, why
postgres initdb allows using non-UTF8 locales together with UNICODE
database encoding.

Another question. Is it possible to change the current Database
LC_CTYPE on the database without recreating it with initdb and
reimporting all the data. I would rather prefer my indexes recreated,
rather then reimporting database (cannot afford such a long time out
of service :(

Valentine Gogichashvili

valgog@gmail.com

almost 18 years ago

In reply to: Valentine Gogichashvili (#1)

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

I checked for de_DE.utf8 (created a test database with initdb) and
lower/upper is working correctly there on both 8.2 and 8.3 in
different languages

Show quoted text

On Jul 22, 11:26 am, valgog <val...@gmail.com> wrote:

On Jul 21, 4:32 pm, alvhe...@commandprompt.com (Alvaro Herrera) wrote:

Valentine Gogichashvili escribió:

On 8.2.3 database with LC_CTYPE set to en_EN, lower() function is actually
corrupting UTF8 data, lowering UTF8 control bytes... I did have a chance to
check if how it works on the 8.3 as I do not have any db instance with the
LC_CTYPE set to en_EN.

I don't know about Unicode case mapping, but en_EN is not a UTF8 locale
which is why it's corrupting the data. Try en_EN.utf8 instead.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

--
Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
To make changes to your subscription:http://www.postgresql.org/mailpref/pgsql-bugs

Will en_EN.utf8 know about German, Russian, Romanian and Dutch?

Tomasz Ostrowski

tometzky@batory.org.pl

almost 18 years ago

In reply to: Valentine Gogichashvili (#6)

Re: Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

On 2008-07-23 11:48, valgog wrote:

Another question. Is it possible to change the current Database
LC_CTYPE on the database without recreating it with initdb and
reimporting all the data.

No.

I would rather prefer my indexes recreated, rather then reimporting
database (cannot afford such a long time out of service :(

You can use slony replication for minimal downtime (seconds), but it is
rather complicated.

You can also seriously reduce reimport time by:
- temporary disabling sync;
- temporary enabling hard disk write cache;
- importing and exporting concurrently with a pipe:
old_pgdata = /var/lib/pgsql/data
new_pgdata = /var/lib/pgsql/newdata
initdb -D $new_pgdata
postgres -p 10001 -F -D $new_pgdata &
su -c 'service postgresql stop'
postgres -p 10002 -F -D $old_pgdata &
pg_dumpall -p 10002 | pg_restore -p 10001
killall postgres
mv $old_pgdata $old_pgdata.bak
mv $new_pgdata $old_pgdata
su -c 'service postgresql start'
- importing and exporting from one disk or cluster to another

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh

Zdenek Kotala

Zdenek.Kotala@Sun.COM

almost 18 years ago

In reply to: Valentine Gogichashvili (#1)

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

Valentine Gogichashvili napsal(a):

Hi,

I understand, that it is more a feature, but it does not help me anyways...

On the UNICODE databases lower and upper functions are using system locale
settings (that cannot be changed after initializing DB?) and does not know
anything about UNICODE case mapping.

The problem really becomes 'a problem' on multilingual systems. I have to
store data for German, Russian and Romanian languages together.

I understand you and I hope things will be better. See collation per database
project: http://wiki.postgresql.org/wiki/Gsoc08-collation

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#10

Valentine Gogichashvili

valgog@gmail.com

almost 18 years ago

In reply to: Valentine Gogichashvili (#1)

Re: BUG #4319: lower()/upper() does not know about UNICODE case mapping

On Jul 25, 10:17 am, Zdenek.Kot...@Sun.COM (Zdenek Kotala) wrote:

Valentine Gogichashvili napsal(a):

Hi,

Hi

I understand, that it is more a feature, but it does not help me anyways...

On the UNICODE databases lower and upper functions are using system locale
settings (that cannot be changed after initializing DB?) and does not know
anything about UNICODE case mapping.

The problem really becomes 'a problem' on multilingual systems. I have to
store data for German, Russian and Romanian languages together.

I understand you and I hope things will be better. See collation per database
project:http://wiki.postgresql.org/wiki/Gsoc08-collation

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

--
Sent via pgsql-bugs mailing list (pgsql-b...@postgresql.org)
To make changes to your subscription:http://www.postgresql.org/mailpref/pgsql-bugs

This is actually a high time to include support for different
collates... but do you think it is enough to add the change to the
database or maybe start building the collation support for every text
column actually?