Re: Multibyte support in oracle_compat.c
I found a bug in src/backend/utils/adt/oracle_compat.c, and your name appears there in connection with the multibyte enhancement, so I am writing to you.
The functions upper, lower and initcap do not work with UTF-8 data containing non-Latin letters. At work I build databases for Russian users, and when I tried to use the Unicode encoding for a database with the Russian alphabet, these functions did not work. So I wrote some patches, because I don't think the problem lies in a shell variable such as LANG or LC_CTYPE. Since I don't know any
languages other than Russian and English, I wrote a small test (test.tar.gz) only for them. Run it before and after patching and feel the difference :). And by the way, do the encodings (and their corresponding languages) EUC_JP, EUC_CN, EUC_KR and EUC_TW have meaningful upper, lower and initcap operations?
Regards, Eugene.
For EUC_JP, there is no upper, lower or initcap. I'm not sure about
other languages.
P.S. It doesn't seem bad to me to use libunicode instead of functions like mbtowc/wctomb from stdlib and towupper/towlower from wctype, but maybe somebody will find a solution based on those or on another library?
I'm not sure. What do you think, Peter, or anyone else who is familiar
with Unicode?
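(For illustration only, here is a minimal standalone sketch of the mbstowcs/towupper/wcstombs approach mentioned in the P.S. above. The function name is made up, error handling is kept to a bare minimum, and it assumes the process runs under a UTF-8 LC_CTYPE locale, whose exact name varies between systems.)

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* hypothetical helper, not backend code: uppercase a UTF-8 string by
 * converting to wide characters, applying towupper(), and converting back */
static char *
utf8_upper_sketch(const char *src)
{
    size_t      nchars = mbstowcs(NULL, src, 0);   /* count wide characters */
    wchar_t    *wbuf;
    char       *result;
    size_t      i;

    if (nchars == (size_t) -1)
        return NULL;            /* invalid multibyte sequence */

    wbuf = malloc((nchars + 1) * sizeof(wchar_t));
    result = malloc(nchars * MB_CUR_MAX + 1);
    if (wbuf == NULL || result == NULL)
        return NULL;

    mbstowcs(wbuf, src, nchars + 1);
    for (i = 0; i < nchars; i++)
        wbuf[i] = towupper(wbuf[i]);    /* locale-aware per-character mapping */
    wcstombs(result, wbuf, nchars * MB_CUR_MAX + 1);

    free(wbuf);
    return result;
}

int
main(void)
{
    char       *up;

    setlocale(LC_ALL, "ru_RU.utf8");    /* any installed UTF-8 locale should do */
    up = utf8_upper_sketch("привет");
    printf("%s\n", up ? up : "(conversion failed)");
    free(up);
    return 0;
}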
BTW, I don't like your patches. If there's no unicode.h, configure
aborts with:
configure: error: header file <unicode.h> is required for unicode support
which does not seem acceptable to me. I suggest you #ifdef out the Unicode
upper, lower and initcap support if libunicode and/or unicode.h are not
found on the system.
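(Something along these lines is presumably what is meant. HAVE_UNICODE_H would be defined by configure when <unicode.h> is found, and encoding_is_utf8() is only a placeholder for whatever check against the database encoding the real code would use; the function signature is also simplified for illustration.)

#include <ctype.h>

#ifdef HAVE_UNICODE_H
#include <unicode.h>
extern int encoding_is_utf8(void);  /* placeholder, not a real backend function */
#endif

char *
upper_sketch(char *string)
{
#ifdef HAVE_UNICODE_H
    if (encoding_is_utf8())     /* placeholder for the real encoding check */
    {
        /* libunicode-based case conversion would go here */
    }
    else
#endif
    {
        /* existing single-byte path: plain toupper() byte by byte */
        char       *ptr = string;

        while (*ptr)
        {
            *ptr = toupper((unsigned char) *ptr);
            ptr++;
        }
    }
    return string;
}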
--
Tatsuo Ishii
(I have included the patches for review purposes.)
Tatsuo Ishii writes:
The functions upper, lower and initcap do not work with UTF-8 data
The backend routines use the host OS locales, so look there. On my
machine I have several Russian locales, which seem to address the issue of
character sets:
ru_RU
ru_RU.koi8r
ru_RU.utf8
ru_UA
russian
This is bogus, because the LC_CTYPE choice is cluster-wide and the
encoding choice is database-specific (in other words: it's broken), but
there's nothing we can do about that right now.
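(To make the dependency concrete, here is a small standalone program, not backend code, showing that the byte-wise toupper() path is entirely at the mercy of the LC_CTYPE locale. It assumes a ru_RU.koi8r locale is installed; the exact locale name differs between systems.)

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
    unsigned char koi8_a = 0xc1;                /* Cyrillic small letter a in KOI8-R */
    unsigned char utf8_a[2] = {0xd0, 0xb0};     /* the same letter in UTF-8 */

    if (setlocale(LC_CTYPE, "ru_RU.koi8r") == NULL)
        printf("ru_RU.koi8r locale not installed\n");

    /* With a matching single-byte locale this prints 0xe1, the capital letter. */
    printf("koi8-r: 0x%02x -> 0x%02x\n", koi8_a, toupper(koi8_a));

    /*
     * In UTF-8 the letter is a two-byte sequence, and running toupper()
     * over the individual bytes cannot produce anything useful -- which is
     * exactly the reported upper/lower/initcap problem.
     */
    printf("utf-8 : 0x%02x 0x%02x -> 0x%02x 0x%02x\n",
           utf8_a[0], utf8_a[1],
           (unsigned int) toupper(utf8_a[0]),
           (unsigned int) toupper(utf8_a[1]));
    return 0;
}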
P.S. It doesn't seem bad to me to use libunicode instead of functions like mbtowc/wctomb from stdlib and towupper/towlower from wctype
I'm not sure. What do you think, Peter, or anyone else who is familiar
with Unicode?
I don't know what that libunicode is, but that shouldn't prevent us from
possibly evaluating it. :-)
Btw., I just happened to think about this very issue over the last few
days. What I would like to attack for the next release is to implement
character classification and conversion using the Unicode tables so we can
cut the LC_CTYPE system locale out of the picture. Perhaps this is what
the poster was thinking of, too.
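(A rough sketch of what table-driven, locale-independent case conversion on code points could look like; the table below covers only ASCII and the basic Cyrillic block for illustration, whereas a real implementation would generate its tables from UnicodeData.txt.)

#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t    first;          /* first code point of the range */
    uint32_t    last;           /* last code point of the range */
    int32_t     delta;          /* added to reach the uppercase code point */
} case_range;

static const case_range upper_map[] = {
    {0x0061, 0x007A, -32},      /* a-z -> A-Z */
    {0x0430, 0x044F, -32},      /* Cyrillic a..ya -> A..YA */
    {0x0451, 0x0451, -80},      /* Cyrillic io (yo) -> IO */
};

static uint32_t
unicode_toupper_sketch(uint32_t cp)
{
    size_t      i;

    for (i = 0; i < sizeof(upper_map) / sizeof(upper_map[0]); i++)
        if (cp >= upper_map[i].first && cp <= upper_map[i].last)
            return cp + upper_map[i].delta;
    return cp;                  /* no mapping: return the code point unchanged */
}

int
main(void)
{
    /* U+0436 CYRILLIC SMALL LETTER ZHE -> U+0416 */
    printf("U+%04X -> U+%04X\n", 0x0436, unicode_toupper_sketch(0x0436));
    return 0;
}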
--
Peter Eisentraut peter_e@gmx.net
On Thu, 5 Sep 2002, Peter Eisentraut wrote:
Tatsuo Ishii writes:
The functions upper, lower and initcap do not work with UTF-8 data
The backend routines use the host OS locales, so look there. On my
machine I have several Russian locales, which seem to address the issue of
character sets:
ru_RU
ru_RU.koi8r
ru_RU.utf8
ru_UA
russian
Yeah, our character sets are a major pain for internationalization. And the
above list is not exhaustive. I guess you are right; for the time being
you'll have to bear with it.
-s
The backend routines use the host OS locales, so look there. On my
machine I have several Russian locales, which seem to address the issue of
character sets:
ru_RU
ru_RU.koi8r
ru_RU.utf8
ru_UA
russian
This is bogus, because the LC_CTYPE choice is cluster-wide and the
encoding choice is database-specific (in other words: it's broken), but
there's nothing we can do about that right now.
I thought his idea was to use a UTF-8 locale together with a Unicode (UTF-8)
encoded database.
Btw., I just happened to think about this very issue over the last few
days. What I would like to attack for the next release is to implement
character classification and conversion using the Unicode tables so we can
cut the LC_CTYPE system locale out of the picture. Perhaps this is what
the poster was thinking of, too.
Interesting idea. If you are saying that you are going to remove the
dependency on the system locale, I agree with your idea.
BTW, NLS has the same problem as above, no? I guess NLS depends on the locale,
and it may conflict with the database-specific encoding and/or the
automatic FE/BE encoding conversion.
--
Tatsuo Ishii
Tatsuo Ishii writes:
BTW, NLS has the same problem as above, no? I guess NLS depends on the locale,
and it may conflict with the database-specific encoding and/or the
automatic FE/BE encoding conversion.
GNU gettext does its own encoding conversion. It reads the program's
character encoding from the LC_CTYPE locale and converts the material in
the translation catalogs on the fly for output. This is great in general,
really, but for the postmaster it's a problem. If LC_CTYPE is fixed for
the cluster and you later change your mind about the message language,
then it will be recoded into the character set that LC_CTYPE says. And if
that character set does not match the one that is set as the backend
encoding internally then who knows what will happen when this stuff is
recoded again as it's sent to the client. Big, big mess.
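(For what it's worth, GNU gettext does offer bind_textdomain_codeset(), which pins the output charset of a catalog independently of LC_CTYPE. Whether and how to tie that to the backend encoding is an open question; the minimal illustration below assumes a text domain name, catalog path and message string that are not the real values.)

#include <libintl.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
    setlocale(LC_ALL, "");      /* message language comes from the environment */

    /* domain name and catalog path are assumptions, not the real values */
    bindtextdomain("postgres", "/usr/local/pgsql/share/locale");
    textdomain("postgres");

    /*
     * Force catalog output into a fixed encoding instead of whatever the
     * LC_CTYPE codeset happens to be; in the backend this would have to
     * follow the database encoding.
     */
    bind_textdomain_codeset("postgres", "UTF-8");

    printf("%s\n", gettext("could not open file"));
    return 0;
}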
--
Peter Eisentraut peter_e@gmx.net
GNU gettext does its own encoding conversion. It reads the program's
character encoding from the LC_CTYPE locale and converts the material in
the translation catalogs on the fly for output. This is great in general,
really, but for the postmaster it's a problem. If LC_CTYPE is fixed for
the cluster and you later change your mind about the message language,
then it will be recoded into the character set that LC_CTYPE says. And if
that character set does not match the one that is set as the backend
encoding internally then who knows what will happen when this stuff is
recoded again as it's sent to the client. Big, big mess.
Then, in other words, it's completely broken. Sigh.
--
Tatsuo Ishii