Re: [PATCHES] Postgres-6.3.2 locale patch
Hi. I'm looking for non-English-using Postgres hackers to participate in
implementing NCHAR() and alternate character sets in Postgres. I think
I've worked out how to do the implementation (not the details, just a
strategy) so that multiple character sets will be allowed in a single
database, additional character sets can be loaded at run-time, and so
that everything will behave transparently.
I would propose to do this for v6.4 as user-defined packages (with
compile-time parser support) on top of the existing USE_LOCALE and MB
patches so that the existing compile-time options are not changed or
damaged.
So, the initial questions:
1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
non-English applications? Do other databases use this SQL92 convention,
or does it have difficulties?
2) Would anyone be interested in helping to define the character sets
and helping to test? I don't know the correct collation sequences and
don't think they would display properly on my screen...
3) I'd like to implement the existing Cyrillic and EUC-jp character
sets, and also some European languages (French and ??) which use the
Latin-1 alphabet but might have different collation sequences. Any
suggestions for candidates??
- Tom
Hi Tom,
I would propose to do this for v6.4 as user-defined packages (with
compile-time parser support) on top of the existing USE_LOCALE and MB
patches so that the existing compile-time options are not changed or
damaged.
Be careful: system locales may not be available, even though you may need the
locale information in Postgres. They may also be broken (which is in fact
often the case), so don't depend on them.
So, the initial questions:
1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
non-English applications? Do other databases use this SQL92 convention,
or does it have difficulties?
Don't know (yet).
2) Would anyone be interested in helping to define the character sets
and helping to test? I don't know the correct collation sequences and
don't think they would display properly on my screen...
I can help with French, Icelandic, German and Norwegian (though for the
last two, I guess there are more appropriate people on this list :).
3) I'd like to implement the existing Cyrillic and EUC-jp character
sets, and also some European languages (French and ??) which use the
Latin-1 alphabet but might have different collation sequences. Any
suggestions for candidates??
They all do, as soon as we take care of accents, which are all sorted at
the end under an English system. And of course, the collation sequences
differ for each language :)
Patrice
PS: I'm sorry, Tom, I haven't been able to work on the FAQ for the past
month :(( because I've been busy in my free time learning Norwegian! I
will submit something very soon, I promise!
--
Patrice HÉDÉ --------------------------------- patrice@idf.net -----
... Looking for a job in Iceland or in Norway !
Ingénieur informaticien - Computer engineer - Tölvufræðingur
----- http://www.idf.net/patrice/ ----------------------------------
Hi. I'm looking for non-English-using Postgres hackers to participate in
implementing NCHAR() and alternate character sets in Postgres. I think
I've worked out how to do the implementation (not the details, just a
strategy) so that multiple character sets will be allowed in a single
database, additional character sets can be loaded at run-time, and so
that everything will behave transparently.
Sounds like an interesting idea... But before going into the discussion, let me
clarify what "character set" means. A character set consists of
some characters. One of the most famous character sets is ISO 646
(almost the same as ASCII). In western Europe, the ISO 8859 series of character
sets is widely used. For example, ISO 8859-1 covers English,
French, German, etc., and ISO 8859-2 covers Albanian, Romanian,
etc. These are "single byte" and there is a one-to-many correspondence
between a character set and languages.
Example1:
ISO 8859-1 <------> English, French, German
On the other hand, some Asian languages such as Japanese, Chinese, and
Korean do not correspond to a single character set; rather, they correspond
to multiple character sets.
Example2:
ASCII, JIS X0208, JIS X0201, JIS X0212 <-------> Japanese
(ASCII, JIS X0208, JIS X0201, JIS X0212 are individual character sets)
An "encoding" is a way to represent set of charactser sets in
computers. The above set of characters sets are encoded in the EUC_JP
encdoing.
I think SQL92 uses a term "character set" as encoding.
So, the initial questions:
1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
non-English applications? Do other databases use this SQL92 convention,
or does it have difficulties?
As far as I know, there is no commercial RDBMS that supports the
NCHAR/NVARCHAR/CHARACTER SET syntax. Oracle supports multiple
encodings. The encoding for a database is defined while creating the
database and cannot be changed at runtime. Clients can use a different
encoding as long as it is a "subset" of the database's encoding. For
example, an Oracle client can use ASCII if the database encoding is
EUC_JP.
I think the idea of the "default" encoding for a database being
defined at database creation time is a nice one.
create database with encoding EUC_JP;
If the NCHAR/NVARCHAR/CHARACTER SET syntax were supported, a user
could use an encoding other than EUC_JP. Sounds very nice too.
2) Would anyone be interested in helping to define the character sets
and helping to test? I don't know the correct collation sequences and
don't think they would display properly on my screen...
I would be able to help you in the Japanese part. For Chinese and
Korean, I'm going to find volunteers in the local PostgreSQL mailing
list I'm running if necessary.
3) I'd like to implement the existing Cyrillic and EUC-jp character
sets, and also some European languages (French and ??) which use the
Latin-1 alphabet but might have different collation sequences. Any
suggestions for candidates??
Collation sequences for EUC_JP? How nice that would be! One problem
with collation sequences for multi-byte encodings is that the sequence
might become huge. It seems you have a solution for that. Please let me
know more details.
--
Tatsuo Ishii
t-ishii@sra.co.jp
Hello!
On Wed, 3 Jun 1998, Thomas G. Lockhart wrote:
Hi. I'm looking for non-English-using Postgres hackers to participate in
implementing NCHAR() and alternate character sets in Postgres. I think
I've worked out how to do the implementation (not the details, just a
strategy) so that multiple character sets will be allowed in a single
database, additional character sets can be loaded at run-time, and so
that everything will behave transparently.
All this sounds nice, but I am afraid the job is not for me. Actually I
am very new to the Postgres and SQL world. I started to learn SQL 3 months ago;
I started to play with Postgres 2 months ago. I started to hack the Postgres
sources (around locale) a little more than a month ago.
2) Would anyone be interested in helping to define the character sets
and helping to test? I don't know the correct collation sequences and
don't think they would display properly on my screen...
It would be nice to test it, provided that it wouldn't break existing
code. Our site is running hundreds of CGIs that rely on the current locale
support in Postgres...
Oleg.
----
Oleg Broytmann http://members.tripod.com/~phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:
Hi. I'm looking for non-English-using Postgres hackers to participate in
implementing NCHAR() and alternate character sets in Postgres. I think
I've worked out how to do the implementation (not the details, just a
strategy) so that multiple character sets will be allowed in a single
database, additional character sets can be loaded at run-time, and so
that everything will behave transparently.
Ok, I'm English, but I'll keep a close eye on this topic as the JDBC
driver has two methods that handle Unicode strings.
Currently, they simply call the Ascii/Binary methods. But they could (when
NCHAR/NVARCHAR/CHARACTER SET is the column's type) handle the translation
between the character set and Unicode.
I would propose to do this for v6.4 as user-defined packages (with
compile-time parser support) on top of the existing USE_LOCALE and MB
patches so that the existing compile-time options are not changed or
damaged.
In the same vein, for getting JDBC up to speed with this, we may need to
have a function on the backend that will handle the translation between
the encoding and Unicode. This would allow the JDBC driver to
automatically handle a new character set without having to write a class
for each package.
--
Peter Mount, peter@maidstone.gov.uk
Postgres email to peter@taer.maidstone.gov.uk & peter@retep.org.uk
Remember, this is my work email, so please CC my home address, as I may
not always have time to reply from work.
In the same vein, for getting JDBC up to speed with this, we may need to
have a function on the backend that will handle the translation between
the encoding and Unicode. This would allow the JDBC driver to
automatically handle a new character set without having to write a class
for each package.
I already have a patch to handle the translation on the backend
between the encoding and SJIS (yet another encoding for Japanese).
Translations for other encodings such as Big5 (Chinese) and Unicode are
in my plan.
The biggest problem for Unicode is that the translation is not
symmetrical. Translating an encoding to Unicode is OK. However, Unicode to an
encoding is one-to-many. The reason for that is "unification": a
code point of Unicode might correspond to either Chinese, Japanese, or
Korean. To determine which, we need additional information about what language
we are using. Too bad. Any ideas?
---
Tatsuo Ishii
t-ishii@sra.co.jp
On Thu, 4 Jun 1998 t-ishii@sra.co.jp wrote:
In the same vein, for getting JDBC up to speed with this, we may need to
have a function on the backend that will handle the translation between
the encoding and Unicode. This would allow the JDBC driver to
automatically handle a new character set without having to write a class
for each package.
I already have a patch to handle the translation on the backend
between the encoding and SJIS (yet another encoding for Japanese).
Translations for other encodings such as Big5 (Chinese) and Unicode are
in my plan.
The biggest problem for Unicode is that the translation is not
symmetrical. Translating an encoding to Unicode is OK. However, Unicode to an
encoding is one-to-many. The reason for that is "unification": a
code point of Unicode might correspond to either Chinese, Japanese, or
Korean. To determine which, we need additional information about what language
we are using. Too bad. Any ideas?
I'm not sure. I brought this up as it's something that I feel should be
done somewhere in the backend, rather than in the clients, and should be
thought about at this stage.
I was thinking along the lines of a function that handles the translation
between any two given encodings (i.e. it's told what the initial and final
encodings are) and returns the translated string (be it single- or
multi-byte). It could then throw an error if translation between the
two encodings is not possible, or (optionally) if part of the
translation would fail.
Also, having this in the backend would allow all the interfaces access to
international encodings without too much work. Adding a new encoding can
then be done just on the server (say by adding a module), without having
to recompile/link everything else.
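As a rough sketch of what I mean (the function name and call below are purely
hypothetical, nothing like this exists in the backend yet), the interfaces
could simply ask the backend to do the conversion:

  -- hypothetical: convert(string, source encoding, target encoding)
  SELECT convert('some text', 'EUC_JP', 'UNICODE');

If a mapping between the two encodings is not possible, the backend would
raise an error rather than return a half-translated string.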
--
Peter Mount, peter@maidstone.gov.uk
Postgres email to peter@taer.maidstone.gov.uk & peter@retep.org.uk
Remember, this is my work email, so please CC my home address, as I may
not always have time to reply from work.
On Thu, 4 Jun 1998 t-ishii@sra.co.jp wrote:
Hi. I'm looking for non-English-using Postgres hackers to participate in
implementing NCHAR() and alternate character sets in Postgres. I think
I've worked out how to do the implementation (not the details, just a
strategy) so that multiple character sets will be allowed in a single
database, additional character sets can be loaded at run-time, and so
that everything will behave transparently.
Sounds like an interesting idea... But before going into the discussion, let me
clarify what "character set" means. A character set consists of
some characters. One of the most famous character sets is ISO 646
(almost the same as ASCII). In western Europe, the ISO 8859 series of character
sets is widely used. For example, ISO 8859-1 covers English,
French, German, etc., and ISO 8859-2 covers Albanian, Romanian,
etc. These are "single byte" and there is a one-to-many correspondence
between a character set and languages.
Example1:
ISO 8859-1 <------> English, French, German
On the other hand, some Asian languages such as Japanese, Chinese, and
Korean do not correspond to a single character set; rather, they correspond
to multiple character sets.
Example2:
ASCII, JIS X0208, JIS X0201, JIS X0212 <-------> Japanese
(ASCII, JIS X0208, JIS X0201, JIS X0212 are individual character sets)
An "encoding" is a way to represent a set of character sets in
computers. The above set of character sets is encoded in the EUC_JP
encoding.
I think SQL92 uses the term "character set" to mean an encoding.
So, the initial questions:
1) Is the NCHAR/NVARCHAR/CHARACTER SET syntax and usage acceptable for
non-English applications? Do other databases use this SQL92 convention,
or does it have difficulties?
As far as I know, there is no commercial RDBMS that supports the
NCHAR/NVARCHAR/CHARACTER SET syntax. Oracle supports multiple
encodings. The encoding for a database is defined while creating the
database and cannot be changed at runtime. Clients can use a different
encoding as long as it is a "subset" of the database's encoding. For
example, an Oracle client can use ASCII if the database encoding is
EUC_JP.
I tried the following databases on Linux and none of them has this feature:
. MySql
. Solid
. Empress
. Kubl
. ADABAS D
I found only one under M$-Windows that implements this feature:
. OCELOT
I'm playing with it, but so far I don't understand its behavior.
There's some interesting documentation about it in the OCELOT manual;
if you want, I can send it to you.
I think the idea of the "default" encoding for a database being
defined at database creation time is a nice one.
create database with encoding EUC_JP;
If the NCHAR/NVARCHAR/CHARACTER SET syntax were supported, a user
could use an encoding other than EUC_JP. Sounds very nice too.
2) Would anyone be interested in helping to define the character sets
and helping to test? I don't know the correct collation sequences and
don't think they would display properly on my screen...
I would be able to help you in the Japanese part. For Chinese and
Korean, I'm going to find volunteers in the local PostgreSQL mailing
list I'm running if necessary.
I may help with Italian, Spanish and Portuguese.
3) I'd like to implement the existing Cyrillic and EUC-jp character
sets, and also some European languages (French and ??) which use the
Latin-1 alphabet but might have different collation sequences. Any
suggestions for candidates??
Collation sequences for EUC_JP? How nice that would be! One problem
with collation sequences for multi-byte encodings is that the sequence
might become huge. It seems you have a solution for that. Please let me
know more details.
--
Tatsuo Ishii
t-ishii@sra.co.jp
Ciao, Jose'
Sounds like an interesting idea... But before going into the discussion, let me
clarify what "character set" means.
An "encoding" is a way to represent a set of character sets in
computers.
I think SQL92 uses the term "character set" to mean an encoding.
I have found the SQL92 terminology confusing, because they do not seem
to make the nice clear distinction between encoding and collation
sequence which you have pointed out. I suppose that there can be an
issue of visual appearance of an alphabet for different locales also.
afaik, SQL92 uses the term "character set" to mean an encoding with an
implicit collation sequence. SQL92 allows alternate collation sequences
to be specified for a "character set" when it can be made meaningful.
I would propose to implement
VARCHAR(length) WITH CHARACTER SET setname
as a type with a type name of, for example, "VARSETNAME". This type
would have the comparison functions and operators which implement
collation sequences.
I would propose to implement
VARCHAR(length) WITH CHARACTER SET setname COLLATION collname
as a type with a name of, for example, "VARCOLLNAME". For the EUC-jp
encoding, "collname" could be "Korean" or "Japanese" so the type name
would become "varkorean" or "varjapanese". Don't know for sure yet
whether this is adequate, but other possibilities can be used if
necessary.
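To make the proposal concrete, a table definition using this syntax might
look roughly like the following (the table and column names are only
illustrative, and the exact spelling of the clauses is still open):

  CREATE TABLE addresses (
    name VARCHAR(40) WITH CHARACTER SET LATIN1,
    city VARCHAR(40) WITH CHARACTER SET EUC_JP COLLATION JAPANESE
  );

The parser would map the first column onto a "varlatin1" type and the
second onto a "varjapanese" type, and the appropriate comparison functions
and operators would come along with each type.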
When a database is created, a default character set/collation sequence can
be specified for it; this would correspond to the
NCHAR/NVARCHAR/NTEXT types. We could implement a
SET NATIONAL CHARACTER SET = 'language';
command to determine the default character set for the session when
NCHAR is used.
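For example (a sketch of the intended behavior only; none of this is
implemented yet):

  SET NATIONAL CHARACTER SET = 'FRENCH';
  CREATE TABLE clients (nom NCHAR VARYING(30));

The "nom" column would then pick up the French character set and collation,
since NCHAR falls back to the session (or database) default.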
The SQL92 technique for specifying an encoding/collation sequence in a
literal string is
_language 'string'
so for example to specify a string in the French language (implying an
encoding, collation, and representation?) you would use
_FRENCH 'string'
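So a query comparing a column against a French literal might read (again,
illustrative only):

  SELECT * FROM clients WHERE nom = _FRENCH 'Hélène';

and the comparison would then use the French collation sequence.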
I would be able to help you in the Japanese part. For Chinese and
Korean, I'm going to find volunteers in the local PostgreSQL mailing
list I'm running if necessary.
I may help with Italian, Spanish and Portuguese.
Great, and perhaps Oleg could help test with Cyrillic (I assume I can
steal code from the existing "CYR_LOCALE" blocks in the Postgres
backend).
Collation sequences for EUC_JP? How nice it would be! One of a
problem for collation sequences for multi-byte encodings is the
sequence might become huge. Seems you have a solution for that.
Please let me know more details.
Um, no, I just assume we can find a solution :/ I'd like to implement
the infrastructure in the Postgres parser to allow multiple
encodings/collations, and then see where we are. As I mentioned, this
would be done for v6.4 as a transparent add-on, so that existing
capabilities are not touched or damaged. Implementing everything for
some European languages (with the 1-byte Latin-1 encoding?) may be
easiest, but the Asian languages might be more fun :)
- Tom
Someone whose headers I am too lazy to retrieve wrote:
On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:
Hi. I'm looking for non-English-using Postgres hackers to participate in
implementing NCHAR() and alternate character sets in Postgres. I think
...
Currently, they simply call the Ascii/Binary methods. But they could (when
NCHAR/NVARCHAR/CHARACTER SET is the column's type) handle the translation
between the character set and Unicode.
I would propose to do this for v6.4 as user-defined packages (with
compile-time parser support) on top of the existing USE_LOCALE and MB
patches so that the existing compile-time options are not changed or
damaged.
In the same vein, for getting JDBC up to speed with this, we may need to
have a function on the backend that will handle the translation between
the encoding and Unicode. This would allow the JDBC driver to
automatically handle a new character set without having to write a class
for each package.
Just an observation or two on the topic of internationalization:
Illustra went to Unicode internally. This allowed things like kanji table
names, etc. It worked, but it was very costly in terms of work, bugs, and
especially performance, although we eventually got most of it back.
Then we created encodings (char set, sort order, error messages, etc.) for
a bunch of languages. Then we made 8-bit chars convert to Unicode and
assumed 7-bit chars were 7-bit ASCII.
This worked and was in some sense "the right thing to do".
But the European customers hated it. Before, when we were "plain ole
Amuricans, don't hold with this furrin stuff", we ignored 8- vs 7-bit
issues and the Europeans were free to stick any characters they wanted
in and get them out unchanged, and it was just as fast as anything else.
When we changed to Unicode and 7- vs 8-bit sensitivity, it forced everyone
to install an encoding and store their data in Unicode. Needless to say,
customers in e.g. Germany did not want to double their disk space and give
up performance to do something only a little better than they could do
already.
Ultimately, we backed it out and allowed 8-bit chars again. You could still
get Unicode, but except for Asian sites it was not widely used, and even in
Asia it was not universally popular.
Bottom line, I am not opposed to internationalization. But, it is harder
even than it looks. And some of the "correct" technical solutions turn
out to be pretty annoying in the real world.
So, having it as an add-on is fine. Providing support in the core is fine
too. An incremental approach of perhaps adding sort orders for 8-bit char
sets today and something else next release might be OK. But be very, very
careful: do not assume that the "popular" solutions are usable, and do not
try to solve the "whole" problem in one grand effort.
-dg
David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"And there _is_ a real world. In fact, some of you
are in it right now." -- Gene Spafford
Hi!
On Thu, 4 Jun 1998, Thomas G. Lockhart wrote:
Great, and perhaps Oleg could help test with Cyrillic (I assume I can
steal code from the existing "CYR_LOCALE" blocks in the Postgres
backend).
Before sending my patch to pgsql-patches I gave it out to a few testers
here. It wouldn't be too hard to find testers for Cyrillic support, sure.
Oleg.
----
Oleg Broytmann http://members.tripod.com/~phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
The biggest problem for Unicode is that the translation is not
symmetrical. Translating an encoding to Unicode is OK. However, Unicode to an
encoding is one-to-many. The reason for that is "unification": a
code point of Unicode might correspond to either Chinese, Japanese, or
Korean. To determine which, we need additional information about what language
we are using. Too bad. Any ideas?
It seems not that bad for the translation from Unicode to Japanese EUC
(or SJIS or Big5), because Japanese EUC (or SJIS) has only Japanese
characters and Big5 has only Chinese characters (considering only CJK).
Right?
It would be virtually one-to-one or one-to-none when translating
from Unicode to these mono-lingual encodings.
It would not be that simple, however, to translate from Unicode to
another multi-lingual encoding (like an ISO-2022-based Mule encoding?).
Kinoshita
The biggest problem for Unicode is that the translation is not
symmetrical. Translating an encoding to Unicode is OK. However, Unicode to an
encoding is one-to-many. The reason for that is "unification": a
code point of Unicode might correspond to either Chinese, Japanese, or
Korean. To determine which, we need additional information about what language
we are using. Too bad. Any ideas?
It seems not that bad for the translation from Unicode to Japanese EUC
(or SJIS or Big5), because Japanese EUC (or SJIS) has only Japanese
characters and Big5 has only Chinese characters (considering only CJK).
Right?
It would be virtually one-to-one or one-to-none when translating
from Unicode to these mono-lingual encodings.
Oh, I was wrong. We already have the information about "what language
we are using" when trying to translate between Unicode and
Japanese EUC :-)
It would not be that simple, however, to translate from Unicode to
another multi-lingual encoding (like an ISO-2022-based Mule encoding?).
Correct.
--
Tatsuo Ishii
t-ishii@sra.co.jp
When a database is created, it can be specified with a default character
set/collation sequence for the database; this would correspond to the
NCHAR/NVARCHAR/NTEXT types. We could implement a
SET NATIONAL CHARACTER SET = 'language';
In the current implementation of MB, the encoding used by the backend is
determined at compile time. This time I would like to add more
flexibility, in that the encoding can be specified when creating a
database. I would like to add a new option to the CREATE DATABASE
statement:
CREATE DATABASE WITH ENCODING 'encoding';
I'm not sure whether this kind of thing is defined in the
standard. Suggestions?
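To make it concrete, creating a Japanese database might then look like this
(the database name here is just an example):

  CREATE DATABASE jpdb WITH ENCODING 'EUC_JP';

The backend serving that database would then use EUC_JP instead of the
compile-time default.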
--
Tatsuo Ishii
t-ishii@sra.co.jp