Proposal: Adding JIS X 0213 support
Hi,
I would like to propose adding a new character set, "JIS X
0213" (http://en.wikipedia.org/wiki/JIS_X_0213).
JIS X 0213 is a relatively new Japanese government standard (defined in
2000, revised in 2004) and is becoming important for Japanese
users. Moreover, some commercial OSs, including Windows Vista, support
JIS X 0213 (some open source OSs support it too, of course). So I believe
supporting JIS X 0213 in the upcoming 8.3 will be useful for Japanese
users and will help spread PostgreSQL further.
Since JIS X 0213 is a character set, we need to add encodings
that support it. Here is the list of additional encodings (the
specifications have already been published by the government).
1) EUC-JIS-2004
proposed encoding name: EUC_JIS_2004
including the following character sets:
- ASCII
- JIS X 0213 plane 1
- JIS X 0201 "katakana"
- JIS X 0213 plane 2
Note that since the encoding scheme of EUC_JIS_2004 is exactly identical
to that of EUC_JP, we can reuse the existing encoding routines defined in
utils/mb/*.c.
2) Shift-JIS-2004
proposed encoding name: SHIFT_JIS_2004
including the following character sets (same as EUC-JIS-2004):
- ASCII
- JIS X 0213 plane 1
- JIS X 0201 "katakana"
- JIS X 0213 plane 2
Note that this is a client-only encoding, for the same reason as SJIS.
Note also that since the encoding scheme of SHIFT_JIS_2004 is exactly
identical to that of SJIS, we can reuse the existing encoding routines
defined in utils/mb/*.c.
3) UTF-8
This is already supported by recent versions of PostgreSQL, so no
additional work is required.
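To make the "same encoding scheme" point concrete, here is a minimal sketch of the lead-byte rules that an EUC_JP-style and an SJIS-style byte stream follow. The function names are hypothetical and are not the routines in utils/mb; they only mirror the length rules. The last helper illustrates what is, as far as I understand, the reason SJIS is client-only: the second byte of a two-byte code can fall in the ASCII range (0x5C, the backslash, is the usual example).

```c
/* Hypothetical helper names; a sketch of the length rules only. */

/* Byte length of one character in an EUC_JP-style byte stream. */
int euc_style_mblen(const unsigned char *s)
{
    if (*s == 0x8e)     /* SS2: one byte of JIS X 0201 "katakana" follows */
        return 2;
    if (*s == 0x8f)     /* SS3: a two-byte code of the second set/plane follows */
        return 3;
    if (*s & 0x80)      /* two-byte code of the primary set/plane */
        return 2;
    return 1;           /* ASCII */
}

/* Byte length of one character in a Shift JIS-style byte stream. */
int sjis_style_mblen(const unsigned char *s)
{
    if (*s >= 0xa1 && *s <= 0xdf)   /* single-byte "katakana" */
        return 1;
    if (*s & 0x80)                  /* lead byte of a two-byte code */
        return 2;
    return 1;                       /* ASCII */
}

/* Nonzero when a two-byte Shift JIS code has a trailing byte in the ASCII range. */
int sjis_trail_is_ascii(const unsigned char *s)
{
    return sjis_style_mblen(s) == 2 && s[1] < 0x80;
}
```

For example, the two bytes 0x83 0x5C form one Shift JIS character whose second byte is indistinguishable from a backslash when scanned bytewise, while no EUC-style multibyte character contains a byte below 0x80.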
o About encoding conversion
I will add encoding conversions among EUC_JIS_2004, SHIFT_JIS_2004 and
UTF-8.
Included are patches against CVS head which should illustrate what
I'm proposing in detail. If there are no objections, I will commit them
along with documentation changes and regression test updates, and bump
the catalog version.
After that I will develop the conversion part (it will take several days).
Comments and suggestions are welcome.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Attachments:
jisx0213.patch (application/octet-stream)
Index: src/backend/utils/mb/encnames.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/mb/encnames.c,v
retrieving revision 1.31
diff -c -r1.31 encnames.c
*** src/backend/utils/mb/encnames.c 14 Jul 2006 14:52:25 -0000 1.31
--- src/backend/utils/mb/encnames.c 22 Mar 2007 03:55:23 -0000
***************
*** 45,50 ****
--- 45,54 ----
}, /* EUC-CN; Extended Unix Code for simplified
* Chinese */
{
+ "eucjis2004", PG_EUC_JIS_2004
+ }, /* EUC-JIS-2004; Extended UNIX Code fixed Width for
+ * Japanese, standard JIS X 0213 */
+ {
"eucjp", PG_EUC_JP
}, /* EUC-JP; Extended UNIX Code fixed Width for
* Japanese, standard OSF */
***************
*** 156,161 ****
--- 160,170 ----
{
"shiftjis", PG_SJIS
}, /* Shift_JIS; JIS X 0202-1991 */
+
+ {
+ "shiftjis2004", PG_SHIFT_JIS_2004
+ }, /* SHIFT-JIS-2004; Shift JIS for
+ * Japanese, standard JIS X 0213 */
{
"sjis", PG_SJIS
}, /* alias for Shift_JIS */
***************
*** 392,397 ****
--- 401,409 ----
"WIN1257", PG_WIN1257
},
{
+ "EUC_JIS_2004", PG_EUC_JIS_2004
+ },
+ {
"SJIS", PG_SJIS
},
{
***************
*** 405,410 ****
--- 417,425 ----
},
{
"GB18030", PG_GB18030
+ },
+ {
+ "SHIFT_JIS_2004", PG_SHIFT_JIS_2004
}
};
Index: src/backend/utils/mb/wchar.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/mb/wchar.c,v
retrieving revision 1.59
diff -c -r1.59 wchar.c
*** src/backend/utils/mb/wchar.c 24 Jan 2007 17:12:17 -0000 1.59
--- src/backend/utils/mb/wchar.c 22 Mar 2007 03:55:25 -0000
***************
*** 1346,1356 ****
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 30; PG_WIN1254 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 31; PG_WIN1255 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 32; PG_WIN1257 */
! {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* 33; PG_SJIS */
! {0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* 34; PG_BIG5 */
! {0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* 35; PG_GBK */
! {0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* 36; PG_UHC */
! {0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 2} /* 37; PG_GB18030 */
};
/* returns the byte length of a word for mule internal code */
--- 1346,1358 ----
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 30; PG_WIN1254 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 31; PG_WIN1255 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 32; PG_WIN1257 */
! {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* 33; PG_EUC_JIS_2004 */
! {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* 34; PG_SJIS */
! {0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* 35; PG_BIG5 */
! {0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* 36; PG_GBK */
! {0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* 37; PG_UHC */
! {0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 2}, /* 38; PG_GB18030 */
! {0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /* 39; PG_SHIFT_JIS_2004 */
};
/* returns the byte length of a word for mule internal code */
Index: src/include/mb/pg_wchar.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/mb/pg_wchar.h,v
retrieving revision 1.70
diff -c -r1.70 pg_wchar.h
*** src/include/mb/pg_wchar.h 24 Dec 2006 00:57:48 -0000 1.70
--- src/include/mb/pg_wchar.h 22 Mar 2007 03:55:27 -0000
***************
*** 187,192 ****
--- 187,193 ----
PG_WIN1254, /* windows-1254 */
PG_WIN1255, /* windows-1255 */
PG_WIN1257, /* windows-1257 */
+ PG_EUC_JIS_2004, /* EUC-JIS-2004 */
/* PG_ENCODING_BE_LAST points to the above entry */
/* followings are for client encoding only */
***************
*** 195,205 ****
PG_GBK, /* GBK (Windows-936) */
PG_UHC, /* UHC (Windows-949) */
PG_GB18030, /* GB18030 */
_PG_LAST_ENCODING_ /* mark only */
} pg_enc;
! #define PG_ENCODING_BE_LAST PG_WIN1257
/*
* Please use these tests before access to pg_encconv_tbl[]
--- 196,207 ----
PG_GBK, /* GBK (Windows-936) */
PG_UHC, /* UHC (Windows-949) */
PG_GB18030, /* GB18030 */
+ PG_SHIFT_JIS_2004, /* Shift-JIS-2004 */
_PG_LAST_ENCODING_ /* mark only */
} pg_enc;
! #define PG_ENCODING_BE_LAST PG_EUC_JIS_2004
/*
* Please use these tests before access to pg_encconv_tbl[]
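The enum change in the patch above relies on ordering: server-capable encodings come first, PG_ENCODING_BE_LAST marks the last of them, and client-only encodings follow. A minimal sketch of that idea is below. The enum is a trimmed-down, hypothetical copy (the real pg_enc has many more entries), and the two check functions are illustrative names, not the macros in pg_wchar.h.

```c
/* Hypothetical, shortened enum; only the ordering idea is taken from the patch. */
typedef enum
{
    ENC_SQL_ASCII = 0,
    ENC_EUC_JP,
    ENC_UTF8,
    ENC_EUC_JIS_2004,       /* last server-capable entry */
    /* entries below are client-only */
    ENC_SJIS,
    ENC_SHIFT_JIS_2004,
    ENC_LAST_MARK           /* marker only, not an encoding */
} enc_id;

/* Boundary marker, playing the role of PG_ENCODING_BE_LAST. */
#define ENC_BE_LAST ENC_EUC_JIS_2004

/* Any real entry may be used as a client encoding. */
int enc_valid_for_client(int enc)
{
    return enc >= 0 && enc < ENC_LAST_MARK;
}

/* Only entries up to the boundary may be used as a server encoding. */
int enc_valid_for_server(int enc)
{
    return enc >= 0 && enc <= ENC_BE_LAST;
}
```

Because the encoding id is also used as an index, inserting a new server encoding before the boundary shifts every later id by one, which is why the table in wchar.c is renumbered in the same patch and must stay in enum order.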
Tatsuo Ishii <ishii@postgresql.org> writes:
I would like to propose adding new character set "JIS X
0213"(http://en.wikipedia.org/wiki/JIS_X_0213).
...
Note that since encoding schema of EUC_JIS_2004 is exactly identical
to EUC_JP, we can reuse existing encoding routines defined in
utls/mb/*.c.
I'm confused. If this is exactly the same as EUC_JP, why do we need
any new code at all? Why not just a documentation addition saying
they are the same thing? Or maybe rename EUC_JP to reflect the new
standard number (we've certainly renamed encodings before).
regards, tom lane
Tatsuo Ishii <ishii@postgresql.org> writes:
I would like to propose adding new character set "JIS X
0213"(http://en.wikipedia.org/wiki/JIS_X_0213).
...
Note that since encoding schema of EUC_JIS_2004 is exactly identical
to EUC_JP, we can reuse existing encoding routines defined in
utls/mb/*.c.
I'm confused. If this is exactly the same as EUC_JP, why do we need
any new code at all? Why not just a documentation addition saying
they are the same thing? Or maybe rename EUC_JP to reflect the new
standard number (we've certainly renamed encodings before).
I said the *encoding scheme* is the same, not that the contents
(character set) are the same. In other words, the characters included
in EUC_JP are not the same as those in EUC_JIS_2004.
Also, EUC_JIS_2004 is *not* a superset of EUC_JP, so we need to let
EUC_JP and EUC_JIS_2004 coexist.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> writes:
I'm confused. If this is exactly the same as EUC_JP, why do we need
any new code at all?
I said *encoding schema" is same, not the contents (character set) is
same. In another word, characters included in EUC_JP are not same as
EUC_JIS_2004.
I'm still confused. If the set of characters is different, then surely
we need at least a different UTF8<->EUC_JIS_2004 conversion function?
regards, tom lane
Tatsuo Ishii <ishii@postgresql.org> writes:
I'm confused. If this is exactly the same as EUC_JP, why do we need
any new code at all?
I said *encoding schema" is same, not the contents (character set) is
same. In another word, characters included in EUC_JP are not same as
EUC_JIS_2004.
I'm still confused. If the set of characters is different, then surely
we need at least a different UTF8<->EUC_JIS_2004 conversion function?
Yes, exactly. I will come up with new conversions later.
After that I will develop conversion part(it will take several days).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> writes:
I'm confused. If this is exactly the same as EUC_JP, why do we need
any new code at all?
I said *encoding schema" is same, not the contents (character set) is
same. In another word, characters included in EUC_JP are not same as
EUC_JIS_2004.
I'm still confused. If the set of characters is different, then surely
we need at least a different UTF8<->EUC_JIS_2004 conversion function?
Yes, exactly. I will come up with new conversions later.
I have committed changes to add JIS X 0213 along with conversions.
New encodings:
EUC_JIS_2004: JIS X 0213 encoded in EUC
SHIFT_JIS_2004: JIS X 0213 encoded in Shift JIS (client only encoding)
These encodings support the following character sets:
ASCII, JIS X 0201 (single byte "katakana"), JIS X 0213 plane 1, 2
New conversions:
EUC_JIS_2004 --> UTF8: euc_jis_2004_to_utf8
UTF8 --> EUC_JIS_2004: utf8_to_euc_jis_2004
SHIFT_JIS_2004 --> UTF8: shift_jis_2004_to_utf8
UTF8 --> SHIFT_JIS_2004: utf8_to_shift_jis_2004
EUC_JIS_2004 --> SHIFT_JIS_2004: euc_jis_2004_to_shift_jis_2004
SHIFT_JIS_2004 --> EUC_JIS_2004: shift_jis_2004_to_euc_jis_2004
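For the EUC_JIS_2004 to SHIFT_JIS_2004 direction, the two-byte plane 1 codes can be mapped arithmetically, using the same row/cell folding that Shift JIS has always used. The sketch below shows that arithmetic only. It is not the committed conversion function: the function name is hypothetical, and SS2 "katakana", plane 2 (SS3) codes and ASCII are deliberately left out.

```c
/*
 * Map one two-byte plane 1 code from EUC form (both bytes 0xA1..0xFE,
 * packed as 0xHHLL) to Shift JIS form. Returns the packed Shift JIS
 * code, or -1 if the input is not such a code.
 * Hypothetical helper for illustration only.
 */
int euc_plane1_to_sjis(unsigned int euc)
{
    int c1 = (int) ((euc >> 8) & 0xff);
    int c2 = (int) (euc & 0xff);
    int j1, j2, s1, s2;

    if (euc > 0xffff || c1 < 0xa1 || c1 > 0xfe || c2 < 0xa1 || c2 > 0xfe)
        return -1;

    j1 = c1 - 0x80;     /* 7-bit row byte, 0x21..0x7E */
    j2 = c2 - 0x80;     /* 7-bit cell byte, 0x21..0x7E */

    /* Two rows share one lead byte; the range 0xA0..0xDF is skipped. */
    s1 = ((j1 - 0x21) >> 1) + (j1 <= 0x5e ? 0x81 : 0xc1);

    if (j1 & 1)         /* odd row: trail bytes 0x40..0x9E, skipping 0x7F */
        s2 = j2 + (j2 >= 0x60 ? 0x20 : 0x1f);
    else                /* even row: trail bytes 0x9F..0xFC */
        s2 = j2 + 0x7e;

    return (s1 << 8) | s2;
}
```

With this, the EUC code 0xA4A2 maps to the Shift JIS code 0x82A0, and 0xA5BD maps to 0x835C.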
To generate the conversion maps, I have created two Perl scripts,
UCS_to_SHIFT_JIS_2004.pl and UCS_to_EUC_JIS_2004.pl, which use
sjis-0213-2004-std.txt and euc-jis-2004-std.txt as the source of
the conversion specification. These files can be obtained freely from
http://x0213.org.
Conversions to UTF-8 from EUC_JIS_2004 and SHIFT_JIS_2004
require supporting UTF-8 "combined characters", i.e. a logical
character that consists of two UTF-8 characters. To implement this, I
have modified LocalToUtf() and UtfToLocal() by adding a new parameter:
the "combined character map".
Documentation changes and regression test changes are committed too.
Beware that I have updated the catalog version. Please do initdb.
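The "combined character" case mentioned above can be sketched as a second lookup table whose entries carry two code points instead of one. Everything in the block is hypothetical (struct, table and function names); it is not the LocalToUtf() interface. The single table entry is, to my understanding, the EUC_JIS_2004 code for JIS X 0213 1-4-87 and should be treated as illustrative.

```c
#include <stddef.h>

/* One entry of a hypothetical "combined character map". */
typedef struct
{
    unsigned int local_code;    /* code in the local encoding */
    unsigned int ucs1;          /* first Unicode code point */
    unsigned int ucs2;          /* second (combining) code point */
} combined_entry;

/* Illustrative entry only: hiragana KA followed by a combining mark. */
static const combined_entry combined_map[] = {
    {0xa4f7, 0x304b, 0x309a}
};

/* Encode one code point up to U+FFFF as UTF-8; returns bytes written. */
static size_t put_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xc0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3f));
        return 2;
    }
    out[0] = (unsigned char) (0xe0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
    out[2] = (unsigned char) (0x80 | (cp & 0x3f));
    return 3;
}

/*
 * If local_code is in the combined map, write both code points as UTF-8
 * into out (at least 6 bytes) and return the total length; otherwise
 * return 0 so the caller can fall back to the ordinary one-to-one map.
 */
size_t combined_to_utf8(unsigned int local_code, unsigned char *out)
{
    size_t i, n;

    for (i = 0; i < sizeof(combined_map) / sizeof(combined_map[0]); i++)
    {
        if (combined_map[i].local_code == local_code)
        {
            n = put_utf8(combined_map[i].ucs1, out);
            n += put_utf8(combined_map[i].ucs2, out + n);
            return n;
        }
    }
    return 0;
}
```

The reverse direction needs the opposite check: after reading one UTF-8 character, the converter has to look ahead to see whether the next one completes a pair in the combined map before falling back to the ordinary table.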
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo,
Related to this, when are we going to get the Japanese po files in the
core distribution?
--Josh
Tatsuo,
Related to this, when are we going to get the Japanese po files in the
core distribution?
No idea. In my understanding, the current message translation system has
a serious problem if a wrong locale and encoding are provided (has this
issue been solved in 8.3?). AFAIK Hiroki Kataoka, chairman of JPUG, has
the same impression. The Japanese po files are managed by JPUG, so it
would be better to ask him or someone from JPUG who is responsible for
the Japanese po files.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
Related to this, when are we going to get the Japanese po files in the
core distribution?
No idea. In my understanding, current message translating system has
serious problem if wrong locale and encoding is provided(has this
issue been solved in 8.3?).
That's certainly true, and it's not solved. But how does keeping the
Japanese po files out of the distribution improve the matter?
regards, tom lane
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
Related to this, when are we going to get the Japanese po files in the
core distribution?
No idea. In my understanding, current message translating system has
serious problem if wrong locale and encoding is provided(has this
issue been solved in 8.3?).
That's certainly true, and it's not solved. But how does keeping the
Japanese po files out of the distribution improve the matter?
Keeping the po files out until the problem is solved is just my opinion.
If JPUG (or the Japanese po file maintainers/volunteers) decide to
include them in the PostgreSQL distribution, I have no right to prevent
it.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Hi.
----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
Related to this, when are we going to get the Japanese po files in the
core distribution?
No idea. In my understanding, current message translating system has
serious problem if wrong locale and encoding is provided(has this
issue been solved in 8.3?).
That's certainly true, and it's not solved. But how does keeping the
Japanese po files out of the distribution improve the matter?
We have been providing the support, including handling the troubles. JPUG
was thought to be the preferable place because the problems are too
peculiar to Japan. The support system led by Honda-san, the
representative of the document team, has worked well enough up to now.
However, we do not refuse distribution with the main body.
The document team should discuss it again, because it would strengthen
the effort to match the release schedule of the main body.
Anyway, please wait a while for the response from Honda-san.
Regards,
Hiroshi Saito
Hiroshi,
We are doing the support including the trouble. It was thought that the
place of JPUG was preferable for the reasons why they were problems too
peculiar to Japan.
Well, some of PostgreSQL's commercial distributors have been pretty
surprised when they package PostgreSQL and find out that the main
distribution has no Japanese support (I know because I get the confused
emails). I've an open offer from the Sun i18N people to help with this,
if they can coordinate with you.
--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco
Hi Josh-san.
From: "Josh Berkus" <josh@agliodbs.com>
Hiroshi,
We are doing the support including the trouble. It was thought that the
place of JPUG was preferable for the reasons why they were problems too
peculiar to Japan.
Well, some of PostgreSQL's commercial distributors have been pretty
surprised when they package PostgreSQL and find out that the main
distribution has no Japanese support (I know because I get the confused
emails). I've an open offer from the Sun i18N people to help with this,
if they can coordinate with you.
Ahh yes, certainly, an offer from Sun Japan. :-)
Up to now, the support has been done by volunteers together with
Honda-san. It would be wonderful if Solaris promoted the spread of
PostgreSQL!
The resources are offered in a place that JPUG has opened to the public.
I think Sun Japan obtained the know-how there and avoided the problem.
Although we are volunteers, we make the utmost effort to support the
resources, and the satisfactory results prove it!
Maybe the problem is the release speed... However, it might be the same
even if the files were put in the official place...
Regards,
Hiroshi Saito
Tatsuo Ishii wrote:
Related to this, when are we going to get the Japanese po files in the
core distribution?
No idea. In my understanding, current message translating system has
serious problem if wrong locale and encoding is provided(has this
issue been solved in 8.3?).
That's certainly true, and it's not solved. But how does keeping the
Japanese po files out of the distribution improve the matter?
Keeping out po files until the problem is solved is just my opinion.
Regrettably, I am of the same opinion. Including the Japanese po files
without some improvement would cause unnecessary trouble.
--
Hiroki Kataoka <kataoka@interwiz.jp>