again: Bug #943: Server-Encoding from EUC_TW to UTF-8 doesn't work

Started by Enke, Michael over 22 years ago, 9 messages
#1 Enke, Michael
michael.enke@wincor-nixdorf.com

Hello,
I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
Now I upgraded to 7.3.3 and I'm not happy with this.
The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:

Copy to table (DB has UTF-8 encoding) from file:
for PGCLIENTENCODING=BIG5:
WARNING: copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

for EUC_TW
WARNING: copy: line 1, LocalToUtf: could not convert (0x8ea3c3b7) EUC_TW to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0x8ea3cfd0) EUC_TW to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0x8ea3c4ce) EUC_TW to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0x8ea3bdfe) EUC_TW to UTF-8. Ignored

Copy out to file from table (UTF-8 data):
to BIG5
WARNING: UtfToLocal: could not convert UTF-8 (0xe7a281). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe98ab9). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe8a38f). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe7b2a7). Ignored

to EUC_TW is ok!

Regards,
Michael

#2 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Enke, Michael (#1)
Re: [HACKERS] again: Bug #943: Server-Encoding from EUC_TW to

Hello,
I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
Now I upgraded to 7.3.3 and I'm not happy with this.
The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:

Copy to table (DB has UTF-8 encoding) from file:
for PGCLIENTENCODING=BIG5:
WARNING: copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

I see no problem here. The only standard conversion map I could find
in on-line form so far (see the URL below) does not include entries 0xf9d6 or
above.

http://www.unicode.org/Public/UNIDATA/Unihan.txt

for EUC_TW
WARNING: copy: line 1, LocalToUtf: could not convert (0x8ea3c3b7) EUC_TW to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0x8ea3cfd0) EUC_TW to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0x8ea3c4ce) EUC_TW to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0x8ea3bdfe) EUC_TW to UTF-8. Ignored

Hum. These seem to be CNS 11643-1993, plane 3. Currently PostgreSQL
supports only:

CNS 11643-1993, plane 0
CNS 11643-1993, plane 1
CNS 11643-1993, plane 2
CNS 11643-1993, plane 15

Would you like to have support for the rest of the CNS 11643-1993 planes:

CNS 11643-1993, plane 3
CNS 11643-1993, plane 4
CNS 11643-1993, plane 5
CNS 11643-1993, plane 6
CNS 11643-1993, plane 7

in the upcoming 7.4?
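
The plane number can be read straight off the second byte of each four-byte code in the EUC_TW warnings. A minimal sketch in Python, assuming the usual EUC-TW layout (a 0x8E single-shift byte, then 0xA0 plus the plane number, then row and column bytes each offset by 0xA0). The function name `euc_tw_plane` is hypothetical and is not part of PostgreSQL:

```python
def euc_tw_plane(code: int) -> tuple:
    """Split a four-byte EUC_TW code into (plane, row, column).

    Assumes the layout 0x8E, 0xA0 + plane, 0xA0 + row, 0xA0 + column.
    """
    b = code.to_bytes(4, "big")
    if b[0] != 0x8E:
        raise ValueError("not a four-byte EUC_TW sequence (missing 0x8E)")
    plane = b[1] - 0xA0
    row = b[2] - 0xA0
    col = b[3] - 0xA0
    return plane, row, col


# The four codes reported by the COPY warnings in this thread.
for code in (0x8EA3C3B7, 0x8EA3CFD0, 0x8EA3C4CE, 0x8EA3BDFE):
    print(hex(code), euc_tw_plane(code))
```

All four codes carry 0xA3 as their second byte, so under this layout they all fall in plane 3.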

Copy out to file from table (UTF-8 data):
to BIG5
WARNING: UtfToLocal: could not convert UTF-8 (0xe7a281). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe98ab9). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe8a38f). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe7b2a7). Ignored

to EUC_TW is ok!

BIG5 and EUC_TW have different code points. So this is not very strange.
--
Tatsuo Ishii

#3 Enke, Michael
michael.enke@wincor-nixdorf.com
In reply to: Enke, Michael (#1)
Re: [HACKERS] again: Bug #943: Server-Encoding from EUC_TW to UTF-8

Tatsuo Ishii wrote:

Hello,
I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
Now I upgraded to 7.3.3 and I'm not happy with this.
The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:

Copy to table (DB has UTF-8 encoding) from file:
for PGCLIENTENCODING=BIG5:
WARNING: copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

I see no problem here. The only standard conversion map I could find
in on-line form so far (see the URL below) does not include entries 0xf9d6 or
above.

Sorry, I do not know anything about conversion maps and CNS 11643-1993 planes.
I only got a file in BIG5 encoding from Taiwan and found that it is not possible
to load all of the text into PostgreSQL 7.3.3.
But it is possible to convert it to UTF-8 with the iconv tool from glibc (Linux).
It would be good if the next release supported today's BIG5.

Michael

#4 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Enke, Michael (#3)
Re: [HACKERS] again: Bug #943: Server-Encoding from EUC_TW to UTF-8

Copy to table (DB has UTF-8 encoding) from file:
for PGCLIENTENCODING=BIG5:
WARNING: copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

I see no problem here. The only standard conversion map I could find
in on-line form so far (see the URL below) does not include entries 0xf9d6 or
above.

Sorry, I do not know anything about conversion maps and CNS 11643-1993 planes.
I only got a file in BIG5 encoding from Taiwan and found that it is not possible
to load all of the text into PostgreSQL 7.3.3.
But it is possible to convert it to UTF-8 with the iconv tool from glibc (Linux).
It would be good if the next release supported today's BIG5.

I'm not looking forward to add any conversion entries confirmed by
standards. Can someone explain to me the current status of the
conversion maps between BIG5 and Unicode? The only info I could find
so far is on www.unicode.org.
--
Tatsuo Ishii

#5 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#4)
Re: [HACKERS] again: Bug #943: Server-Encoding from EUC_TW

Copy to table (DB has UTF-8 encoding) from file:
for PGCLIENTENCODING=BIG5:
WARNING: copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

I see no problem here. The only standard conversion map I could find
in on-line form so far (see the URL below) does not include entries 0xf9d6 or
above.

Sorry, I do not know anything about conversion maps and CNS 11643-1993 planes.
I only got a file in BIG5 encoding from Taiwan and found that it is not possible
to load all of the text into PostgreSQL 7.3.3.
But it is possible to convert it to UTF-8 with the iconv tool from glibc (Linux).
It would be good if the next release supported today's BIG5.

I'm not looking forward to add any conversion entries confirmed by
standards. Can someone explain to me the current status of the

Oops. above should be:

I'm not looking forward to add any conversion entries NOT confirmed by
standards.

--
Tatsuo Ishii

#6 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Enke, Michael (#1)
Re: [HACKERS] again: Bug #943: Server-Encoding from EUC_TW to UTF-8

I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
Now I upgraded to 7.3.3 and I'm not happy with this.
The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:

Copy to table (DB has UTF-8 encoding) from file:
for PGCLIENTENCODING=BIG5:
WARNING: copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

I see no problem here. The only standard conversion map I could find
in on-line form so far (see the URL below) does not include entries 0xf9d6 or
above.

http://www.unicode.org/Public/UNIDATA/Unihan.txt

I found in this file:
U+F9D7 in line 604519
U+F9D8 in line 219540
U+F9D6...U+F9DB in lines 730707...730766.

No. U+F9D6 is a *Unicode* code point, not a BIG5 code point.
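
To make the distinction concrete, the UTF-8 byte sequences from the UtfToLocal warnings can be decoded to their Unicode code points and compared with U+F9D6. A small sketch in Python; the helper name `utf8_to_code_point` is hypothetical, and the snippet only uses the standard library:

```python
import unicodedata

# UTF-8 byte sequences from the UtfToLocal warnings in this thread.
UTF8_SAMPLES = [0xE7A281, 0xE98AB9, 0xE8A38F, 0xE7B2A7]


def utf8_to_code_point(value: int) -> int:
    """Decode a three-byte UTF-8 sequence given as an integer."""
    char = value.to_bytes(3, "big").decode("utf-8")
    return ord(char)


for value in UTF8_SAMPLES:
    print(f"0x{value:x} -> U+{utf8_to_code_point(value):04X}")

# U+F9D6 is a Unicode code point in its own right; it is unrelated to
# the BIG5 byte pair 0xF9 0xD6 that the COPY warning reports.
print(unicodedata.name(chr(0xF9D6)))
```

The four sequences decode to U+7881, U+92B9, U+88CF and U+7CA7, none of which is in the U+F9D6...U+F9DB range found by searching Unihan.txt for the hexadecimal digits.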

for EUC_TW
WARNING: copy: line 1, LocalToUtf: could not convert (0x8ea3c3b7) EUC_TW to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0x8ea3cfd0) EUC_TW to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0x8ea3c4ce) EUC_TW to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0x8ea3bdfe) EUC_TW to UTF-8. Ignored

Hum. These seem to be CNS 11643-1993, plane 3. Currently PostgreSQL
supports only:

CNS 11643-1993, plane 0
CNS 11643-1993, plane 1
CNS 11643-1993, plane 2
CNS 11643-1993, plane 15

Would you like to have support for the rest of the CNS 11643-1993 planes:

CNS 11643-1993, plane 3
CNS 11643-1993, plane 4
CNS 11643-1993, plane 5
CNS 11643-1993, plane 6
CNS 11643-1993, plane 7

in the upcoming 7.4?

Copy out to file from table (UTF-8 data):
to BIG5
WARNING: UtfToLocal: could not convert UTF-8 (0xe7a281). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe98ab9). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe8a38f). Ignored
WARNING: UtfToLocal: could not convert UTF-8 (0xe7b2a7). Ignored

to EUC_TW is ok!

BIG5 and EUC_TW have different code points. So this is not very strange.

But it is very strange that I can (for EUC_TW) copy to file without error but I can not copy from file without error.

Michael

#7 Enke, Michael
michael.enke@wincor-nixdorf.com
In reply to: Enke, Michael (#1)
Re: [HACKERS] again: Bug #943: Server-Encoding from EUC_TW

Tatsuo Ishii wrote:

I reported bug #943 (I found in 7.3.2) and you checked in some change against integer overflow.
Now I upgraded to 7.3.3 and I'm not happy with this.
The exact error as I described is fixed, but I found new errors in conversion UTF-8 <-> EUC_TW and BIG5:

Copy to table (DB has UTF-8 encoding) from file:
for PGCLIENTENCODING=BIG5:
WARNING: copy: line 1, LocalToUtf: could not convert (0xf9d6) BIG5 to UTF-8. Ignored
WARNING: copy: line 2, LocalToUtf: could not convert (0xf9d7) BIG5 to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0xf9d8) BIG5 to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0xf9db) BIG5 to UTF-8. Ignored

I see no problem here. The only standard conversion map I could find
in on-line form so far (see the URL below) does not include entries 0xf9d6 or
above.

http://www.unicode.org/Public/UNIDATA/Unihan.txt

I found in this file:
U+F9D7 in line 604519
U+F9D8 in line 219540
U+F9D6...U+F9DB in lines 730707...730766.

No. U+F9D6 is a *Unicode* code point, not a BIG5 code point.

Ok.
I have looked into my Linux box and found this in /usr/share/i18n/charmaps/BIG5.gz:
% Chinese charmap for BIG5 (CP950)
% version: 0.92
% Contact: Tung-Han Hsieh <thhsieh@linux.org.tw>
% Yuan-Chung Cheng <platin@ms31.hinet.net>
% Distribution and use is free, even for comercial purpose.
%
% This charmap is converted from:
% ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
% ...

"My" characters are in there.
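
Whether a given byte pair decodes depends on which mapping table is consulted. The sketch below uses Python's built-in `big5` and `cp950` codecs purely as stand-ins for two different tables; they are not the tables used by PostgreSQL or glibc, and the helper names `try_decode` and `report` are hypothetical:

```python
from typing import Optional

# Byte pairs taken from the BIG5 warnings in this thread.
SAMPLES = [b"\xf9\xd6", b"\xf9\xd7", b"\xf9\xd8", b"\xf9\xdb"]


def try_decode(raw: bytes, codec: str) -> Optional[str]:
    """Return the decoded text, or None if the codec has no mapping."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


def report(samples=SAMPLES, codecs=("big5", "cp950")):
    """Return (hex bytes, {codec: decoded text or None}) for each sample."""
    rows = []
    for raw in samples:
        rows.append((raw.hex(), {c: try_decode(raw, c) for c in codecs}))
    return rows


for hex_bytes, results in report():
    print(hex_bytes, results)
```

A `None` in one column next to a character in the other would show the same kind of disagreement between tables that this thread is about.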

Don't you agree that it is strange that I can (for EUC_TW) copy "to" file without error
but I can not copy "from" file without error?

Michael

#8 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Enke, Michael (#7)
Re: [HACKERS] again: Bug #943: Server-Encoding from EUC_TW

I have looked into my Linux box and found this in /usr/share/i18n/charmaps/BIG5.gz:
% Chinese charmap for BIG5 (CP950)
% version: 0.92
% Contact: Tung-Han Hsieh <thhsieh@linux.org.tw>
% Yuan-Chung Cheng <platin@ms31.hinet.net>
% Distribution and use is free, even for comercial purpose.
%
% This charmap is converted from:
% ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
% ...

"My" characters are in there.

That's M$'s definition, not a standard. I think there must be a
reason why the Unicode org. does not use it.

Don't you agree that it is strange that I can (for EUC_TW) copy "to" file without error
but I can not copy "from" file without error?

I'm not quite sure what you are saying. Are you complaining that (for
example) 0xe7a281 in UTF-8 does not convert to EUC_TW?

BTW, what do you think about the proposal below?

FYI, CNS 11643-1993 is the standard character set and EUC_TW is
one of its encodings. That means your problem below would disappear.

WARNING: copy: line 2, LocalToUtf: could not convert (0x8ea3cfd0) EUC_TW to UTF-8. Ignored
WARNING: copy: line 3, LocalToUtf: could not convert (0x8ea3c4ce) EUC_TW to UTF-8. Ignored
WARNING: copy: line 4, LocalToUtf: could not convert (0x8ea3bdfe) EUC_TW to UTF-8. Ignored

Hum. These seem to be CNS 11643-1993, plane 3. Currently PostgreSQL
supports only:

CNS 11643-1993, plane 0
CNS 11643-1993, plane 1
CNS 11643-1993, plane 2
CNS 11643-1993, plane 15

Would you like to have support for the rest of the CNS 11643-1993 planes:

CNS 11643-1993, plane 3
CNS 11643-1993, plane 4
CNS 11643-1993, plane 5
CNS 11643-1993, plane 6
CNS 11643-1993, plane 7

in the upcoming 7.4?

--
Tatsuo Ishii

#9 Enke, Michael
michael.enke@wincor-nixdorf.com
In reply to: Tatsuo Ishii (#6)
Re: [HACKERS] again: Bug #943: Server-Encoding from

Tatsuo Ishii wrote:

I have looked into my Linux box and found this in /usr/share/i18n/charmaps/BIG5.gz:
% Chinese charmap for BIG5 (CP950)
% version: 0.92
% Contact: Tung-Han Hsieh <thhsieh@linux.org.tw>
% Yuan-Chung Cheng <platin@ms31.hinet.net>
% Distribution and use is free, even for comercial purpose.
%
% This charmap is converted from:
% ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
% ...

"My" characters are in there.

That's M$'s definition, not a standard. I think there must be a
reason why the Unicode org. does not use it.

Ok, I do not know the reason. But since glibc also uses it, couldn't you use it too?
I believe the glibc developers have thought about this a lot, and they came to the
conclusion to use this definition. Why not PostgreSQL?

Don't you agree that it is strange that I can (for EUC_TW) copy "to" file without error
but I can not copy "from" file without error?

I'm not quite sure what you are saying. Are you complaining that (for
example) 0xe7a281 in UTF-8 does not convert to EUC_TW?

Yes, exactly, since this value comes from a "copy to" with PGCLIENTENCODING=EUC_TW.

BTW, what do you think about the proposal below?

FYI, CNS 11643-1993 is the standard character set and EUC_TW is
one of its encodings. That means your problem below would disappear.

Ok.

Regards,
Michael
