Encoding issues
After receiving a request to add ISO 8859-15 and 16, I reviewed the multibyte
support code and found several errors in it.
1) There is confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
"ISO8859-5" (or whatever) as the encoding name? I think we should.
For your information, here is the correct mapping between ISO
8859-n and LATINn.
ISO 8859-1 LATIN1
ISO 8859-2 LATIN2
ISO 8859-3 LATIN3
ISO 8859-4 LATIN4
ISO 8859-9 LATIN5
ISO 8859-10 LATIN6
2) The leading characters for some Cyrillic charsets are wrong.
Currently they are defined as:
#define LC_KOI8_R 0x8c /* Cyrillic KOI8-R */
#define LC_KOI8_U 0x8c /* Cyrillic KOI8-U */
#define LC_ISO8859_5 0x8d /* ISO8859 Cyrillic */
These should be:
#define LC_KOI8_R 0x8b /* Cyrillic KOI8-R */
#define LC_KOI8_U 0x8b /* Cyrillic KOI8-U */
#define LC_ISO8859_5 0x8c /* ISO8859 Cyrillic */
Correcting them would affect users who store their data in the
database using MULE internal code. I think there are quite few
people using MULE internal code, so we could correct them for 7.2.
Comments?
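To make the impact concrete, here is a minimal sketch of how a leading byte tags a single-byte charset in a MULE-style internal form. The helper name sb_to_mic is hypothetical and is not the actual PostgreSQL routine; it assumes that ASCII bytes are stored as-is and that bytes with the high bit set are prefixed with the charset's leading byte.

```c
#include <assert.h>
#include <stddef.h>

/* Corrected leading bytes, as proposed in this message. */
#define LC_KOI8_R    0x8b   /* Cyrillic KOI8-R */
#define LC_ISO8859_5 0x8c   /* ISO8859 Cyrillic */

/*
 * Hypothetical helper (an assumption, not PostgreSQL's own code):
 * convert a single-byte encoded string to a MULE-style internal form.
 * ASCII bytes are copied unchanged; bytes with the high bit set are
 * prefixed with the leading byte lc. Returns the number of bytes
 * written; dst must have room for 2 * len bytes.
 */
static size_t
sb_to_mic(const unsigned char *src, size_t len,
          unsigned char *dst, unsigned char lc)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++)
    {
        if (src[i] & 0x80)
            dst[n++] = lc;      /* leading byte identifies the charset */
        dst[n++] = src[i];
    }
    return n;
}
```

Under this assumption, KOI8-R data stored with the old definitions carries 0x8c as its leading byte. After the correction 0x8c means ISO 8859-5, so such data would be read back as the wrong charset unless it is dumped and reloaded.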
BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
some requests to do that. Supporting them is actually trivial work and
should be a one-day job. The harder part is writing conversion functions
between encodings. However, there is very little demand for that, I
guess. If so, we could omit the conversion capability for 7.2.
Comments?
--
Tatsuo Ishii
1) There is confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
"ISO8859-5" (or whatever) as the encoding name? I think we should.
For your information, here is the correct mapping between ISO
8859-n and LATINn.
ISO 8859-1 LATIN1
ISO 8859-2 LATIN2
ISO 8859-3 LATIN3
ISO 8859-4 LATIN4
ISO 8859-9 LATIN5
ISO 8859-10 LATIN6
I just found additions:
ISO 8859-13 LATIN7
ISO 8859-14 LATIN8
ISO 8859-15 LATIN9
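As a rough illustration, the mapping listed so far could be kept in a small lookup table. This is only a sketch: the struct and function names are hypothetical and do not come from the PostgreSQL sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One LATINn alias and the part number n of ISO 8859-n it stands for. */
struct latin_alias
{
    const char *latin;
    int         iso_part;
};

/* Entries taken from the lists in this thread. */
static const struct latin_alias latin_aliases[] = {
    {"LATIN1", 1}, {"LATIN2", 2}, {"LATIN3", 3},
    {"LATIN4", 4}, {"LATIN5", 9}, {"LATIN6", 10},
    {"LATIN7", 13}, {"LATIN8", 14}, {"LATIN9", 15},
};

/* Return n such that name is an alias of ISO 8859-n, or -1 if unknown. */
static int
latin_to_iso8859_part(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof latin_aliases / sizeof latin_aliases[0]; i++)
    {
        if (strcmp(name, latin_aliases[i].latin) == 0)
            return latin_aliases[i].iso_part;
    }
    return -1;
}
```

For example, latin_to_iso8859_part("LATIN5") gives 9, not 5, which is exactly the confusion described in item 1.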
--
Tatsuo Ishii
On Wed, Oct 10, 2001 at 03:40:25PM +0900, Tatsuo Ishii wrote:
After receiving a request to add ISO 8859-15 and 16, I reviewed the multibyte
support code and found several errors in it.
1) There is confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
"ISO8859-5" (or whatever) as the encoding name? I think we should.
For your information, here is the correct mapping between ISO
8859-n and LATINn.
ISO 8859-1 LATIN1
ISO 8859-2 LATIN2
ISO 8859-3 LATIN3
ISO 8859-4 LATIN4
ISO 8859-9 LATIN5
ISO 8859-10 LATIN6
You are right. I just looked at some old versions of PostgreSQL, and
this confusion is there in some headers and comments too.
2) The leading characters for some Cyrillic charsets are wrong.
Currently they are defined as:
#define LC_KOI8_R 0x8c /* Cyrillic KOI8-R */
#define LC_KOI8_U 0x8c /* Cyrillic KOI8-U */
#define LC_ISO8859_5 0x8d /* ISO8859 Cyrillic */
These should be:
#define LC_KOI8_R 0x8b /* Cyrillic KOI8-R */
#define LC_KOI8_U 0x8b /* Cyrillic KOI8-U */
#define LC_ISO8859_5 0x8c /* ISO8859 Cyrillic */
Again, this has been in the sources for a long time too (interestingly,
we have not seen any bug report about it).
Correcting them would affect users who store their data in the
database using MULE internal code. I think there are quite few
people using MULE internal code, so we could correct them for 7.2.
Comments?
I agree with you; making a release with known bugs is a dirty thing.
BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
some requests to do that. Supporting them is actually trivial work and
should be a one-day job. The harder part is writing conversion functions
between encodings. However, there is very little demand for that, I
guess. If so, we could omit the conversion capability for 7.2.
Comments?
You will hear "we are in the feature freeze state.." :-)
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
Tatsuo Ishii writes:
BTW, should we support ISO 8859-6 and beyond for 7.2?
If possible we should. Otherwise people might spread the word that
PostgreSQL is not ready for the Euro.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
* Tatsuo Ishii <t-ishii@sra.co.jp> [011010 18:21]:
After receiving a request to add ISO 8859-15 and 16, I reviewed the multibyte
support code and found several errors in it.
1) There is confusion between "LATIN5" and ISO 8859-5. LATIN5 is not
ISO 8859-5, but is actually ISO 8859-9. Should we rename LATIN5 to
"ISO8859-5" (or whatever) as the encoding name? I think we should.
For your information, here is the correct mapping between ISO
8859-n and LATINn.
ISO 8859-1 LATIN1
ISO 8859-2 LATIN2
ISO 8859-3 LATIN3
ISO 8859-4 LATIN4
ISO 8859-9 LATIN5
ISO 8859-10 LATIN6
ISO-8859-14 LATIN 8
ISO-8859-15 LATIN 9 or LATIN 0
ISO-8859-16 LATIN 10
:)
2) The leading characters for some Cyrillic charsets are wrong.
Currently they are defined as:
#define LC_KOI8_R 0x8c /* Cyrillic KOI8-R */
#define LC_KOI8_U 0x8c /* Cyrillic KOI8-U */
#define LC_ISO8859_5 0x8d /* ISO8859 Cyrillic */
These should be:
#define LC_KOI8_R 0x8b /* Cyrillic KOI8-R */
#define LC_KOI8_U 0x8b /* Cyrillic KOI8-U */
#define LC_ISO8859_5 0x8c /* ISO8859 Cyrillic */
Correcting them would affect users who store their data in the
database using MULE internal code. I think there are quite few
people using MULE internal code, so we could correct them for 7.2.
Comments?
BTW, should we support ISO 8859-6 and beyond for 7.2? There have been
some requests to do that. Supporting them is actually trivial work and
should be a one-day job. The harder part is writing conversion functions
between encodings. However, there is very little demand for that, I
guess. If so, we could omit the conversion capability for 7.2.
Comments?
I think ISO-8859-15 and 16 are important, if only because they are the
only two encodings which support the Euro (not speaking of Unicode, of
course!), and at least ISO-8859-15 has some official status in
Western Europe (on Unix systems at least... Windows users have their
own table where the Euro sign is stored somewhere else, I think at
0x80).
I have done the conversion for the mappings to and from Unicode, but
you could get the original tables at:
http://www.unicode.org/Public/MAPPINGS/ISO8859/
(you can get ISO-8859-10, 13 and 14 there as well! 10 is supposed to
be for Greenlandic and Sámi, 13 for the Baltic rim, and 14 for Gaelic)
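For illustration, here is a minimal sketch of what the ISO-8859-15 to Unicode direction looks like. To my knowledge ISO-8859-15 differs from ISO-8859-1 at only eight byte values, so a switch is enough for a sketch. The function names are hypothetical, and this is not the conversion code mentioned above.

```c
#include <assert.h>

/*
 * Hypothetical helper: map one ISO-8859-15 byte to its Unicode code
 * point. Every byte other than the eight listed here maps to the code
 * point with the same value, as in ISO-8859-1.
 */
static unsigned int
latin9_to_ucs(unsigned char c)
{
    switch (c)
    {
        case 0xA4: return 0x20AC;   /* EURO SIGN */
        case 0xA6: return 0x0160;   /* S WITH CARON, capital */
        case 0xA8: return 0x0161;   /* s with caron, small */
        case 0xB4: return 0x017D;   /* Z WITH CARON, capital */
        case 0xB8: return 0x017E;   /* z with caron, small */
        case 0xBC: return 0x0152;   /* LIGATURE OE, capital */
        case 0xBD: return 0x0153;   /* ligature oe, small */
        case 0xBE: return 0x0178;   /* Y WITH DIAERESIS, capital */
        default:   return c;        /* identical to ISO-8859-1 */
    }
}

/* Tiny self-check for the entries most likely to matter. */
static void
latin9_selfcheck(void)
{
    assert(latin9_to_ucs(0xA4) == 0x20AC);
    assert(latin9_to_ucs('A') == 'A');
}
```

A real conversion would more likely be generated from the mapping tables at the URL above than written by hand.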
Just found on google the following link, where you can see quite a few
charsets (it doesn't have -16, too new probably) :
http://www.kostis.net/charsets/
Patrice
--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/