UTF8 or Unicode
Should our multi-byte encoding be referred to as UTF8 or Unicode?
I know UTF8 is a type of unicode but do we need to rename anything from
Unicode to UTF8?
Someone asked me via private email.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
At 2005-02-14 21:14:54 -0500, pgman@candle.pha.pa.us wrote:
Should our multi-byte encoding be referred to as UTF8 or Unicode?
The *encoding* should certainly be referred to as UTF-8. Unicode is a
character set, not an encoding; Unicode characters may be encoded with
UTF-8, among other things.
(One might think of a charset as being a set of integers representing
characters, and an encoding as specifying how those integers may be
converted to bytes.)
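As a concrete illustration of the charset/encoding distinction described above, here is a minimal Python sketch (not part of the original message; the sample character is an arbitrary choice). It shows a single Unicode code point being turned into different byte sequences by different encodings.

```python
# One Unicode character: the character set assigns it an integer (code point).
ch = "\u00e9"            # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)     # the integer the character set assigns

# Different encodings turn that same integer into different bytes.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")
latin1_bytes = ch.encode("latin-1")

print(hex(code_point), utf8_bytes, utf16_bytes, latin1_bytes)
```

The code point stays the same in every case; only the byte representation changes, which is why "Unicode" alone does not name an encoding.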
I know UTF8 is a type of unicode but do we need to rename anything
from Unicode to UTF8?
I don't know. I'll go through the documentation to see if I can find
anything that needs changing.
-- ams
Abhijit Menon-Sen wrote:
At 2005-02-14 21:14:54 -0500, pgman@candle.pha.pa.us wrote:
Should our multi-byte encoding be referred to as UTF8 or Unicode?
The *encoding* should certainly be referred to as UTF-8. Unicode is a
character set, not an encoding; Unicode characters may be encoded with
UTF-8, among other things.
(One might think of a charset as being a set of integers representing
characters, and an encoding as specifying how those integers may be
converted to bytes.)
I know UTF8 is a type of unicode but do we need to rename anything
from Unicode to UTF8?
I don't know. I'll go through the documentation to see if I can find
anything that needs changing.
I looked at encoding.sgml and that mentions Unicode, and then UTF8 as an
acronym. I am wondering if we need to make UTF8 first and Unicode
second. Does initdb accept UTF8 as an encoding?
--
Bruce Momjian
On Feb 14, 2005, at 9:27 PM, Abhijit Menon-Sen wrote:
I know UTF8 is a type of unicode but do we need to rename anything
from Unicode to UTF8?
I don't know. I'll go through the documentation to see if I can find
anything that needs changing.
It's not the documentation that is wrong. Specifying the database
"encoding" as "Unicode" is simply a bug (see initdb). What if
PostgreSQL supports UTF-16 in the future? What would you call it?
The backend protocol also uses "UNICODE" when specifying the
encoding. All the other encoding names are specified correctly AFAICS.
I brought this up before:
http://archives.postgresql.org/pgsql-hackers/2004-10/msg00811.php
We could make UTF8 the canonical form in the aliasing mechanism, but
beta 4 is a bit late to come up with this kind of idea.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On Mon, 2005-02-14 at 22:05 -0500, Bruce Momjian wrote:
Abhijit Menon-Sen wrote:
At 2005-02-14 21:14:54 -0500, pgman@candle.pha.pa.us wrote:
Should our multi-byte encoding be referred to as UTF8 or Unicode?
The *encoding* should certainly be referred to as UTF-8. Unicode is a
character set, not an encoding; Unicode characters may be encoded with
UTF-8, among other things.
(One might think of a charset as being a set of integers representing
characters, and an encoding as specifying how those integers may be
converted to bytes.)
I know UTF8 is a type of unicode but do we need to rename anything
from Unicode to UTF8?
I don't know. I'll go through the documentation to see if I can find
anything that needs changing.
I looked at encoding.sgml and that mentions Unicode, and then UTF8 as an
acronym. I am wondering if we need to make UTF8 first and Unicode
second. Does initdb accept UTF8 as an encoding?
in PG: unicode = utf8 = utf-8
Our internal routines in src/backend/utils/mb/encnames.c accept all
synonyms. The "official" internal PG name for UTF-8 is "UNICODE" :-(
For historical reasons UTF8 = UNICODE, because "UNICODE" was there
first. It's the same as "WIN" for WIN1251 (in the sources it's marked as
a "_dirty_ alias")...
I think initdb uses pg_char_to_encoding() from
src/backend/utils/mb/encnames.c, so it should accept all aliases.
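The alias behaviour described here can be sketched roughly as follows. This is a simplified Python illustration, not the code in encnames.c: the table holds only names mentioned in this thread, the function names are hypothetical, and it assumes names are compared after lower-casing and dropping punctuation, which is what would make "utf-8" and "utf8" equivalent.

```python
import re

# Hypothetical alias table: cleaned spelling -> canonical name.
# Only names that appear in this thread are listed.
ALIASES = {
    "unicode": "UNICODE",
    "utf8": "UNICODE",
    "latin1": "LATIN1",
    "iso88591": "LATIN1",
}

def clean_name(name: str) -> str:
    """Lower-case the name and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def lookup_encoding(name: str):
    """Return the canonical name for any accepted spelling, or None if unknown."""
    return ALIASES.get(clean_name(name))

print(lookup_encoding("UTF-8"), lookup_encoding("iso-8859-1"))
```

With a table like this, every synonym is accepted on input, but only one name per encoding is ever reported back, which is why the choice of canonical name matters.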
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
Am Dienstag, 15. Februar 2005 10:22 schrieb Karel Zak:
in PG: unicode = utf8 = utf-8
Our internal routines in src/backend/utils/mb/encnames.c accept all
synonyms. The "official" internal PG name for UTF-8 is "UNICODE" :-(
I think in the SQL standard the official name is UTF8. If someone wants to
verify that this is the case and is exactly the encoding we offer (perhaps
modulo the 0x10000 issue), then it might make sense to change the canonical
form to UTF8.
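Assuming the "0x10000 issue" refers to code points at or above U+10000, the short Python sketch below shows why that value is a boundary: UTF-8 needs four bytes from there on, while every code point below it fits in at most three. The helper name is hypothetical and the sample values are chosen only to sit on the length boundaries.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a single code point."""
    return len(chr(code_point).encode("utf-8"))

# Values on either side of each length boundary (surrogates avoided).
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

An implementation that handles only sequences of up to three bytes would therefore cover everything below U+10000 and nothing above it.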
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On Tue, 2005-02-15 at 14:33 +0100, Peter Eisentraut wrote:
Am Dienstag, 15. Februar 2005 10:22 schrieb Karel Zak:
in PG: unicode = utf8 = utf-8
Our internal routines in src/backend/utils/mb/encnames.c accept all
synonyms. The "official" internal PG name for UTF-8 is "UNICODE" :-(
I think in the SQL standard the official name is UTF8. If someone wants to
verify that this is the case and is exactly the encoding we offer (perhaps
modulo the 0x10000 issue), then it might make sense to change the canonical
form to UTF8.
Yes, I think we should fix it and remove UNICODE and WIN encoding names
from PG code.
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
Karel Zak wrote:
Yes, I think we should fix it and remove UNICODE and WIN encoding names
from PG code.
The JDBC driver asks for a UNICODE client encoding before it knows the
server version it is talking to. How do you avoid breaking this?
-O
On Sat, 2005-02-19 at 00:27 +1300, Oliver Jowett wrote:
Karel Zak wrote:
Yes, I think we should fix it and remove UNICODE and WIN encoding names
from PG code.
The JDBC driver asks for a UNICODE client encoding before it knows the
server version it is talking to. How do you avoid breaking this?
Fix the JDBC driver as soon as possible.
Add to the 8.1 release notes: the encoding names 'UNICODE' and 'WIN' are
deprecated and will be removed in the next release. Please use the correct
names "UTF-8" and "WIN1251".
8.2: remove them.
OK?
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
Add to the 8.1 release notes: the encoding names 'UNICODE' and 'WIN' are
deprecated and will be removed in the next release. Please use the correct
names "UTF-8" and "WIN1251".
8.2: remove them.
OK?
Why on earth remove it? Just leave it in as an alias to UTF8.
Chris
Karel Zak wrote:
On Sat, 2005-02-19 at 00:27 +1300, Oliver Jowett wrote:
Karel Zak wrote:
Yes, I think we should fix it and remove UNICODE and WIN encoding names
from PG code.
The JDBC driver asks for a UNICODE client encoding before it knows the
server version it is talking to. How do you avoid breaking this?
Fix the JDBC driver as soon as possible.
How, exactly? Ask for a 'utf8' client encoding instead of 'UNICODE'?
Will this work if the driver is connecting to an older server?
Add to the 8.1 release notes: the encoding names 'UNICODE' and 'WIN' are
deprecated and will be removed in the next release. Please use the correct
names "UTF-8" and "WIN1251".
8.0 appears to spell it 'utf8'.
Removing the existing aliases seems like a fairly gratuitous
incompatibility to introduce to me.
-O
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org on behalf of Oliver Jowett
Sent: Fri 2/18/2005 11:27 AM
To: Karel Zak
Cc: List pgsql-hackers
Subject: Re: [HACKERS] UTF8 or Unicode
Karel Zak wrote:
Yes, I think we should fix it and remove UNICODE and WIN encoding names
from PG code.
The JDBC driver asks for a UNICODE client encoding before it knows the
server version it is talking to. How do you avoid breaking this?
So does pgAdmin.
Regards, Dave
Dave Page wrote:
Karel Zak wrote:
Yes, I think we should fix it and remove UNICODE and WIN encoding names
from PG code.
The JDBC driver asks for a UNICODE client encoding before it knows the
server version it is talking to. How do you avoid breaking this?
So does pgAdmin.
I think we just need to _favor_ UTF8. The question is where are we
favoring Unicode rather than UTF8?
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think we just need to _favor_ UTF8.
I agree.
The question is where are we
favoring Unicode rather than UTF8?
It's the canonical name of the encoding, both in the code and the docs.
regression=# create database e encoding 'utf-8';
CREATE DATABASE
regression=# \l
List of databases
Name | Owner | Encoding
------------+----------+-----------
e | postgres | UNICODE
regression | postgres | SQL_ASCII
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII
(5 rows)
As soon as we decide whether the canonical name is "UTF8" or "UTF-8"
;-) we can fix it.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think we just need to _favor_ UTF8.
I agree.
The question is where are we
favoring Unicode rather than UTF8?
It's the canonical name of the encoding, both in the code and the docs.
regression=# create database e encoding 'utf-8';
CREATE DATABASE
regression=# \l
List of databases
Name | Owner | Encoding
------------+----------+-----------
e | postgres | UNICODE
regression | postgres | SQL_ASCII
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII
(5 rows)
As soon as we decide whether the canonical name is "UTF8" or "UTF-8"
;-) we can fix it.
I checked and it looks like "UTF-8" is the correct usage:
http://www.unicode.org/glossary/
--
Bruce Momjian
I do not object to changing UNICODE->UTF-8, but all these discussions
sound a little bit funny to me.
If you want to blame UNICODE, you should blame LATIN1 etc. as
well. LATIN1 (ISO-8859-1) is actually a character set name, not an
encoding name. ISO-8859-1 can be encoded in an 8-bit single-byte
stream, but it can be encoded in 7 bits too. So when we refer to
LATIN1 (ISO-8859-1), it's not clear whether it's encoded in 7 or 8 bits.
--
Tatsuo Ishii
From: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] UTF8 or Unicode
Date: Mon, 21 Feb 2005 22:08:25 -0500 (EST)
Message-ID: <200502220308.j1M38PV03238@candle.pha.pa.us>
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think we just need to _favor_ UTF8.
I agree.
The question is where are we
favoring Unicode rather than UTF8?
It's the canonical name of the encoding, both in the code and the docs.
regression=# create database e encoding 'utf-8';
CREATE DATABASE
regression=# \l
List of databases
Name | Owner | Encoding
------------+----------+-----------
e | postgres | UNICODE
regression | postgres | SQL_ASCII
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII
(5 rows)
As soon as we decide whether the canonical name is "UTF8" or "UTF-8"
;-) we can fix it.
I checked and it looks like "UTF-8" is the correct usage:
http://www.unicode.org/glossary/
--
Bruce Momjian
Tatsuo Ishii wrote:
I do not object to changing UNICODE->UTF-8, but all these discussions
sound a little bit funny to me.
If you want to blame UNICODE, you should blame LATIN1 etc. as
well. LATIN1 (ISO-8859-1) is actually a character set name, not an
encoding name. ISO-8859-1 can be encoded in an 8-bit single-byte
stream, but it can be encoded in 7 bits too. So when we refer to
LATIN1 (ISO-8859-1), it's not clear whether it's encoded in 7 or 8 bits.
Wow, Tatsuo has a point here. Looking at encnames.c, I see:
"UNICODE", PG_UTF8
but also:
"WIN", PG_WIN1251
"LATIN1", PG_LATIN1
and I see conversions for those:
"iso88591", PG_LATIN1
"win", PG_WIN1251
so I see what he is saying. We are not consistent in favoring the
official names vs. the common names.
I will work on a patch that people can review and test.
--
Bruce Momjian
Bruce Momjian wrote:
We are not consistent in favoring the
official names vs. the common names.
The problem is rather that there are too many standards and conventions
to choose from.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On Thu, 2005-02-24 at 23:51 -0500, Bruce Momjian wrote:
Tatsuo Ishii wrote:
I do not object to changing UNICODE->UTF-8, but all these discussions
sound a little bit funny to me.
If you want to blame UNICODE, you should blame LATIN1 etc. as
well. LATIN1 (ISO-8859-1) is actually a character set name, not an
encoding name. ISO-8859-1 can be encoded in an 8-bit single-byte
stream, but it can be encoded in 7 bits too. So when we refer to
LATIN1 (ISO-8859-1), it's not clear whether it's encoded in 7 or 8 bits.
Wow, Tatsuo has a point here. Looking at encnames.c, I see:
"UNICODE", PG_UTF8
but also:
"WIN", PG_WIN1251
"LATIN1", PG_LATIN1
so I see what he is saying. We are not consistent in favoring the
official names vs. the common names.
Yes, as I said already. For example, "WIN" is an extremely bad alias. It is
all a legacy of old versions.
I will work on a patch that people can review and test.
Thanks.
Karel
--
Karel Zak <zakkr@zf.jcu.cz>
Am Freitag, 25. Februar 2005 05:51 schrieb Bruce Momjian:
so I see what he is saying. We are not consistent in favoring the
official names vs. the common names.
I will work on a patch that people can review and test.
I think this is what we should do:
UNICODE => UTF8
ALT => WIN866
WIN => WIN1251
TCVN => WIN1258
That should clear it up.
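Read together with the earlier objections about breaking existing clients, one option is to make the new names canonical while still accepting the old ones on input. The following is a minimal Python sketch under that assumption (the thread does not settle whether the old names stay); the table and function names are hypothetical, and the mapping is taken from the list above.

```python
# Old canonical name -> proposed canonical name, as listed above.
RENAMES = {
    "UNICODE": "UTF8",
    "ALT": "WIN866",
    "WIN": "WIN1251",
    "TCVN": "WIN1258",
}

def canonical_name(name: str) -> str:
    """Map an old or new spelling to the proposed canonical name.

    Names not in the rename table are returned upper-cased and otherwise
    unchanged; no check is made that they are valid encodings.
    """
    upper = name.upper()
    return RENAMES.get(upper, upper)

for old in ("UNICODE", "ALT", "WIN", "TCVN"):
    print(f"{old} => {canonical_name(old)}")
```

A client that still asks for UNICODE would keep working under this scheme, while listings and documentation would show only the new names.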
--
Peter Eisentraut
http://developer.postgresql.org/~petere/