Re: UNICODE/UTF-8 on win32
UNICODE/UTF-8 does not work on the win32 server. The reason is that
strcoll() and friends don't work with it. To support it on win32, the
data needs to be converted to UTF16 and passed to the wide-character
versions of the functions, which we do not do.
(See
http://archives.postgresql.org/pgsql-hackers-win32/2004-11/msg00036.php
and
http://archives.postgresql.org/pgsql-hackers-win32/2004-12/msg00106.php)
I don't *think* we need to disable it on the client. AFAIK, the client
interfaces don't use any of these functions, and I've seen reports of
people using that long before we had a native win32 server.
//Magnus
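
For readers following along, here is a minimal sketch of the "convert, then use the wide-character function" idea described above. It is not PostgreSQL code: the helper name sketch_wide_collate is hypothetical, and it uses the portable mbstowcs() for the conversion step. On win32 that step would presumably be done with the MultiByteToWideChar() API and the CP_UTF8 code page instead, since the C runtime there does not decode UTF-8 through the locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not from the PostgreSQL source): compare two
 * NUL-terminated multibyte strings by converting both to wide characters
 * and calling wcscoll(). Returns <0, 0 or >0 like strcoll(). If a string
 * cannot be converted in the current locale, or memory runs out, *failed
 * is set to 1 and the return value is 0.
 */
static int
sketch_wide_collate(const char *a, const char *b, int *failed)
{
    size_t   na, nb;
    wchar_t *wa, *wb;
    int      result = 0;

    assert(a != NULL && b != NULL && failed != NULL);

    /* A string never has more wide characters than bytes. */
    na = strlen(a) + 1;
    nb = strlen(b) + 1;
    wa = malloc(na * sizeof(wchar_t));
    wb = malloc(nb * sizeof(wchar_t));

    *failed = 1;
    if (wa != NULL && wb != NULL &&
        mbstowcs(wa, a, na) != (size_t) -1 &&
        mbstowcs(wb, b, nb) != (size_t) -1)
    {
        result = wcscoll(wa, wb);   /* wide-character sibling of strcoll() */
        *failed = 0;
    }

    free(wa);
    free(wb);
    return result;
}
```

A caller would check the failed flag before trusting the result, because a conversion failure is exactly the case this thread is about.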
-----Original Message-----
From: Tatsuo Ishii [mailto:t-ishii@sra.co.jp]
Sent: den 1 januari 2005 01:10
To: tgl@sss.pgh.pa.us
Cc: Magnus Hagander; pgsql-hackers-win32@postgresql.org
Subject: Re: [pgsql-hackers-win32] UNICODE/UTF-8 on win32

Sorry, but I don't subscribe to pgsql-hackers-win32 list. What's the
problem here?
--
Tatsuo Ishii

> "Magnus Hagander" <mha@sollentuna.net> writes:
> > We know it's broken and won't be fixed for 8.0.
> > If we just #ifndef WIN32 the definitions in utils/mb/encnames.c it won't
> > be possible to select that encoding, right? Will that have any other
> > unwanted effects (such as breaking client encodings)? If not, I suggest
> > this is done.
>
> I believe the subscripts in those arrays have to match the encoding
> enum type, so you can't just ifdef out individual entries.
>
> > (Or perhaps something can be done in pg_valid_server_encoding?)
>
> Making the valid_server_encoding function reject it might work.
> Tatsuo-san would know for sure.
>
> Should we also reject it as a client encoding, or does that work OK?
>
> regards, tom lane
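
To illustrate the point about array subscripts having to match the enum, here is a simplified sketch. Every name in it (sk_encoding, sk_encoding_names, sk_valid_server_encoding) is hypothetical and the table is far shorter than the real one in utils/mb/encnames.c. It only shows why removing one table entry with #ifndef would shift every later subscript, while a check inside the validity function leaves the table alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily shortened list of encoding ids. */
typedef enum
{
    SK_SQL_ASCII = 0,
    SK_EUC_JP,
    SK_UTF8,
    SK_LATIN1,
    SK_ENCODING_COUNT
} sk_encoding;

/*
 * The table is indexed by the enum value, so entry i must describe
 * encoding i. Dropping the "UTF8" row on one platform would make
 * sk_encoding_names[SK_LATIN1] point at the wrong (or no) entry.
 */
static const char *const sk_encoding_names[SK_ENCODING_COUNT] = {
    "SQL_ASCII",
    "EUC_JP",
    "UTF8",
    "LATIN1"
};

/* Look up a name by subscript; the id must be in range. */
static const char *
sk_encoding_name(int enc)
{
    assert(enc >= 0 && enc < SK_ENCODING_COUNT);
    return sk_encoding_names[enc];
}

/*
 * Reject the encoding as a server encoding on win32 without touching
 * the table, so client-side lookups keep working.
 */
static int
sk_valid_server_encoding(int enc)
{
    if (enc < 0 || enc >= SK_ENCODING_COUNT)
        return 0;
#ifdef WIN32
    if (enc == SK_UTF8)
        return 0;
#endif
    return 1;
}
```

Keeping the restriction in one function also makes it easy to relax later, for example to allow UTF8 only when the locale is C.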
TODO updated:
o Disallow encodings like UTF8 which PostgreSQL supports
but the operating system does not (already disallowed by
pginstaller)
To fix UTF8, the data needs to be converted to UTF16 and then
the Win32 strcoll() can be used.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
I do understand the problem, but don't understand the decision you
guys made. The fact that UPPER/LOWER and some other functions do not
work on win32 is surely a problem for some languages, but not a
problem for others. For example, Japanese (and probably Chinese and
Korean) does not have a concept of upper/lower case. So the fact that
UPPER/LOWER does not work with UTF-8/win32 is not a problem for
Japanese (and for some other languages). Just using C locale with
UTF-8 is enough in this case.

In summary, I think you guys are going to overkill the multibyte
support functionality on UTF-8/win32 because of the fact that some
languages do not work.

The same can be said of EUC-JP, EUC-CN, EUC-KR and so on as well.

I strongly object to the policy of unconditionally disabling UTF-8
support on win32.
--
Tatsuo Ishii
In reply to: "Magnus Hagander" <mha@sollentuna.net>
Subject: RE: [pgsql-hackers-win32] UNICODE/UTF-8 on win32
Date: Sat, 1 Jan 2005 14:48:04 +0100
Message-ID: <6BCB9D8A16AC4241919521715F4D8BCE4764A4@algol.sollentuna.se>
> I do understand the problem, but don't understand the decision you
> guys made. The fact that UPPER/LOWER and some other functions do not
> work on win32 is surely a problem for some languages, but not a
> problem for others. For example, Japanese (and probably Chinese and
> Korean) does not have a concept of upper/lower case. So the fact that
> UPPER/LOWER does not work with UTF-8/win32 is not a problem for
> Japanese (and for some other languages). Just using C locale with
> UTF-8 is enough in this case.

The main issue is not with upper/lower, it's with ORDER BY (and doesn't
that affect indexes as well?). This affects Japanese as well, no?

I didn't consider the C locale. Do you know for a fact that it works
there on win32 as well, or is that an assumption? (I don't know either
way.)

> In summary, I think you guys are going to overkill the multibyte
> support functionality on UTF-8/win32 because of the fact that some
> languages do not work.

I was under the impression that *no* languages worked. If some do work,
then we definitely should not kill it.

It would be good to have some way of detecting if it worked or not at
the time of creation of the database. But I have no idea on how to do
that in a reasonable way.

//Magnus
"Magnus Hagander" <mha@sollentuna.net> writes:
> I didn't consider the C locale. Do you know for a fact that it works
> there on win32 as well, or is that an assumption?

It should work. The only use of strcoll() in the backend is in
varstr_cmp which uses strncmp() instead for C locale. Lack of
working upper/lower is hardly a fatal objection, considering that
we never had that for UTF8 before 8.0 anyway. But you do have to
have working varstr_cmp.

> It would be good to have some way of detecting if it worked or not at
> the time of creation of the database. But I have no idea on how to do
> that in a reasonable way.

At this point I'd say that any combination of UTF8 encoding with a non
C/POSIX locale probably isn't going to work on Windows. Tatsuo, do you
know of other cases that will work?

regards, tom lane

> At this point I'd say that any combination of UTF8 encoding with a non
> C/POSIX locale probably isn't going to work on Windows. Tatsuo, do you
> know of other cases that will work?

No. I think C is the only working locale.
--
Tatsuo Ishii
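
The reason the C locale is expected to work can be sketched as follows. This is a simplified illustration of the branch described above (strncmp() for the C locale, strcoll() otherwise), not the actual varstr_cmp source; the function name and the collate_is_c flag are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch of a varstr_cmp-style comparison of two byte
 * strings that are not NUL-terminated. With collate_is_c set, only
 * strncmp() is used, so the encoding never reaches the C library's
 * locale machinery. Otherwise the strings are copied into
 * NUL-terminated buffers and handed to strcoll().
 */
static int
sketch_varstr_cmp(const char *a, size_t alen,
                  const char *b, size_t blen,
                  int collate_is_c)
{
    int result;

    assert(a != NULL && b != NULL);

    if (collate_is_c)
    {
        size_t n = (alen < blen) ? alen : blen;

        result = strncmp(a, b, n);
        /* On a tie over the common prefix, the shorter string sorts first. */
        if (result == 0 && alen != blen)
            result = (alen < blen) ? -1 : 1;
    }
    else
    {
        char *a0 = malloc(alen + 1);
        char *b0 = malloc(blen + 1);

        if (a0 == NULL || b0 == NULL)
        {
            free(a0);
            free(b0);
            abort();            /* out of memory; a real server would report an error */
        }
        memcpy(a0, a, alen);
        a0[alen] = '\0';
        memcpy(b0, b, blen);
        b0[blen] = '\0';

        result = strcoll(a0, b0);   /* locale-dependent path */

        free(a0);
        free(b0);
    }
    return result;
}
```

In the C-locale branch the result is a plain byte order, so for UTF-8 data the two-byte sequence for "é" (0xC3 0xA9) sorts after "z" (0x7A). That order is stable, which is what an index needs, even if it is not linguistically meaningful.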
> > I do understand the problem, but don't understand the decision you
> > guys made. The fact that UPPER/LOWER and some other functions do not
> > work on win32 is surely a problem for some languages, but not a
> > problem for others. For example, Japanese (and probably Chinese and
> > Korean) does not have a concept of upper/lower case. So the fact that
> > UPPER/LOWER does not work with UTF-8/win32 is not a problem for
> > Japanese (and for some other languages). Just using C locale with
> > UTF-8 is enough in this case.
>
> The main issue is not with upper/lower, it's with ORDER BY (and doesn't
> that affect indexes as well?). This affects Japanese as well, no?

As long as it is used with C locale, indexes should be OK. ORDER BY is
not perfect but we can live with it. Since Japanese uses ideograms, we
cannot rely on ORDER BY character codes to sort Japanese characters
anyway. I believe the same can be said of Chinese.

> I didn't consider the C locale. Do you know for a fact that it works
> there on win32 as well, or is that an assumption? (I don't know either
> way.)

I have not tested 8.0 on win32, but I think it should work with C
locale since I know PowerGres, which is based on 7.4, works.
--
Tatsuo Ishii
What would it take to make the PG installer into a merge module? I
don't have the stuff to build PG so I can't build the PG install,
though I do have Wix. It would make my life (and anyone else using PG
for a specific app) a lot easier if you guys would allow us to embed
the PG install in our own install. This would let us just pass in the
setup info for the app and let PG install mostly silently. For my app,
the only thing the user needs to see from PG is the license which is
different from the commercial license on the rest of the product. The
rest I can configure from the main install. Right now the end user has
to configure things right and follow directions, and that leads to tech
support issues when they screw up. I tried using the silent install
option on the main MSI and got all sorts of problems. (Besides, many
Win2k setups with their old MSIexec don't even support a silent
install.)
=====
"We'll do the undoable, work the unworkable, scrute the inscrutable and have a long, hard look at the ineffable to see whether it might not be effed after all"
Magnus, where are we on this? Seems we should allow unicode encoding
and just not unicode locale in pginstaller.
Also, Unicode is changing to UTF-8 in 8.1.
--
Bruce Momjian
The installer does not permit it, but initdb lets you do anything you
want - I think that's where we are. If you know what you're doing, you
can use it by manually initdbing.

There is no such thing as "unicode locale". Unicode (UTF8) is an
encoding that has to be paired with a locale. I assume you mean C
locale.

While UPPER/LOWER does not matter, sort order does - for indexes if
nothing else. I'm unsure if this works - I think I read reports about
it not working, but I haven't tried it out myself.

I was hoping for a final solution for 8.1 which actually fixes it so it
works all the way. Not sure if I can make that happen myself, but I can
always try unless someone else does it.

//mha
-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: den 22 februari 2005 04:43
To: Tatsuo Ishii

Magnus, where are we on this? Seems we should allow unicode encoding
and just not unicode locale in pginstaller.

Also, Unicode is changing to UTF-8 in 8.1.
Magnus Hagander wrote:
> The installer does not permit it, but initdb lets you do anything you
> want - I think that's where we are. If you know what you're doing, you
> can use it by manually initdbing.
>
> There is no such thing as "unicode locale". Unicode (UTF8) is an
> encoding that has to be paired with a locale. I assume you mean C
> locale.

Oh, sorry. So there is no ordering in Unicode? No wonder some
languages can't use Unicode effectively. I can see why ordering is
meaningless for creating a document that is just displayed but important
for a database.

I have added the last sentence to the TODO list:

    o Disallow encodings like UTF8 which PostgreSQL supports
      but the operating system does not (already disallowed by
      pginstaller)

      To fix UTF8, the data needs to be converted to UTF16 and then
      the Win32 wcscoll() can be used, and perhaps other functions
      like towupper(). However, UTF8 already works with normal
      locales but provides no ordering.

> While UPPER/LOWER does not matter, sort order does - for indexes if
> nothing else. I'm unsure if this works - I think I read reports about
> it not working, but I haven't tried it out myself.

I assume C just compares the bytes, meaning equality comparisons are
fine, but greater/less than is consistent but meaningless.
--
Bruce Momjian
> To fix UTF8, the data needs to be converted to UTF16 and then
> the Win32 wcscoll() can be used, and perhaps other functions
> like towupper(). However, UTF8 already works with normal
> locales but provides no ordering.

Right. So if that's fixed, then UTF8 will work only on Windows?
(currently, upper/lower does not work with 2+ byte unicode characters, on any OS)

... John
"John Hansen" <john@geeknet.com.au> writes:
> Right. So if that's fixed, then UTF8 will work only on Windows?

No.

> (currently, upper/lower does not work with 2+ byte unicode characters, on any OS)

This information is obsolete.

regards, tom lane
OK, let me rephrase:
currently, upper/lower does not work with 2+ byte unicode characters on any OS under the C locale.
... John
> currently, upper/lower does not work with 2+ byte unicode
> characters on any OS under the C locale.

Btw, there are only 15 cases in the UTF-8 repertoire that depend on locale. These are the only cases where pg should report:

ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding.

when doing a select upper/lower (col). All others should work just fine.

The error should probably also be changed to a warning, and just return the offending character unmodified.

... John
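
The "return the offending character unmodified" suggestion could look roughly like the sketch below. It is an illustration only, not how the backend implements upper(): the name sketch_upper_lenient and its return convention are made up for the example, and it works on whatever multibyte encoding the current LC_CTYPE locale defines.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: upper-case src into dst (capacity dstlen bytes,
 * including the terminating NUL). Bytes that the current locale cannot
 * decode are copied through unchanged instead of raising an error.
 * Returns the number of bytes that were passed through undecoded.
 * Output is silently truncated if dst is too small.
 */
static size_t
sketch_upper_lenient(const char *src, char *dst, size_t dstlen)
{
    mbstate_t in, out;
    size_t    n, i = 0, o = 0, bad = 0;

    assert(src != NULL && dst != NULL && dstlen > 0);

    n = strlen(src);
    memset(&in, 0, sizeof in);
    memset(&out, 0, sizeof out);

    while (i < n)
    {
        char    buf[MB_LEN_MAX];
        wchar_t wc;
        size_t  w;
        size_t  r = mbrtowc(&wc, src + i, n - i, &in);

        if (r == (size_t) -1 || r == (size_t) -2 || r == 0)
        {
            /* Invalid or incomplete sequence: keep one byte as it is. */
            buf[0] = src[i];
            w = 1;
            r = 1;
            bad++;
            memset(&in, 0, sizeof in);
        }
        else
        {
            w = wcrtomb(buf, (wchar_t) towupper((wint_t) wc), &out);
            if (w == (size_t) -1)
            {
                /* Result not representable: keep the original bytes. */
                memcpy(buf, src + i, r);
                w = r;
                memset(&out, 0, sizeof out);
            }
        }

        if (o + w >= dstlen)
            break;              /* no room left for this character plus NUL */
        memcpy(dst + o, buf, w);
        o += w;
        i += r;
    }
    dst[o] = '\0';
    return bad;
}
```

A non-zero return value is where a caller could emit the proposed warning. What happens to non-ASCII input depends on the C library and the active locale, so no particular output is claimed here for strings such as 'æøå'.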
Bruce Momjian wrote:
> Oh, sorry. So there is no ordering in Unicode?

That statement is meaningless. Unicode is a character set, not a
collation order.

> No wonder some
> languages can't use Unicode effectively.

That has nothing to do with it.

> o Disallow encodings like UTF8 which PostgreSQL supports
>   but the operating system does not (already disallowed by
>   pginstaller)

I think the warning that initdb shouts out is already enough for this.
I don't think we want to disallow this for people who know what they
are doing.

> I assume C just compares the bytes, meaning equality comparisons are
> fine, but greater/less than is consistent but meaningless.

That statement is independent of whether you use Unicode or something
else.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
John Hansen wrote:
> currently, upper/lower does not work with 2+ byte unicode characters,
> on any OS under the C locale.

Sure it does. It's just that the defined behavior of the C locale is
often useless in practice.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
> John Hansen wrote:
> > currently, upper/lower does not work with 2+ byte unicode characters,
> > on any OS under the C locale.
>
> Sure it does. It's just that the defined behavior of the C
> locale is often useless in practice.

select upper('æøå');
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding.

Consequently it seems that it does not work.

... John
"John Hansen" <john@geeknet.com.au> writes:
> > Sure it does. It's just that the defined behavior of the C
> > locale is often useless in practice.
>
> select upper('æøå');
> ERROR: invalid multibyte character for locale
> HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding.
>
> Consequently it seems that it does not work.

"It fails on my machine" should not be read as "it doesn't work for anyone".
It all depends on how your local mbstowcs() works.

regards, tom lane
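
One way to find out how a given system behaves is a small probe along these lines. It is only a sketch: sketch_locale_can_decode is a hypothetical name, and it uses mbsrtowcs(), the restartable sibling of mbstowcs(), because ISO C defines its behaviour for a NULL destination. A check of this kind might also serve for the "detect it at database creation time" idea raised earlier in the thread, though that is an assumption and not something anyone here proposed in this form.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical probe: can the C library decode `sample` when LC_CTYPE is
 * set to `locale_name`? Returns 1 if yes, 0 if the conversion reports an
 * invalid multibyte sequence, and -1 if the locale cannot be selected.
 * The previous LC_CTYPE setting is restored before returning.
 */
static int
sketch_locale_can_decode(const char *locale_name, const char *sample)
{
    char        saved[256];
    const char *current;
    const char *p;
    mbstate_t   st;
    int         ok;

    assert(locale_name != NULL && sample != NULL);

    /* Remember the current setting; the returned string may be overwritten. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    /* With a NULL destination, mbsrtowcs() only counts and validates. */
    p = sample;
    memset(&st, 0, sizeof st);
    ok = (mbsrtowcs(NULL, &p, 0, &st) != (size_t) -1);

    setlocale(LC_CTYPE, saved);
    return ok;
}
```

Running it with the bytes of 'æøå' and the locale in question would show whether a particular machine falls on the "works" or the "fails" side of this discussion.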
> > select upper('æøå');
> > ERROR: invalid multibyte character for locale
> > HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding.
> >
> > Consequently it seems that it does not work.
>
> "It fails on my machine" should not be read as "it doesn't
> work for anyone".
> It all depends on how your local mbstowcs() works.

OK. Do you have an example of a system on which it works?

... John