Windows and locales and UTF-8 (oh my)
I've been learning much more than I wanted to know about $SUBJECT
since putting in the src/port/chklocale.c code to try to enforce
that our database encoding matches the system locale settings.
There's an ongoing thread in -patches that's been focused on
getting reasonable behavior from the point of view of the Far
Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)
Here's some more info from an off-list discussion with Dave Page:
------- Forwarded Messages
Date: Fri, 05 Oct 2007 20:54:04 +0100
From: Dave Page <dpage@postgresql.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [CORE] 8.3beta1 Available ...
Dave Page wrote:
Some further info on that - utf-8 on Windows is actually a
pseudo-codepage (65001) which doesn't have NLS files, hence why we have
to convert to utf-16 before sorting. Perhaps the utf-8/65001 name
difference is the problem here. I'll knock up a quick test program when
the kids have gone to bed.
So, my test prog (below) returns the following:
Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United
Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United
Kingdom.65001;LC_NUMERIC=English_United
Kingdom.65001;LC_TIME=English_United Kingdom.65001
So everything other than LC_CTYPE is acceptable in UTF-8 on Windows -
and we already handle LC_CTYPE for UTF-8 on Windows through our UTF-8 ->
UTF-16 conversions internally.
Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
Regards, Dave.
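#include <locale.h>
#include <stdio.h>

/* Set the locale named on the command line, then print what took effect. */
int
main(int argc, char *argv[])
{
    char *lc;

    if (argc > 1)
        setlocale(LC_ALL, argv[1]);
    lc = setlocale(LC_ALL, NULL);   /* NULL queries the current setting */
    printf("%s\n", lc);
    return 0;
}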
------- Message 2
Date: Fri, 05 Oct 2007 23:32:36 +0100
From: Dave Page <dpage@postgresql.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [CORE] 8.3beta1 Available ...
Tom Lane wrote:
Dave Page <dpage@postgresql.org> writes:
So, my test prog (below) returns the following:
Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United
Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United
Kingdom.65001;LC_NUMERIC=English_United
Kingdom.65001;LC_TIME=English_United Kingdom.65001
That's just frickin' weird ... and a bit scary. There is a fair amount
of code in PG that checks for lc_ctype_is_c and does things differently;
one wonders if that isn't going to get misled by this behavior. (Hmm,
maybe this explains some of the "upper/lower doesn't work" reports we've
been getting??) Are you sure all variants of Windows act that way?
All the ones we support afaict.
Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
Is there something in Windows that constrains them to be all the same?
If not this proposal seems just plain wrong :-( But in any case I'd
feel more comfortable having it look at LC_COLLATE.
They can all be set independently - it's just that there's no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'.
As a result, you can't set LC_CTYPE to .65001 because Windows knows it
can't handle ToUpper() or ToLower() etc. but you can use it to encode
messages and other text.
/D
------- End of Forwarded Messages
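The combined string that Dave's test program prints can be picked apart per category. The following is a minimal sketch, assuming the "NAME=value;NAME=value" layout shown in the output above; get_locale_category is a hypothetical helper, not something from the PostgreSQL tree.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the value of one category (e.g. "LC_CTYPE") out of a combined
 * "NAME=value;NAME=value" locale string into buf.
 * Returns 1 if the category was found and fit in buf, 0 otherwise.
 * Hypothetical helper for illustration only.
 */
int
get_locale_category(const char *combined, const char *category,
                    char *buf, size_t buflen)
{
    size_t      catlen;
    const char *p = combined;

    assert(combined != NULL && category != NULL && buf != NULL);
    catlen = strlen(category);

    while (p != NULL && *p != '\0')
    {
        const char *end = strchr(p, ';');
        size_t      entrylen = end ? (size_t) (end - p) : strlen(p);

        /* An entry matches when it starts with "<category>=". */
        if (entrylen > catlen &&
            strncmp(p, category, catlen) == 0 &&
            p[catlen] == '=')
        {
            size_t      vallen = entrylen - catlen - 1;

            if (vallen >= buflen)
                return 0;       /* value would not fit */
            memcpy(buf, p + catlen + 1, vallen);
            buf[vallen] = '\0';
            return 1;
        }
        p = end ? end + 1 : NULL;
    }
    return 0;
}
```

Fed the string from Dave's run, asking for "LC_CTYPE" yields "C" while the other categories yield "English_United Kingdom.65001", which is the mismatch under discussion.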
I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places. ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
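As a rough illustration of the proposed hack (not the actual backend code), the override could look like the sketch below. The function name and its three parameters are assumptions standing in for state the real lc_ctype_is_c() would obtain from setlocale() and the database encoding.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical stand-in for the proposed check.  reported_ctype is what
 * setlocale() claims for LC_CTYPE; the two flags replace state the real
 * backend would look up itself.  None of these names are PostgreSQL's.
 */
int
ctype_is_c_sketch(const char *reported_ctype, int db_encoding_is_utf8,
                  int on_windows)
{
    assert(reported_ctype != NULL);

    /* Proposed override: never trust "C" for a UTF-8 database on Windows. */
    if (on_windows && db_encoding_is_utf8)
        return 0;

    return strcmp(reported_ctype, "C") == 0 ||
           strcmp(reported_ctype, "POSIX") == 0;
}
```

Note that, as written, the override also fires for a database that genuinely uses locale C with UTF-8 on Windows, so the single-byte fast paths would be skipped there too.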
That still leaves me with a boatload of questions, though. If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?
Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages and if so what happens? If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?
One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001
to get a UTF8-compatible locale name. That'd only work on Windows
but that seems the platform where we're most likely to see unsupportable
default encodings.
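The "strip the codepage and plaster on .65001" idea amounts to a small string rewrite. Here is a minimal sketch, assuming the codepage indicator is whatever follows the last dot in the locale name; make_utf8_locale_name is a hypothetical name, and whether Windows accepts the result for every category is exactly what the rest of this thread is about.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<language>_<country>.65001" from a Windows-style locale name,
 * dropping any existing ".codepage" suffix first.
 * Returns 1 on success, 0 if the result does not fit in buf.
 * Hypothetical helper for illustration only.
 */
int
make_utf8_locale_name(const char *locale, char *buf, size_t buflen)
{
    const char *dot;
    size_t      baselen;
    int         n;

    assert(locale != NULL && buf != NULL);

    /* Assumption: the codepage is everything after the last '.' */
    dot = strrchr(locale, '.');
    baselen = dot ? (size_t) (dot - locale) : strlen(locale);

    n = snprintf(buf, buflen, "%.*s.65001", (int) baselen, locale);
    return n >= 0 && (size_t) n < buflen;
}
```

For example, "English_United Kingdom.1252" becomes "English_United Kingdom.65001", and a name with no codepage simply gains the suffix.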
Comments? I don't have a Windows development environment so I'm not
in a position to take the lead on testing/fixing this sort of stuff.
regards, tom lane
It seems like the root of the problems we're butting our heads against with
encoding and locale is all the same issue: it's nonsensical to take the locale
at initdb time per-cluster and then allow user-specified encoding
per-database. If anything it would make more sense to go the other way around.
But actually it seems to me we could allow changing both on a per-database
basis with certain restrictions:
. template0 is always SQL_ASCII with locale C
. when creating a new database you can specify the encoding and locale and we
check that they're compatible.
. when creating a new database from a template the new locale and encoding
must be identical to the template database's encoding and locale. Unless the
template is template0 in which case we rebuild all indexes after copying.
We could liberalize this last restriction if we created a new encoding like
SQL_ASCII but which enforces 7-bit ascii. But then the index rebuild step
could take a long time.
This would make the whole locale/encoding issue much more transparent. In
database listings you would see both listed alongside, you wouldn't be bound
by any initdb environment choices, and errors when running create database
would be able to tell you exactly what you're doing wrong and what you have to
do to avoid the problem.
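The template rules above can be sketched as a small decision function. This is only one reading of the proposal: it assumes an identical encoding and locale always allows a plain copy, it leaves out the separate "is this encoding compatible with this locale" check, and all names in it are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum create_action
{
    CREATE_REJECT,              /* settings differ from a normal template */
    CREATE_COPY,                /* settings identical: plain copy */
    CREATE_COPY_AND_REINDEX     /* template0: copy, then rebuild indexes */
};

struct db_settings
{
    const char *name;
    const char *encoding;
    const char *locale;
};

/* Decide what CREATE DATABASE would do under the proposed restrictions. */
enum create_action
plan_create_database(const struct db_settings *tmpl,
                     const char *new_encoding, const char *new_locale)
{
    assert(tmpl != NULL && new_encoding != NULL && new_locale != NULL);

    if (strcmp(tmpl->encoding, new_encoding) == 0 &&
        strcmp(tmpl->locale, new_locale) == 0)
        return CREATE_COPY;

    /* template0 is assumed to hold only 7-bit ASCII, so any target works. */
    if (strcmp(tmpl->name, "template0") == 0)
        return CREATE_COPY_AND_REINDEX;

    return CREATE_REJECT;
}
```

Returning a distinct reject code is what would let the error message say exactly which setting disagrees with the template.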
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Friday, 12 October 2007, Gregory Stark wrote:
. when creating a new database from a template the new locale and encoding
must be identical to the template database's encoding and locale. Unless
the template is template0 in which case we rebuild all indexes after
copying.
Why would you restrict the index rebuilding only to this particular case? It
could be done for any database.
The other issue is shared catalogs.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
"Peter Eisentraut" <peter_e@gmx.net> writes:
On Friday, 12 October 2007, Gregory Stark wrote:
. when creating a new database from a template the new locale and encoding
must be identical to the template database's encoding and locale. Unless
the template is template0 in which case we rebuild all indexes after
copying.
Why would you restrict the index rebuilding only to this particular case? It
could be done for any database.
Well there's no guarantee there isn't 8-bit data in other databases which
would be invalid in the new encoding. I think it's reasonable to assume
there's only 7-bit ascii in template0 however.
An alternative would be introducing an ASCII7 encoding which template0 would
use and any other database in that encoding could be used as a template for
any encoding. However that would still require index rebuilds which would
potentially take a long time. Another alternative would be recoding all the
data from the template database encoding to the new encoding and throwing an
error if a non-encodable character is found.
I think it's a lot simpler to just declare it a non-problem by saying there
won't be any non-ascii text in template0.
The other issue is shared catalogs.
This approach doesn't address that but I don't think it makes the problems
there any worse either. That is, I think we already have these problems around
shared tables.
. If you have two databases with locales that don't agree then the indexes on
those tables won't function properly.
. What happens if you create a user while connected to a latin1 database with
an é in his username and then connect to a UTF8 database? That
username is now an invalidly encoded UTF8 string.
Perhaps we should be using pattern_ops for the indexes on the shared tables?
Or using bytea with UTF8 encoded strings instead of name and text? That
actually sounds reasonable now that we have convert() functions which take and
generate bytea, at least for the text fields like in pltemplate -- less so for
the name columns.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Fri, Oct 12, 2007 at 02:03:47PM +0100, Gregory Stark wrote:
This approach doesn't address that but I don't think it makes the problems
there any worse either. That is, I think we already have these problems around
shared tables.
Or we could just setup encodings/locales per column and the problem
goes away entirely. Most of the code's already been written, it's not
even terribly difficult. Where we're stuck is that we can't agree on a
source of locale data. People don't want the ICU or glibc data and
there's no other source as readily available.
Perhaps we should fix that problem, rather than making more
workarounds.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
"Martijn van Oosterhout" <kleptog@svana.org> writes:
People don't want the ICU or glibc data and there's no other source as
readily available.
Perhaps we should fix that problem, rather than making more
workarounds.
Fix the problem by making ICU a smaller less complex dependency?
Or fix the problem that glibc isn't everyone's libc?
I think realistically we're basically waiting for strcoll_l to become
standardized by POSIX so we can depend on it.
Personally I think we should just implement our own strcoll_l as a wrapper
around setlocale-strcoll-setlocale and use strcoll_l if it's available and
our, possibly slow, wrapper if not. If we ban direct use of strcoll and other
lc_collate sensitive functions in Postgres we could also remember the last
locale used and not do unnecessary setlocales so existing use cases aren't
slowed down at all.
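A minimal sketch of the fallback wrapper described above, assuming every caller goes through it and nobody else calls setlocale() behind its back. The name sketch_strcoll_l and the ok out-parameter are illustrative only; the real strcoll_l takes a locale_t rather than a locale name, and this version is not thread-safe.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Name of the LC_COLLATE locale most recently installed by the wrapper. */
static char last_collate[128];
static int  have_last = 0;

/*
 * Compare a and b under the named collation.  setlocale() is only called
 * when the requested locale differs from the one installed last time, so
 * repeated comparisons in a single locale pay no switching cost.
 * Sets *ok to 0 (and returns 0) if the locale cannot be installed.
 */
int
sketch_strcoll_l(const char *a, const char *b, const char *locale_name,
                 int *ok)
{
    assert(a != NULL && b != NULL && locale_name != NULL && ok != NULL);

    if (!have_last || strcmp(last_collate, locale_name) != 0)
    {
        if (strlen(locale_name) >= sizeof(last_collate) ||
            setlocale(LC_COLLATE, locale_name) == NULL)
        {
            *ok = 0;
            return 0;
        }
        strcpy(last_collate, locale_name);
        have_last = 1;
    }

    *ok = 1;
    return strcoll(a, b);
}
```

The cache is only valid if nothing else changes LC_COLLATE, which is why the proposal pairs it with banning direct use of strcoll and similar functions.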
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Friday, 12 October 2007, Martijn van Oosterhout wrote:
Where we're stuck is that we can't agree on a
source of locale data. People don't want the ICU or glibc data and
there's no other source as readily available.
What were the objections to ICU?
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
"Peter Eisentraut" <peter_e@gmx.net> writes:
On Friday, 12 October 2007, Martijn van Oosterhout wrote:
Where we're stuck is that we can't agree on a
source of locale data. People don't want the ICU or glibc data and
there's no other source as readily available.
What were the objections to ICU?
It's introducing a new dependency to do something fundamental to Postgres, one
that's larger than all of Postgres.
It would make Postgres inconsistent and less integrated with the rest of the
OS. How do you explain that Postgres doesn't follow the system's
configurations and the collations don't agree with the system collations?
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
On Oct 12, 2007, at 10:19 , Gregory Stark wrote:
It would make Postgres inconsistent and less integrated with the
rest of the
OS. How do you explain that Postgres doesn't follow the system's
configurations and the collations don't agree with the system
collations?
How is this fundamentally different from PostgreSQL using a separate
users/roles system than the OS?
Michael Glaesemann
grzm seespotcode net
On Friday, 12 October 2007, Gregory Stark wrote:
It would make Postgres inconsistent and less integrated with the rest of
the OS. How do you explain that Postgres doesn't follow the system's
configurations and the collations don't agree with the system collations?
We already have our own encoding support (for better or worse), and I don't
think having one's own locale support would be that much different.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Michael Glaesemann wrote:
On Oct 12, 2007, at 10:19 , Gregory Stark wrote:
It would make Postgres inconsistent and less integrated with the rest
of the
OS. How do you explain that Postgres doesn't follow the system's
configurations and the collations don't agree with the system
collations?
How is this fundamentally different from PostgreSQL using a separate
users/roles system than the OS?
Even more, eliminating dependencies on a OS's correct implementation of
locale stuff appears A Good Thing to me. I wonder if a compile time
option to use ICU in 8.4 should be considered, regarding all those
lengthy threads about encoding/locale/collation problems.
Regards,
Andreas
On Fri, Oct 12, 2007 at 03:28:26PM +0100, Gregory Stark wrote:
Fix the problem by making ICU a smaller less complex dependency?
How? It's 95% data, you can't reduce that. glibc also has 10MB of locale
data. That actual code is much smaller than postgres and doesn't depend
on any other non-system libraries.
I think realistically we're basically waiting for strcoll_l to become
standardized by POSIX so we can depend on it.
I think we could be waiting forever then. It's supported by Win32,
MacOSX and glibc. The systems that don't support it tend not to support
multibyte collation anyway. Patches have been created to use this and
rejected because not enough platforms support it...
Personally I think we should just implement our own strcoll_l as a wrapper
around setlocale-strcoll-setlocale and use strcoll_l if it's available and
our, possibly slow, wrapper if not. If we ban direct use of strcoll and other
lc_collate sensitive functions in Postgres we could also remember the last
locale used and not do unnecessary setlocales so existing use cases aren't
slowed down at all.
Been done also. As I recall it was *really* slow, not just a little
bit.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Peter Eisentraut <peter_e@gmx.net> writes:
On Friday, 12 October 2007, Gregory Stark wrote:
It would make Postgres inconsistent and less integrated with the rest of
the OS. How do you explain that Postgres doesn't follow the system's
configurations and the collations don't agree with the system collations?
We already have our own encoding support (for better or worse), and I don't
think having one's own locale support would be that much different.
Well, yes it would be, because encodings are pretty well standardized;
there is not likely to be any user-visible difference between one
platform's idea of UTF8 and another's. This is very very far from being
the case for locales. See for instance the recent thread in which we
found out that "en_US" locale has utterly different sort orders on
Linux and OS X.
regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes:
On Fri, Oct 12, 2007 at 03:28:26PM +0100, Gregory Stark wrote:
I think realistically we're basically waiting for strcoll_l to become
standardized by POSIX so we can depend on it.
I think we could be waiting forever then.
strcoll is only a small fraction of the problem anyway. The <ctype.h>
and <wctype.h> functions are another chunk of it, and then there's the
issues of system message spellings, LC_MONETARY info, etc etc.
regards, tom lane
Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
On Friday, 12 October 2007, Gregory Stark wrote:
It would make Postgres inconsistent and less integrated with the rest of
the OS. How do you explain that Postgres doesn't follow the system's
configurations and the collations don't agree with the system collations?
We already have our own encoding support (for better or worse), and I don't
think having one's own locale support would be that much different.
Well, yes it would be, because encodings are pretty well standardized;
there is not likely to be any user-visible difference between one
platform's idea of UTF8 and another's. This is very very far from being
the case for locales. See for instance the recent thread in which we
found out that "en_US" locale has utterly different sort orders on
Linux and OS X.
For me, this paragraph is more of an argument *in favour* of having our own
locale support. At least for me, consistency between PG running on different
platforms would bring more benefits than consistency between PG and the platform
it runs on.
At the company I used to work for, we had all our databases running with
encoding=utf-8 and locale=C, because I didn't want our applications to depend on
platform-specific locale issues. Plus, some of the applications supported
multiple languages, making a cluster-global locale unworkable anyway - a
restriction which would go away if we went with ICU.
regards, Florian Pflug
On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
I've been learning much more than I wanted to know about $SUBJECT
since putting in the src/port/chklocale.c code to try to enforce
that our database encoding matches the system locale settings.
There's an ongoing thread in -patches that's been focused on
getting reasonable behavior from the point of view of the Far
Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)
Here's some more info from an off-list discussion with Dave Page:
Sorry for the late response to this. I missed the beginning and then got
mixed up in the different threads going around :-)
Tom Lane wrote:
Dave Page <dpage@postgresql.org> writes:
So, my test prog (below) returns the following:
Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United
Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United
Kingdom.65001;LC_NUMERIC=English_United
Kingdom.65001;LC_TIME=English_United Kingdom.65001
That's just frickin' weird ... and a bit scary. There is a fair amount
of code in PG that checks for lc_ctype_is_c and does things differently;
one wonders if that isn't going to get misled by this behavior. (Hmm,
maybe this explains some of the "upper/lower doesn't work" reports we've
been getting??) Are you sure all variants of Windows act that way?
All the ones we support afaict.
AFAICT, this has been standard behaviour in Windows since forever. Certainly
since Windows 2000 which is what we care about.
Windows 9x had different ways of dealing with it since they weren't native
UTF16 internally, but that doesn't matter to us here.
Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
Is there something in Windows that constrains them to be all the same?
If not this proposal seems just plain wrong :-( But in any case I'd
feel more comfortable having it look at LC_COLLATE.
They can all be set independently - it's just that there's no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'.
As a result, you can't set LC_CTYPE to .65001 because Windows knows it
can't handle ToUpper() or ToLower() etc. but you can use it to encode
messages and other text.
Yes. And also important, you can set LC_COLLATE to it, which will make all
the UTF16 versions of the functions behave properly.
Remember - all the Windows NT+ operations are UTF16 internally. So when you
set LC_TIME to it, for example, the API functions will generate the
resulting string in UTF16 and then convert it to whatever encoding you
chose - be it UTF8 or LATIN1 or whatever.
I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places. ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
Yes, I think we need a change to that routine.
But. What about the case when we actually *have* locale=C and
encoding=UTF8. We need to care for that one somehow. Perhaps we should look
at LC_COLLATE instead (again, on Windows only. Possibly even only in the
windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?
That still leaves me with a boatload of questions, though. If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?
GetACP() returns the "ANSI Codepage", which I *think* is what we're looking
for here.
http://msdn2.microsoft.com/en-us/library/ms776259.aspx
We should be able to compare that to something?
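One thing the codepage number could be compared against is the database encoding name. The sketch below assumes the number comes from something like GetACP() on Windows (not called here, so the code stays portable) and uses a deliberately tiny, illustrative table; the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Illustrative subset only; a real table would need many more rows. */
struct cp_map
{
    unsigned    codepage;
    const char *encoding;       /* PostgreSQL encoding name */
};

static const struct cp_map cp_table[] = {
    {1252, "WIN1252"},
    {28591, "LATIN1"},
    {65001, "UTF8"},
};

/* Return the encoding name for a codepage, or NULL if it is not listed. */
const char *
encoding_for_codepage(unsigned codepage)
{
    size_t      i;

    for (i = 0; i < sizeof(cp_table) / sizeof(cp_table[0]); i++)
        if (cp_table[i].codepage == codepage)
            return cp_table[i].encoding;
    return NULL;
}

/* Return 1 if the codepage is known and maps to db_encoding, else 0. */
int
codepage_matches_encoding(unsigned codepage, const char *db_encoding)
{
    const char *enc;

    assert(db_encoding != NULL);
    enc = encoding_for_codepage(codepage);
    return enc != NULL && strcmp(enc, db_encoding) == 0;
}
```

An unknown codepage is treated as a mismatch here; a real check might instead prefer to fall back to "could not determine", as the existing SQL_ASCII case does.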
Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages and if so what happens? If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?
AFAIK, yes, and then you get it back in the wrong encoding.
But as long as we set them to the same, we should be safe. And AFAIK, only
UTF8 (and UTF7, but we don't support that) is the special one we need to
care about.
One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001
to get a UTF8-compatible locale name. That'd only work on Windows
but that seems the platform where we're most likely to see unsupportable
default encodings.
Um, yes, that should work - assuming encoding is set to UTF8. We can't do
that for any other encoding, of course.
Comments? I don't have a Windows development environment so I'm not
in a position to take the lead on testing/fixing this sort of stuff.
I have the Windows dev environment, but I feel like I'm on deep water
whenever I talk locale/encoding stuff really, I don't know it as well as
I'd like to. But I'm happy to do coding and testing if I can get enough
pointers on what I need to test :)
//Magnus
On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places. ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
Yes, I think we need a change to that routine.
But. What about the case when we actually *have* locale=C and
encoding=UTF8. We need to care for that one somehow. Perhaps we should look
at LC_COLLATE instead (again, on Windows only. Possibly even only in the
windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?
Hmm. Looking more at that, may there be another problem? Looking at
WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
will then be "C" - even if the database isn't in C.
But I don't really know when that code is called, or if I'm just looking at
things wrong. Just starting up and shutting down the database leaves it at
Swedish_Sweden.1252, not C.
(1252 is still the wrong encoding specifier, but it'll work anyway since we
convert to UTF16)
Now, I came across this trying to find a way for lc_ctype_is_c() to
determine if the database is in C locale or not, *without* resorting to
setlocale(). Any pointers on how to do that properly?
Also, any pointers on a way to check for the kind of failure that's to be
expected from this one returning the wrong thing?
One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001
to get a UTF8-compatible locale name. That'd only work on Windows
but that seems the platform where we're most likely to see unsupportable
default encodings.
Um, yes, that should work - assuming encoding is set to UTF8. We can't do
that for any other encoding, of course.
Looking at that, we don't actually need to put that at the end of the
locale-name - all locale names will work with UTF8, even one specifying
1252.
Attached patch seems to work for me for that part. Still doesn't touch
lc_ctype_is_c().
//Magnus
Attachments:
win32_utf8.patch (text/plain; charset=us-ascii)
Index: backend/commands/dbcommands.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.201
diff -c -r1.201 dbcommands.c
*** backend/commands/dbcommands.c 13 Oct 2007 20:18:41 -0000 1.201
--- backend/commands/dbcommands.c 15 Oct 2007 10:55:20 -0000
***************
*** 258,264 ****
/*
* Check whether encoding matches server locale settings. We allow
! * mismatch in two cases:
*
* 1. ctype_encoding = SQL_ASCII, which means either that the locale
* is C/POSIX which works with any encoding, or that we couldn't determine
--- 258,264 ----
/*
* Check whether encoding matches server locale settings. We allow
! * mismatch in three cases:
*
* 1. ctype_encoding = SQL_ASCII, which means either that the locale
* is C/POSIX which works with any encoding, or that we couldn't determine
***************
*** 268,279 ****
--- 268,286 ----
* This is risky but we have historically allowed it --- notably, the
* regression tests require it.
*
+ * 3. selected encoding is UTF8 and platform is win32. This is because
+ * UTF8 is a pseudo codepage that is supported in all locales since
+ * it's converted to UTF16 before being used.
+ *
* Note: if you change this policy, fix initdb to match.
*/
ctype_encoding = pg_get_encoding_from_locale(NULL);
if (!(ctype_encoding == encoding ||
ctype_encoding == PG_SQL_ASCII ||
+ #ifdef WIN32
+ encoding == PG_UTF8 ||
+ #endif
(encoding == PG_SQL_ASCII && superuser())))
ereport(ERROR,
(errmsg("encoding %s does not match server's locale %s",
Index: bin/initdb/initdb.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/bin/initdb/initdb.c,v
retrieving revision 1.145
diff -c -r1.145 initdb.c
*** bin/initdb/initdb.c 13 Oct 2007 20:18:41 -0000 1.145
--- bin/initdb/initdb.c 15 Oct 2007 10:50:27 -0000
***************
*** 2840,2846 ****
/* We allow selection of SQL_ASCII --- see notes in createdb() */
if (!(ctype_enc == user_enc ||
ctype_enc == PG_SQL_ASCII ||
! user_enc == PG_SQL_ASCII))
{
fprintf(stderr, _("%s: encoding mismatch\n"), progname);
fprintf(stderr,
--- 2840,2856 ----
/* We allow selection of SQL_ASCII --- see notes in createdb() */
if (!(ctype_enc == user_enc ||
ctype_enc == PG_SQL_ASCII ||
! user_enc == PG_SQL_ASCII
! #ifdef WIN32
! /*
! * On win32, if the encoding chosen is UTF8, all locales are OK
! * (assuming the actual locale name passed the checks above). This
! * is because UTF8 is a pseudo-codepage, that we convert to UTF16
! * before doing any operations on, and UTF16 supports all locales.
! */
! || user_enc == PG_UTF8
! #endif
! ))
{
fprintf(stderr, _("%s: encoding mismatch\n"), progname);
fprintf(stderr,
On Mon, Oct 15, 2007 at 01:26:00PM +0200, Magnus Hagander wrote:
On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places. ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
Yes, I think we need a change to that routine.
But. What about the case when we actually *have* locale=C and
encoding=UTF8. We need to care for that one somehow. Perhaps we should look
at LC_COLLATE instead (again, on Windows only. Possibly even only in the
windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?
Hmm. Looking more at that, may there be another problem? Looking at
WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
will then be "C" - even if the database isn't in C.
But I don't really know when that code is called, or if I'm just looking at
things wrong. Just starting up and shutting down the database leaves it at
Swedish_Sweden.1252, not C.
(1252 is still the wrong encoding specifier, but it'll work anyway since we
convert to UTF16)
Gah, got that backwards. Of course it does, because it only returns "C" if
we set to Swedish_Sweden.65001, and we don't *do* that with the patch I
sent in earlier. We set it to Swedish_Sweden, which is a perfectly valid
LC_CTYPE.
And given that, do we even need to special-case lc_ctype_is_c() at all? If
we never pass in a .65001 locale (which we don't, because it fails)?
//Magnus
Magnus Hagander <magnus@hagander.net> writes:
On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php...
And given that, do we even need to special-case lc_ctype_is_c() at all? If
we never pass in a .65001 locale (which we don't, because it fails)?
Hmm. If it doesn't need a special case, then we still lack an
explanation for the aforementioned bug report.
regards, tom lane
Tom Lane wrote:
Magnus Hagander <magnus@hagander.net> writes:
On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php...
And given that, do we even need to special-case lc_ctype_is_c() at all? If
we never pass in a .65001 locale (which we don't, because it fails)?
Hmm. If it doesn't need a special case, then we still lack an
explanation for the aforementioned bug report.
From what I can tell that report doesn't tell us very much - we don't
know server encoding, we don't know server locale, we don't even know
client encoding. So I don't think we know anywhere *near* enough to say
it's related to this.
//Magnus
Magnus Hagander <magnus@hagander.net> writes:
Tom Lane wrote:
Hmm. If it doesn't need a special case, then we still lack an
explanation for the aforementioned bug report.
From what I can tell that report doesn't tell us very much - we don't
know server encoding, we don't know server locale, we don't even know
client encoding. So I don't think we know anywhere *near* enough to say
it's related to this.
In the followup we found out that he was using UTF-8 encoding:
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
So while that report certainly left a great deal to be desired in terms
of precision, my gut tells me it's related. Has anyone tried to
reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
Windows locale?
regards, tom lane
Tom Lane wrote:
Magnus Hagander <magnus@hagander.net> writes:
Tom Lane wrote:
Hmm. If it doesn't need a special case, then we still lack an
explanation for the aforementioned bug report.
From what I can tell that report doesn't tell us very much - we don't
know server encoding, we don't know server locale, we don't even know
client encoding. So I don't think we know anywhere *near* enough to say
it's related to this.
In the followup we found out that he was using UTF-8 encoding:
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
So while that report certainly left a great deal to be desired in terms
of precision, my gut tells me it's related. Has anyone tried to
reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
Windows locale?
It doesn't tell us if it's the client or the server that's in UTF8, and
it doesn't tell us about the locale.
Euler Taveira de Oliveira's response says he can't reproduce it. I
haven't tried myself, and that webpage really doesn't tell us what
the character is. If someone can comment on that, I can try to repro it
on my systems.
//Magnus
On Mon, Oct 15, 2007 at 07:44:00PM +0200, Magnus Hagander wrote:
Tom Lane wrote:
Magnus Hagander <magnus@hagander.net> writes:
Tom Lane wrote:
Hmm. If it doesn't need a special case, then we still lack an
explanation for the aforementioned bug report.
From what I can tell that report doesn't tell us very much - we don't
know server encoding, we don't know server locale, we don't even know
client encoding. So I don't think we know anywhere *near* enough to say
it's related to this.
In the followup we found out that he was using UTF-8 encoding:
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
So while that report certainly left a great deal to be desired in terms
of precision, my gut tells me it's related. Has anyone tried to
reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
Windows locale?
It doesn't tell us if it's the client or the server that's in UTF8, and
it doesn't tell us about the locale.
Euler Taveira de Oliveira's response says he can't reproduce it. I
haven't tried myself, and that webpage really doesn't tell us what
the character is. If someone can comment on that, I can try to repro it
on my systems.
Got some help on IRC to identify the characters as � and �.
I can confirm that both work perfectly fine with UTF-8 and locale
Swedish_Sweden.1252. They sort correctly, and they work with both upper()
and lower() correctly.
This test is with 8.3-HEAD and the patch to allow UTF-8.
This leads me to believe that something is wrong with the OP's system. Most
likely it's just the client that's in UTF8 mode, and the server is
SQL_ASCII.
//Magnus
Magnus Hagander wrote:
Got some help on IRC to identify the characters as � and �.
Exact.
I can confirm that both work perfectly fine with UTF-8 and locale
Swedish_Sweden.1252. They sort correctly, and they work with both upper()
and lower() correctly.
I didn't remember what locale is. I'll check it.
This test is with 8.3-HEAD and the patch to allow UTF-8.
I tested with 8.2.4 and my encoding is LATIN1 IIRC. Didn't try UTF-8.
I'll give it a try when I have my dev environment.
--
Euler Taveira de Oliveira
http://www.timbira.com/