UTF-8 collation on Windows?

Started by Dev Kumkarabout 12 years ago28 messagesgeneral
Jump to latest
#1Dev Kumkar
devdas.kumkar@gmail.com

Am really going no where with this after so many searching over net or am
missing some basic things, not sure!

What is the equivalent for "en_US.UTF-8" collation in case of windows?

In Linux am creating database with following options, as follows:
-E utf8 -l en_US.UTF-8 -T template0

This creates utf8 encoding and also the collation is set as en_US.UTF-8.

in case of windows utf8 encoding parameter works but not sure abou the
equivalent for en_US.UTF-8 collation.
Default database gets created with 'English_United States.1252' collation.

As a result the sorting is not same in case of windows and linux database.

Any suggestions/ pointers here?

Regards...

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#1)
Re: UTF-8 collation on Windows?

On 02/19/2014 06:41 AM, Dev Kumkar wrote:

Am really going no where with this after so many searching over net or
am missing some basic things, not sure!

What is the equivalent for "en_US.UTF-8" collation in case of windows?

In Linux am creating database with following options, as follows:
-E utf8 -l en_US.UTF-8 -T template0

This creates utf8 encoding and also the collation is set as en_US.UTF-8.

in case of windows utf8 encoding parameter works but not sure abou the
equivalent for en_US.UTF-8 collation.
Default database gets created with 'English_United States.1252' collation.

As a result the sorting is not same in case of windows and linux database.

Any suggestions/ pointers here?

I found the below that might help. I do not use Windows much any more so
I do not have a machine handy to confirm.

http://www.g-loaded.eu/2011/02/27/locale-windows/

Regards...

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Adrian Klaver (#2)
Re: UTF-8 collation on Windows?

On Wed, Feb 19, 2014 at 10:16 PM, Adrian Klaver
<adrian.klaver@aklaver.com>wrote:

I found the below that might help. I do not use Windows much any more so I
do not have a machine handy to confirm.

http://www.g-loaded.eu/2011/02/27/locale-windows/

Thanks for the pointer. "*american_usa*" works however it sets the
LC_COLLATE to 'English_United States.1252' which is basically "ANSI Latin
1" and is not utf8. The third parameter as the link says is codeset -
"language_territory.codeset".

Here is the list of all codesets
http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx but still
no success.

One of the link says codepage 65001 and utf-8 is same -
http://stackoverflow.com/questions/1629437/is-codepage-65001-and-utf-8-the-same-thing
.

Am not able to find windows codeset equivalent of utf-8 and set it as
LC_COLLATE while creating database.

Has anyone set the LC_COLLATE as utf-8 on windows?

Regards...

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#3)
Re: UTF-8 collation on Windows?

On 02/19/2014 11:42 AM, Dev Kumkar wrote:

On Wed, Feb 19, 2014 at 10:16 PM, Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

I found the below that might help. I do not use Windows much any
more so I do not have a machine handy to confirm.

http://www.g-loaded.eu/2011/02/27/locale-windows/

Thanks for the pointer. "*american_usa*" works however it sets the
LC_COLLATE to 'English_United States.1252' which is basically "ANSI
Latin 1" and is not utf8. The third parameter as the link says is
codeset - "language_territory.codeset".

So what is the exact command you are using?

Here is the list of all codesets
http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx but
still no success.

One of the link says codepage 65001 and utf-8 is same -
http://stackoverflow.com/questions/1629437/is-codepage-65001-and-utf-8-the-same-thing.

Am not able to find windows codeset equivalent of utf-8 and set it as
LC_COLLATE while creating database.

Has anyone set the LC_COLLATE as utf-8 on windows?

Regards...

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Adrian Klaver (#4)
Re: UTF-8 collation on Windows?

On Thu, Feb 20, 2014 at 1:19 AM, Adrian Klaver <adrian.klaver@aklaver.com>wrote:

So what is the exact command you are using?

createdb -U postgres -E utf8 -l american_usa <DBNAME>
Above command fails to create utf-8 LC_COLLATE.

Regards...

#6Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#5)
Re: UTF-8 collation on Windows?

On 02/19/2014 12:03 PM, Dev Kumkar wrote:

On Thu, Feb 20, 2014 at 1:19 AM, Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

So what is the exact command you are using?

createdb -U postgres -E utf8 -l american_usa <DBNAME>
Above command fails to create utf-8 LC_COLLATE.

What does it set LC_CTYPE to?

So what happens if you do?:

createdb -U postgres -E utf8 -l american_usa.65001 <DBNAME>

or

createdb -U postgres -E utf8 --lc-ctype=american_usa
--lc-collate=american_usa <DBNAME>

Regards...

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Adrian Klaver (#6)
Re: UTF-8 collation on Windows?

On Thu, Feb 20, 2014 at 1:41 AM, Adrian Klaver <adrian.klaver@aklaver.com>wrote:

What does it set LC_CTYPE to?

So what happens if you do?:

createdb -U postgres -E utf8 -l american_usa.65001 <DBNAME>

*createdb: database creation failed: ERROR: invalid locale name:
"american_usa.65001" *

or

createdb -U postgres -E utf8 --lc-ctype=american_usa
--lc-collate=american_usa <DBNAME>

Succeeds but as replied earlier it creates database with LC_COLLATE =
'English_United States.1252' which corresponds to Latin1.

Regards...

#8Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#7)
Re: UTF-8 collation on Windows?

On 02/19/2014 12:16 PM, Dev Kumkar wrote:

On Thu, Feb 20, 2014 at 1:41 AM, Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

What does it set LC_CTYPE to?

So what happens if you do?:

createdb -U postgres -E utf8 -l american_usa.65001 <DBNAME>

*createdb: database creation failed: ERROR: invalid locale name:
"american_usa.65001" *

or

createdb -U postgres -E utf8 --lc-ctype=american_usa
--lc-collate=american_usa <DBNAME>

Succeeds but as replied earlier it creates database with LC_COLLATE =
'English_United States.1252' which corresponds to Latin1.

Just noticed you are not specifying the template database. Try using
template0:

createdb -U postgres -E utf8 --lc-ctype=american_usa
--lc-collate=american_usa -T template0 <DBNAME>

Regards...

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Adrian Klaver (#8)
Re: UTF-8 collation on Windows?

On Thu, Feb 20, 2014 at 2:01 AM, Adrian Klaver <adrian.klaver@aklaver.com>wrote:

Just noticed you are not specifying the template database. Try using
template0:

createdb -U postgres -E utf8 --lc-ctype=american_usa
--lc-collate=american_usa -T template0 <DBNAME>

Same result i.e. LC_COLLATE and LC_CTYPE gets set as 'English_United
States.1252'
Had specified template option this earlier too.

Btw in case the specified collation is incompatible then "createdb" binary
complains and instructs to use template0 as template, which isn't the case
here.

The codeset parameter is missed here and hence it takes the collation as
the default one which is 'English_United States.1252'.

Regards...

#10Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#9)
Re: UTF-8 collation on Windows?

On 02/19/2014 12:43 PM, Dev Kumkar wrote:

On Thu, Feb 20, 2014 at 2:01 AM, Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

Just noticed you are not specifying the template database. Try using
template0:

createdb -U postgres -E utf8 --lc-ctype=american_usa
--lc-collate=american_usa -T template0 <DBNAME>

Same result i.e. LC_COLLATE and LC_CTYPE gets set as 'English_United
States.1252'
Had specified template option this earlier too.

Btw in case the specified collation is incompatible then "createdb"
binary complains and instructs to use template0 as template, which isn't
the case here.

The codeset parameter is missed here and hence it takes the collation as
the default one which is 'English_United States.1252'.

Alright last shot:)

Taking hint from here:

http://msdn.microsoft.com/en-us/library/x99tb11d.aspx

try:

createdb -U postgres -E utf8 -l en-US <DBNAME>

If that does not work, not sure where to go.

Regards...

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Adrian Klaver (#10)
Re: UTF-8 collation on Windows?

On Thu, Feb 20, 2014 at 2:24 AM, Adrian Klaver <adrian.klaver@aklaver.com>wrote:

Alright last shot:)

Taking hint from here:

http://msdn.microsoft.com/en-us/library/x99tb11d.aspx

try:

createdb -U postgres -E utf8 -l en-US <DBNAME>

If that does not work, not sure where to go.

This won't work on Windows. Note that en-US collation name is specific to
linux and in case of Windows these names are different which is where am
facing issues to find the exact code page which corresponds to utf8.

Your msdn link has this mentioned which states that for code pages that
require more than two bytes per character which is basically UTF8 doesn't
work with setlocale command. But again its specific to the setlocale API.

*"The locale argument can take a locale name, a language string, a language
string and country/region code, a code page, or a
language string, country/region code, and code page. The set of available
locale names, languages, country/region codes, and code
pages includes all those supported by the Windows NLS API except code pages
that require more than two bytes per character, such as
UTF-7 and UTF-8. If you provide a code page value of UTF-7 or UTF-8,
setlocale will fail, returning NULL.*"

However am sure there would be some codepage which can be used in
postgreSQL to set the collation to UTF8 equivalent of linux.

Please suggest? Am sure this not something new which am looking for.

Regards...

#12Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#11)
Re: UTF-8 collation on Windows?

On 02/19/2014 01:09 PM, Dev Kumkar wrote:

On Thu, Feb 20, 2014 at 2:24 AM, Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

Alright last shot:)

Taking hint from here:

http://msdn.microsoft.com/en-__us/library/x99tb11d.aspx
<http://msdn.microsoft.com/en-us/library/x99tb11d.aspx&gt;

try:

createdb -U postgres -E utf8 -l en-US <DBNAME>

If that does not work, not sure where to go.

This won't work on Windows. Note that en-US collation name is specific
to linux and in case of Windows these names are different which is where
am facing issues to find the exact code page which corresponds to utf8.

Have you tried it?

Note that the locale name is different then the one Linux.

On Linux it is en_US.

What I suggested is en-US.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#13Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Adrian Klaver (#12)
Re: UTF-8 collation on Windows?

On Thu, Feb 20, 2014 at 2:45 AM, Adrian Klaver <adrian.klaver@aklaver.com>wrote:

Have you tried it?

Note that the locale name is different then the one Linux.

On Linux it is en_US.

What I suggested is en-US.

Yes. Here is the output:
createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
Password:
*createdb: database creation failed: ERROR: invalid locale name: "en-US"*

Regards...

#14Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#13)
Re: UTF-8 collation on Windows?

On 02/19/2014 01:21 PM, Dev Kumkar wrote:

On Thu, Feb 20, 2014 at 2:45 AM, Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

Have you tried it?

Note that the locale name is different then the one Linux.

On Linux it is en_US.

What I suggested is en-US.

Yes. Here is the output:
createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
Password:
*createdb: database creation failed: ERROR: invalid locale name: "en-US"*

Hmmm, well I am out of ideas:(

Regards...

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#15John R Pierce
pierce@hogranch.com
In reply to: Dev Kumkar (#13)
Re: UTF-8 collation on Windows?

On 2/19/2014 1:21 PM, Dev Kumkar wrote:

createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
Password:
*createdb: database creation failed: ERROR: invalid locale name: "en-US"*

I believe its en_US ... _ not -

--
john r pierce 37N 122W
somewhere on the middle of the left coast

#16Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Adrian Klaver (#14)
Re: UTF-8 collation on Windows?

On 20/02/14 10:28, Adrian Klaver wrote:

On 02/19/2014 01:21 PM, Dev Kumkar wrote:

On Thu, Feb 20, 2014 at 2:45 AM, Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:

Have you tried it?

Note that the locale name is different then the one Linux.

On Linux it is en_US.

What I suggested is en-US.

Yes. Here is the output:
createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
Password:
*createdb: database creation failed: ERROR: invalid locale name:
"en-US"*

Hmmm, well I am out of ideas:(

Regards...

Upgrade servers to Linux? :-P

Cheers,
Gavin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#17Adrian Klaver
adrian.klaver@aklaver.com
In reply to: John R Pierce (#15)
Re: UTF-8 collation on Windows?

On 02/19/2014 01:30 PM, John R Pierce wrote:

On 2/19/2014 1:21 PM, Dev Kumkar wrote:

createdb -U postgres -E utf8 -l en-US -T template0 mynewdb
Password:
*createdb: database creation failed: ERROR: invalid locale name: "en-US"*

I believe its en_US ... _ not -

Unfortunately this is a Windows install and that does not work either.

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#18John R Pierce
pierce@hogranch.com
In reply to: Adrian Klaver (#17)
Re: UTF-8 collation on Windows?

On 2/19/2014 1:35 PM, Adrian Klaver wrote:

Unfortunately this is a Windows install and that does not work either.

windows encodings are a pain. their Unicode is NOT utf8, its ucs2 aka
utf16. I just checked my default install of potsgres 9.2, it appears
its using WIN1252 encoding, another bastard, this is a modified ISO-8859-1

--
john r pierce 37N 122W
somewhere on the middle of the left coast

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#19Dev Kumkar
devdas.kumkar@gmail.com
In reply to: John R Pierce (#18)
Re: UTF-8 collation on Windows?

On Thu, Feb 20, 2014 at 3:17 AM, John R Pierce <pierce@hogranch.com> wrote:

On 2/19/2014 1:35 PM, Adrian Klaver wrote:

Unfortunately this is a Windows install and that does not work either.

windows encodings are a pain. their Unicode is NOT utf8, its ucs2 aka
utf16. I just checked my default install of potsgres 9.2, it appears its
using WIN1252 encoding, another bastard, this is a modified ISO-8859-1

Yes I guess that will be a different topic altogether to discuss. As
windows internally uses UTF-16 encoding. I have one use case wherein will
have to change the client_encoding in ODBC so that Unicode characters on
windows are stored correctly. Currently on linux machine same unicode data
is stored correctly but in case of windows it is not by default, so there
are tweaks there required as well. But separate topic later.

Coming back here, yes by default it is WIN1252 on windows. So is there no
way I could achieve linux alike utf-8 collation?
Linux and windows sorting order behaves differently in this case.

Regards...

#20Daniel Verite
daniel@manitou-mail.org
In reply to: Dev Kumkar (#7)
Re: UTF-8 collation on Windows?

Dev Kumkar wrote:

Succeeds but as replied earlier it creates database with LC_COLLATE =
'English_United States.1252' which corresponds to Latin1.

Despite windows-1252 being a monobyte encoding sharing most
of LATIN1 codes and character set, it does not mean that
English_United States.1252 is limited to this character set.
You may use UTF-8 databases with that locale.

Consider the 2nd paragraph of "Character Set Support"
in the doc:
http://www.postgresql.org/docs/current/static/multibyte.html

"For C or POSIX locale, any character set is allowed, but for other
locales there is only one character set that will work
correctly. (On Windows, however, UTF-8 encoding can be used with
any locale.)"

This is a key difference with Unix when choosing a locale.

As for getting the exact same sort order than Linux, it's not possible but
that's not a Windows-vs-Unix issue. If you used FreeBSD or MacOS X, some
en_US.UTF-8 collation rules would differ from Linux's libc too, resulting in
a different sort order for certain strings.

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#21Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Daniel Verite (#20)
#22Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Gavin Flower (#16)
#23Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Dev Kumkar (#22)
#24Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Gavin Flower (#23)
#25Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#24)
#26Dev Kumkar
devdas.kumkar@gmail.com
In reply to: Adrian Klaver (#25)
#27Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#24)
#28Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dev Kumkar (#26)