Japanese words not distinguished

Started by Harry Mantheakisalmost 21 years ago11 messagesgeneral
Jump to latest
#1Harry Mantheakis
harry@mantheakis.freeserve.co.uk

Hello

I run PostgreSQL 7.4.6 on Linux with a JDBC client.

I initialised my database cluster with the following initdb command:

initdb --locale=en_GB.UTF-8 --encoding UNICODE

I have now discovered that my database cannot distinguish Japanese names or
words - it throws unique constraint errors on a composite primary key that
includes a VARCHAR field which stores the names or words.

My tests indicate that the database treats all Japanese names/words as
equal.

Having searched the forum archives, it seems to me that I should have
specified "--locale=C" as the locale setting when I initialised my database
cluster.

I am planning to re-initialise my database cluster using the following
command:

initdb --locale=C --encoding UNICODE

Then, after defining the relevant groups and users, I intend to call
pg_restore with reference to a "tar.gz" dump file of my data.

I wonder if someone might be kind enough to confirm that this is the right
approach to solving the problem.

Many thanks!

Harry Mantheakis
London, UK

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Harry Mantheakis (#1)
Re: Japanese words not distinguished

Harry Mantheakis <harry@mantheakis.freeserve.co.uk> writes:

I run PostgreSQL 7.4.6 on Linux with a JDBC client.

I initialised my database cluster with the following initdb command:

initdb --locale=en_GB.UTF-8 --encoding UNICODE

I have now discovered that my database cannot distinguish Japanese names or
words - it throws unique constraint errors on a composite primary key that
includes a VARCHAR field which stores the names or words.

My tests indicate that the database treats all Japanese names/words as
equal.

Hmm, is that actually the correct spelling of the locale? On my Linux
box, locale -a says it's "en_GB.utf8". I'm not sure how well initdb can
verify the validity of a locale parameter, especially back in the 7.4
branch. It could be that you are actually using a locale that doesn't
use UTF8 encoding, in which case this behavior is not unheard of
(still pretty broken, IMHO, but I've seen plenty of locale definitions
that just fail on data outside their supported character set).

If you did correctly specify a UTF8-using locale, you probably ought to
report this behavior to your Linux supplier as a bug in that locale
definition. It doesn't have to sort or case-fold random UTF8 data very
nicely, but it certainly shouldn't report distinct strings as equal.

regards, tom lane

#3Harry Mantheakis
harry@mantheakis.freeserve.co.uk
In reply to: Tom Lane (#2)
Re: Japanese words not distinguished

Hmm, is that actually the correct spelling of the locale? On my Linux
box, locale -a says it's "en_GB.utf8". I'm not sure how well initdb can
verify the validity of a locale parameter, especially back in the 7.4
branch. It could be that you are actually using a locale that doesn't
use UTF8 encoding, in which case this behavior is not unheard of
(still pretty broken, IMHO, but I've seen plenty of locale definitions
that just fail on data outside their supported character set).

Calling "locale -a" on my Linux server also lists "en_GB.utf8".

It also lists "en_US.utf8" and yet all the related environment variables
(LC_COLLATE, etc.) indicate their locale settings is "en_US.UTF-8".

I do not know what to make of that.

If you did correctly specify a UTF8-using locale, you probably ought to
report this behavior to your Linux supplier as a bug in that locale
definition. It doesn't have to sort or case-fold random UTF8 data very
nicely, but it certainly shouldn't report distinct strings as equal.

I'll look into that - I'm running Fedora Core 3.

Meanwhile, am I correct in assuming that re-initialising my database cluster
with "--locale=C" will solve the problem?

What is more, am I correct in assuming that I can then restore my data with
pg_restore, as prescribed in the documentation?

Kind regards

Harry Mantheakis
London, UK

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Harry Mantheakis (#3)
Re: Japanese words not distinguished

Harry Mantheakis <harry@mantheakis.freeserve.co.uk> writes:

Meanwhile, am I correct in assuming that re-initialising my database cluster
with "--locale=C" will solve the problem?

AFAIK it should --- of course you won't get any very intelligent sorting
or case folding, but at least it can tell the difference between
different characters ;-). Be sure to still use encoding = unicode.

regards, tom lane

#5Harry Mantheakis
harry@mantheakis.freeserve.co.uk
In reply to: Tom Lane (#4)
Re: Japanese words not distinguished

Meanwhile, am I correct in assuming that re-initialising my database cluster
with "--locale=C" will solve the problem?

AFAIK it should --- of course you won't get any very intelligent sorting
or case folding, but at least it can tell the difference between
different characters ;-). Be sure to still use encoding = unicode.

Okay, thanks for confirming that (!)

But... will setting the C locale affect how PostgreSQL sorts Latin-1 words?

If it does, I'm in trouble! I need to be able to sort (using ORDER BY
queries) English and other European names/words. (I've tried searching for
this topic, but failed to find an answer.)

Would specifying a locale for LC_COLLATE take care of this? Perhaps
something like this:

initdb --locale=C --lc-collate=en_GB.UTF-8 --encoding UNICODE

I'm sorry to drag this out. Perhaps I cannot have it both ways.

Kind regards

Harry Mantheakis
London, UK

#6Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Harry Mantheakis (#1)
Re: Japanese words not distinguished

Hello

I run PostgreSQL 7.4.6 on Linux with a JDBC client.

I initialised my database cluster with the following initdb command:

initdb --locale=en_GB.UTF-8 --encoding UNICODE

I have now discovered that my database cannot distinguish Japanese names or
words - it throws unique constraint errors on a composite primary key that
includes a VARCHAR field which stores the names or words.

My tests indicate that the database treats all Japanese names/words as
equal.

That's a famous problem among Japaneses PostgreSQL users since the
locale support was born.

Having searched the forum archives, it seems to me that I should have
specified "--locale=C" as the locale setting when I initialised my database
cluster.

I am planning to re-initialise my database cluster using the following
command:

initdb --locale=C --encoding UNICODE

Then, after defining the relevant groups and users, I intend to call
pg_restore with reference to a "tar.gz" dump file of my data.

I wonder if someone might be kind enough to confirm that this is the right
approach to solving the problem.

Correct. The lesson is, never use locale support for Asian languages
and multibyte encodings including UTF-8.
--
Tatsuo Ishii

#7Harry Mantheakis
harry@mantheakis.freeserve.co.uk
In reply to: Tatsuo Ishii (#6)
Re: Japanese words not distinguished

Correct. The lesson is, never use locale support for Asian languages
and multibyte encodings including UTF-8.

Thank you for your reply - much appreciated.

I'm now concerned if and how this will affect ORDER BY query results (and
other functions) with respect to Latin-1 names and words.

I think I'll have to suck it and see, and then post my results - but that
won't be until after this next week-end.

Kind regards

Harry Mantheakis
London, UK

#8Stuart Bishop
stuart@stuartbishop.net
In reply to: Harry Mantheakis (#7)
Re: Japanese words not distinguished

Harry Mantheakis wrote:

Correct. The lesson is, never use locale support for Asian languages
and multibyte encodings including UTF-8.

Thank you for your reply - much appreciated.

I'm now concerned if and how this will affect ORDER BY query results (and
other functions) with respect to Latin-1 names and words.

I think I'll have to suck it and see, and then post my results - but that
won't be until after this next week-end.

C locale and en_* locales give different ordering (at least under Linux).
The en_* ordering is case insensitive, and the C locale ordering is case
sensitive because it is simply comparing the ASCII codes.

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

#9Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Stuart Bishop (#8)
Re: Japanese words not distinguished

Harry Mantheakis wrote:

Correct. The lesson is, never use locale support for Asian languages
and multibyte encodings including UTF-8.

Thank you for your reply - much appreciated.

I'm now concerned if and how this will affect ORDER BY query results (and
other functions) with respect to Latin-1 names and words.

I think I'll have to suck it and see, and then post my results - but that
won't be until after this next week-end.

C locale and en_* locales give different ordering (at least under Linux).
The en_* ordering is case insensitive, and the C locale ordering is case
sensitive because it is simply comparing the ASCII codes.

You could use lower/upper to get case insensitive ordering with C locale.
--
Tatsuo Ishii

#10Harry Mantheakis
harry@mantheakis.freeserve.co.uk
In reply to: Tatsuo Ishii (#9)
Re: Japanese words not distinguished

C locale and en_* locales give different ordering (at least under Linux).
The en_* ordering is case insensitive, and the C locale ordering is case
sensitive because it is simply comparing the ASCII codes.

You could use lower/upper to get case insensitive ordering with C locale.

Okay, that's great, so I'm going to re-initialise my database cluster with
the C locale this week-end.

I'll report the results, for the record.

Thanks to everyone who replied!

Kind regards

Harry Mantheakis
London, UK

#11Harry Mantheakis
harry@mantheakis.freeserve.co.uk
In reply to: Harry Mantheakis (#10)
Re: Japanese words not distinguished - Solution

I'll report the results, for the record.

Okay, for the record, all went well. I re-initialise my PostgreSQL 7.4
database cluster using the following command:

initdb --locale=C --encoding UNICODE

Then, after defining the relevant groups and users, I used pg_restore to
restore my data from a dump file.

And now everything works as expected with both European (Latin-1) and
Japanese names and words.

Sorting with ORDER BY on European names and words is adequate for my
requirements, even though the C locale ordering is case sensitive.

As Tatsuo Ishii pointed out, I can use lower/upper to get case insensitive
ordering if I ever need it.

The bottom line seems to be: if you're planning to store data with languages
from across the world, initialise your database cluster with the C locale
and Unicode encoding.

Thanks again to everyone who replied!

Kind regards

Harry Mantheakis
London, UK