BUG #1721: mutiple bytes character string comaprison error

Started by Chii-Tung Liu almost 21 years ago | 9 messages | bugs
#1 Chii-Tung Liu
cdliou@mail.cyut.edu.tw

The following bug has been logged online:

Bug reference: 1721
Logged by: Chii-Tung Liu
Email address: cdliou@mail.cyut.edu.tw
PostgreSQL version: 8.0.3
Operating system: Windows XP SP2
Description: mutiple bytes character string comaprison error
Details:

When comparing two UTF-8 encoded strings that contain Chinese words, the
result is always TRUE.
1. Create a database named test with the encoding set to UNICODE:
CREATE DATABASE test
WITH OWNER = postgres
ENCODING = 'UNICODE'
TABLESPACE = pg_default;
2. Insert data containing Chinese words:
INSERT INTO node (title) VALUES ('1 中文');

3. SELECT title FROM node WHERE title > '1.1 ';
returns '1 中文'.

4. Both SELECT '1 中文' > '1.1' and SELECT '1.1' > '1 中文' return
FALSE
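
For reference, any consistent ordering makes exactly one of a > b and b > a true for two distinct strings, so the result in step 4 cannot be correct behavior. A minimal Python sketch, assuming plain code point comparison (Python's built-in str ordering, not PostgreSQL's locale-based comparison); the helper name exactly_one_greater is made up for illustration:

```python
# Python compares str values code point by code point. This only illustrates
# what a well-behaved ordering looks like; it is not PostgreSQL's comparison.
a = "1 中文"
b = "1.1"


def exactly_one_greater(x: str, y: str) -> bool:
    """Return True when exactly one of x > y and y > x holds."""
    return (x > y) != (y > x)


if __name__ == "__main__":
    # The strings first differ at index 1: ' ' (U+0020) versus '.' (U+002E).
    print(a > b, b > a, exactly_one_greater(a, b))
```

Under this ordering '1 中文' sorts before '1.1', because the space character has a lower code point than the period.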

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Chii-Tung Liu (#1)
Re: BUG #1721: mutiple bytes character string comaprison error

"Chii-Tung Liu" <cdliou@mail.cyut.edu.tw> writes:

> PostgreSQL version: 8.0.3
> Operating system: Windows XP SP2
>
> When comparing two UTF-8 encoded strings that contain Chinese words, the
> result is always TRUE

Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
Use some other database encoding.

regards, tom lane

#3 Kris Jurka
books@ejurka.com
In reply to: Tom Lane (#2)
Re: BUG #1721: mutiple bytes character string comaprison

On Sun, 19 Jun 2005, Tom Lane wrote:

> "Chii-Tung Liu" <cdliou@mail.cyut.edu.tw> writes:
>
> > PostgreSQL version: 8.0.3
> > Operating system: Windows XP SP2
> >
> > When comparing two UTF-8 encoded strings that contain Chinese words, the
> > result is always TRUE
>
> Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> Use some other database encoding.

Shouldn't we forbid its creation then? At least a strongly worded
warning? We see these complaints too often.

Kris Jurka

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kris Jurka (#3)
Re: BUG #1721: mutiple bytes character string comaprison error

Kris Jurka <books@ejurka.com> writes:

> On Sun, 19 Jun 2005, Tom Lane wrote:
>
> > Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> > Use some other database encoding.
>
> Shouldn't we forbid its creation then?

There was serious discussion of that before the 8.0 release, but
we decided not to forbid it. Check the archives; I don't recall
the reasoning at the moment.

> We see these complaints too often.

There are lots of complaints we see way too often ;-) ... but
distressingly, there are still only 24 hours in a day.

regards, tom lane

#5 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Chii-Tung Liu (#1)
Re: BUG #1721: mutiple bytes character string comaprison

> The following bug has been logged online:
>
> Bug reference: 1721
> Logged by: Chii-Tung Liu
> Email address: cdliou@mail.cyut.edu.tw
> PostgreSQL version: 8.0.3
> Operating system: Windows XP SP2
> Description: mutiple bytes character string comaprison error
> Details:
>
> When comparing two UTF-8 encoded strings that contain Chinese words, the
> result is always TRUE.
> 1. Create a database named test with the encoding set to UNICODE:
> CREATE DATABASE test
> WITH OWNER = postgres
> ENCODING = 'UNICODE'
> TABLESPACE = pg_default;
> 2. Insert data containing Chinese words:
> INSERT INTO node (title) VALUES ('1 中文');
>
> 3. SELECT title FROM node WHERE title > '1.1 ';
> returns '1 中文'.
>
> 4. Both SELECT '1 中文' > '1.1' and SELECT '1.1' > '1 中文' return
> FALSE

I think you need to use the C locale.
--
Tatsuo Ishii

#6 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#4)
Re: BUG #1721: mutiple bytes character string comaprison error

Tom Lane wrote:

> Kris Jurka <books@ejurka.com> writes:
>
> > On Sun, 19 Jun 2005, Tom Lane wrote:
> >
> > > Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> > > Use some other database encoding.
> >
> > Shouldn't we forbid its creation then?
>
> There was serious discussion of that before the 8.0 release, but
> we decided not to forbid it. Check the archives; I don't recall
> the reasoning at the moment.

UTF8 encoding works with the C locale assuming you don't care about
ordering of the character set, e.g. Japanese.
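
A rough illustration of why the C locale sidesteps the problem: as far as I understand, comparison in the C locale is a plain byte-by-byte comparison, and UTF-8 is designed so that byte order matches code point order. The Python sketch below assumes that model; c_locale_key is a hypothetical helper, the titles are sample data, and none of this is PostgreSQL code:

```python
def c_locale_key(s: str) -> bytes:
    """Sort key approximating a C-locale comparison: the raw UTF-8 bytes."""
    return s.encode("utf-8")


# Sample data only; the first two values come from the bug report.
titles = ["1.1", "1 中文", "2 日本語", "1.2"]

by_bytes = sorted(titles, key=c_locale_key)
by_code_point = sorted(titles)  # Python's default str ordering

if __name__ == "__main__":
    print(by_bytes)
    # UTF-8 byte order and code point order agree.
    print(by_bytes == by_code_point)
```

The resulting order is consistent and repeatable, but it follows code points rather than any language's dictionary order, which is the trade-off described above.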

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#7 John Hansen
john@geeknet.com.au
In reply to: Bruce Momjian (#6)
Re: BUG #1721: mutiple bytes character string comaprison error

> UTF8 encoding works with the C locale assuming you don't care
> about ordering of the character set, e.g. Japanese.

Has anyone with the ability to compile postgresql on windows tested the
ICU patch?

... John

#8 Magnus Hagander
magnus@hagander.net
In reply to: John Hansen (#7)
Re: BUG #1721: mutiple bytes character string comaprison error

> > UTF8 encoding works with the C locale assuming you don't care about
> > ordering of the character set, e.g. Japanese.
>
> Has anyone with the ability to compile postgresql on windows
> tested the ICU patch?

Yes.
See http://archives.postgresql.org/pgsql-hackers/2005-05/msg00662.php

//Magnus

#9 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#6)
Re: BUG #1721: mutiple bytes character string comaprison

> Tom Lane wrote:
>
> > Kris Jurka <books@ejurka.com> writes:
> >
> > > On Sun, 19 Jun 2005, Tom Lane wrote:
> > >
> > > > Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
> > > > Use some other database encoding.
> > >
> > > Shouldn't we forbid its creation then?
> >
> > There was serious discussion of that before the 8.0 release, but
> > we decided not to forbid it. Check the archives; I don't recall
> > the reasoning at the moment.
>
> UTF8 encoding works with the C locale assuming you don't care about
> ordering of the character set, e.g. Japanese.

No, sometimes Japanese needs character ordering too, and I think this is
not a Windows-only problem. The real problem is that Unicode defines
character order in a totally random manner, because Chinese/Japanese/Korean
Kanji characters are "Unified" in Unicode. To solve the problem, we can
convert UTF8 to EUC_JP using CONVERT. See the archives for more details.
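
To illustrate the EUC_JP idea in the paragraph above: sorting the same Kanji by their UTF-8 bytes and by their EUC_JP bytes can give different orders. A minimal Python sketch, assuming Python's built-in euc_jp codec as a stand-in for PostgreSQL's CONVERT; the helper names and the two sample characters are chosen for illustration only:

```python
def utf8_key(s: str) -> bytes:
    """Sort key: the string's UTF-8 bytes (follows Unicode code point order)."""
    return s.encode("utf-8")


def euc_jp_key(s: str) -> bytes:
    """Sort key: the string's EUC_JP bytes (follows the JIS character set order)."""
    return s.encode("euc_jp")


# Two sample Kanji whose relative order differs between the two encodings.
kanji = ["一", "亜"]

by_utf8 = sorted(kanji, key=utf8_key)
by_euc_jp = sorted(kanji, key=euc_jp_key)

if __name__ == "__main__":
    print("UTF-8 byte order: ", by_utf8)
    print("EUC_JP byte order:", by_euc_jp)
```

Here 一 sorts before 亜 by UTF-8 bytes, while 亜 sorts first by EUC_JP bytes, so converting before comparing changes the result even with a plain byte-wise comparison.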

Or you can use a Unicode locale, but only if your platform's locale database
is not broken and you use only a single locale.
--
Tatsuo Ishii