BUG #1721: mutiple bytes character string comaprison error

Started by Chii-Tung Liu, almost 21 years ago, 9 messages, bugs

cdliou@mail.cyut.edu.tw

almost 21 years ago

The following bug has been logged online:

Bug reference: 1721
Logged by: Chii-Tung Liu
Email address: cdliou@mail.cyut.edu.tw
PostgreSQL version: 8.0.3
Operating system: Windows XP SP2
Description: mutiple bytes character string comaprison error
Details:

When comparing two UTF-8 encoded strings that contain Chinese words, the
result is always TRUE.
1. Create a database test with encoding set to unicode:
CREATE DATABASE test
WITH OWNER = postgres
ENCODING = 'UNICODE'
TABLESPACE = pg_default;
2. Insert data with Chinese words:
INSERT INTO node (title) VALUES ('1 中文');

3. SELECT title FROM node WHERE title > '1.1 '
returns '1 中文'.

4. Both SELECT '1 中文' > '1.1' and SELECT '1.1' > '1 中文' return
FALSE.

Tom Lane

tgl@sss.pgh.pa.us

almost 21 years ago

In reply to: Chii-Tung Liu (#1)

Re: BUG #1721: mutiple bytes character string comaprison error

"Chii-Tung Liu" <cdliou@mail.cyut.edu.tw> writes:

PostgreSQL version: 8.0.3
Operating system: Windows XP SP2

When compare two UTF-8 encoded string that contains Chinese words, the
result is always TRUE

Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
Use some other database encoding.

regards, tom lane

Kris Jurka

books@ejurka.com

almost 21 years ago

In reply to: Tom Lane (#2)

Re: BUG #1721: mutiple bytes character string comaprison

On Sun, 19 Jun 2005, Tom Lane wrote:

"Chii-Tung Liu" <cdliou@mail.cyut.edu.tw> writes:

PostgreSQL version: 8.0.3
Operating system: Windows XP SP2

When compare two UTF-8 encoded string that contains Chinese words, the
result is always TRUE

Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
Use some other database encoding.

Shouldn't we forbid its creation then? At least a strongly worded
warning? We see these complaints too often.

Kris Jurka

Tom Lane

tgl@sss.pgh.pa.us

almost 21 years ago

In reply to: Kris Jurka (#3)

Re: BUG #1721: mutiple bytes character string comaprison error

Kris Jurka <books@ejurka.com> writes:

On Sun, 19 Jun 2005, Tom Lane wrote:

Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
Use some other database encoding.

Shouldn't we forbid its creation then?

There was serious discussion of that before the 8.0 release, but
we decided not to forbid it. Check the archives; I don't recall
the reasoning at the moment.

We see these complaints too often.

There are lots of complaints we see way too often ;-) ... but
distressingly, there are still only 24 hours in a day.

regards, tom lane

Tatsuo Ishii

t-ishii@sra.co.jp

almost 21 years ago

In reply to: Chii-Tung Liu (#1)

Re: BUG #1721: mutiple bytes character string comaprison

I think you need to use the C locale.
--
Tatsuo Ishii
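
For readers following the C locale suggestion, here is a minimal sketch of what a byte-wise comparison gives for the strings in the report. It assumes that comparing under the C locale amounts to comparing the raw UTF-8 bytes; the helper name c_locale_compare is hypothetical and this is not PostgreSQL code.

```python
# Sketch only: models a C-locale comparison as a plain comparison of the
# UTF-8 bytes. c_locale_compare is a hypothetical helper, not a PostgreSQL API.

def c_locale_compare(a: str, b: str) -> int:
    """Return -1, 0 or 1 by comparing the UTF-8 bytes of a and b."""
    a_bytes = a.encode("utf-8")
    b_bytes = b.encode("utf-8")
    return (a_bytes > b_bytes) - (a_bytes < b_bytes)

left, right = "1 中文", "1.1"

# A consistent ordering must give opposite signs when the operands are swapped.
forward = c_locale_compare(left, right)
backward = c_locale_compare(right, left)
print(forward, backward)
```

Under this assumption the two strings differ at the second byte (a space versus a period), so exactly one of the two comparisons in step 4 of the report should be true, rather than both being false.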

Bruce Momjian

bruce@momjian.us

almost 21 years ago

In reply to: Tom Lane (#4)

Re: BUG #1721: mutiple bytes character string comaprison error

Tom Lane wrote:

Kris Jurka <books@ejurka.com> writes:

On Sun, 19 Jun 2005, Tom Lane wrote:

Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
Use some other database encoding.

Shouldn't we forbid its creation then?

There was serious discussion of that before the 8.0 release, but
we decided not to forbid it. Check the archives; I don't recall
the reasoning at the moment.

UTF8 encoding works with the C locale assuming you don't care about
ordering of the character set, e.g. Japanese.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

John Hansen

john@geeknet.com.au

almost 21 years ago

In reply to: Bruce Momjian (#6)

Re: BUG #1721: mutiple bytes character string comaprison error

UTF8 encoding works with the C locale assuming you don't care
about ordering of the character set, e.g. Japanese.

Has anyone with the ability to compile postgresql on windows tested the
ICU patch?

... John

Magnus Hagander

magnus@hagander.net

almost 21 years ago

In reply to: John Hansen (#7)

Re: BUG #1721: mutiple bytes character string comaprison error

UTF8 encoding works with the C locale assuming you don't care about
ordering of the character set, e.g. Japanese.

Has anyone with the ability to compile postgresql on windows
tested the ICU patch?

Yes.
See http://archives.postgresql.org/pgsql-hackers/2005-05/msg00662.php

//Magnus

Tatsuo Ishii

t-ishii@sra.co.jp

almost 21 years ago

In reply to: Bruce Momjian (#6)

Re: BUG #1721: mutiple bytes character string comaprison

Tom Lane wrote:

Kris Jurka <books@ejurka.com> writes:

On Sun, 19 Jun 2005, Tom Lane wrote:

Sorry, but UTF-8 encoding doesn't work properly on Windows (yet).
Use some other database encoding.

Shouldn't we forbid its creation then?

There was serious discussion of that before the 8.0 release, but
we decided not to forbid it. Check the archives; I don't recall
the reasoning at the moment.

UTF8 encoding works with the C locale assuming you don't care about
ordering of the character set, e.g. Japanese.

No, sometimes Japanese needs character ordering too, and I think this is
not a Windows-only problem. The real problem is that Unicode defines
character order in a totally random manner, because Chinese/Japanese/Korean
Kanji characters are "Unified" in Unicode. To solve the problem, we can
convert UTF8 to EUC_JP using CONVERT. See the archives for more details.

Or you can use a Unicode locale, but only if your platform's locale database
is not broken and you only use a single locale.
--
Tatsuo Ishii
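
The point about converting to EUC_JP before ordering can be illustrated outside the database. The sketch below uses Python's built-in euc_jp codec as a stand-in for the CONVERT call mentioned above (an assumption; it is not the PostgreSQL function), and the sample characters are chosen only for illustration.

```python
# Sketch only: sort the same kanji by their UTF-8 bytes and by their EUC-JP
# bytes. Python's "euc_jp" codec stands in for the database-side conversion.

words = ["哀", "阿", "愛", "亜"]

# UTF-8 byte order follows Unicode code point order.
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))

# EUC-JP byte order follows the JIS character set layout instead.
by_eucjp = sorted(words, key=lambda s: s.encode("euc_jp"))

print("UTF-8 order: ", by_utf8)
print("EUC-JP order:", by_eucjp)
print("orders differ:", by_utf8 != by_eucjp)
```

The two sort keys place the same characters in different orders, which is why sorting on the converted value can give a result closer to what Japanese users expect than sorting on the UTF-8 bytes.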