UTF-8 and LIKE vs =

peter_e@gmx.net

almost 22 years ago

In reply to: David E. Wheeler (#1)

Re: UTF-8 and LIKE vs =

David Wheeler wrote:

Any idea why = works here and LIKE and ~ wouldn't?

Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

david@kineticode.com

almost 22 years ago

In reply to: Peter Eisentraut (#2)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote:

Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.

Thanks. So I need to set the locale to C and then LIKE will work
properly? How do I go about doing that? I can see these options:

LC_COLLATE
String sort order

LC_CTYPE
Character classification (What is a letter? The upper-case equivalent?)

LC_MESSAGES
Language of messages

LC_MONETARY
Formatting of currency amounts

LC_NUMERIC
Formatting of numbers

LC_TIME
Formatting of dates and times

Is one of these the one I need to set?

Thanks,

David

barwick@gmail.com

almost 22 years ago

In reply to: David E. Wheeler (#1)

Re: UTF-8 and LIKE vs =

On Mon, 23 Aug 2004 12:41:30 -0700, David Wheeler <david@kineticode.com> wrote:
(...)

bric=# select version();
version
------------------------------------------------------------------------
---------------------------------
PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
20030222 (Red Hat Linux 3.2.2-5)
(1 row)

bric=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

FWIW (on 7.4.3):

test=# select * from t1 ;
id | value
----+--------
1 | 日本
2 | 日本語
3 | 北海道
(3 rows)

test=# select * from t1 where value ~ '日';
id | value
----+--------
1 | 日本
2 | 日本語
(2 rows)

test=# select * from t1 where value like '日%';
id | value
----+--------
1 | 日本
2 | 日本語

test=# select * from t1 where value like '北海%';
id | value
----+--------
3 | 北海道
(1 row)

Ian Barwick
barwick@gmail.net

david@kineticode.com

almost 22 years ago

In reply to: Ian Lawrence Barwick (#4)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

I noticed this because I had a query that was looking in the keyword
table for an existing record using LIKE. If it didn't find it, it
inserted it. But the inserts were giving me an error because the name
column has a UNIQUE index on it. Could it be that the index and the =
operator are comparing bytes, and that '국방비' and '북한의' have the same
bytes but different characters??

If so, this is a pretty serious problem. How can I get = and the
indices to use character semantics rather than byte semantics? I also
need to be able to store data in different languages in the database
(and in the same column!), but all in Unicode.

TIA,

David

barwick@gmail.com

almost 22 years ago

In reply to: David E. Wheeler (#5)

Re: UTF-8 and LIKE vs =

On Mon, 23 Aug 2004 14:04:05 -0700, David Wheeler <david@kineticode.com> wrote:

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

I noticed this because I had a query that was looking in the keyword
table for an existing record using LIKE. If it didn't find it, it
inserted it. But the inserts were giving me an error because the name
column has a UNIQUE index on it. Could it be that the index and the =
operator are comparing bytes, and that '국방비' and '북한의' have the same
bytes but different characters??

If so, this is a pretty serious problem. How can I get = and the
indices to use character semantics rather than byte semantics? I also
need to be able to store data in different languages in the database
(and in the same column!), but all in Unicode.

I don't know what the problem is, but you might want to check the
client encoding settings, and the encoding your characters are
arriving in (remembering all the time, in Postgres "UNICODE" really
means UTF-8).

If you're using Perl (I'm guessing this is Bricolage-related) the
"_utf8_on"-ness of strings might be worth checking too, and also the
"pg_enable_utf8" flag in DBD::Pg.

Ian Barwick
barwick@gmail.net

Robert Treat

xzilla@users.sourceforge.net

almost 22 years ago

In reply to: David E. Wheeler (#3)

Re: UTF-8 and LIKE vs =

On Mon, 2004-08-23 at 16:43, David Wheeler wrote:

On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote:

Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.

Thanks. So I need to set the locale to C and then LIKE will work
properly? How do I go about doing that? I can see these options:

LC_COLLATE
String sort order

LC_CTYPE
Character classification (What is a letter? The upper-case equivalent?)

LC_MESSAGES
Language of messages

LC_MONETARY
Formatting of currency amounts

LC_NUMERIC
Formatting of numbers

LC_TIME
Formatting of dates and times

Is one of these the one I need to set?

initdb is your friend. (well, not really, but that's where your headed)

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

david@kineticode.com

almost 22 years ago

In reply to: David E. Wheeler (#1)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 2:25 PM, Ian Barwick wrote:

I don't know what the problem is, but you might want to check the
client encoding settings, and the encoding your characters are
arriving in (remembering all the time, in Postgres "UNICODE" really
means UTF-8).

If you're using Perl (I'm guessing this is Bricolage-related) the
"_utf8_on"-ness of strings might be worth checking too, and also the
"pg_enable_utf8" flag in DBD::Pg.

Bricolage is getting all its content at UTF-8. It has been working
beautifully for some time. I tried setting the utf8 flag on the
variable passed to the query, but it made no difference.

I think that LIKE is doing the right thing, and = is not. And I need to
find out how to get = to do the right thing. If I need to dump my
database and run initdb to use C for LC_COLLATE, the, feh, I will.
Right now I have:

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

Regards,

David

Import Notes

Reply to msg id not found: 7771649.1093296055158.JavaMail.root@hercules

david@kineticode.com

almost 22 years ago

In reply to: Robert Treat (#7)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 2:31 PM, Robert Treat wrote:

initdb is your friend. (well, not really, but that's where your headed)

Yes, that's what I'm beginning to suspect.

Cheers,

David

#10

david@kineticode.com

almost 22 years ago

In reply to: David E. Wheeler (#8)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 2:46 PM, David Wheeler wrote:

I think that LIKE is doing the right thing, and = is not. And I need
to find out how to get = to do the right thing. If I need to dump my
database and run initdb to use C for LC_COLLATE, the, feh, I will.
Right now I have:

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

And testing on another box with these set to "C", it seems to work
properly.

*sigh* Nothing like dumping and restoring a 2.7 GB database to keep me
engaged in what I'm doing, eh?

Regards,

David

#11

tgl@sss.pgh.pa.us

almost 22 years ago

In reply to: David E. Wheeler (#5)

Re: UTF-8 and LIKE vs =

David Wheeler <david@kineticode.com> writes:

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '=B1=B9=B9=E6=BA=F1' vs. '=BA=CF=C7=D1=C0=C7' - does that have an=

y bearing?

Yes, it means that = is doing the wrong thing!!

I have seen this happen in situations where the strings contained
character sequences that were illegal according to the encoding that the
locale thought was in force. (It seems that strcoll() will return more
or less random results in such cases...) In particular, given that you
have

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

you are at risk if the data is not legal UTF-8 strings.

The real question therefore is whether you have the database encoding
set correctly --- ie, is it UNICODE (== UTF8)? If not then it may well
be that Postgres is presenting strings to strcoll() that the latter will
choke on.

regards, tom lane

#12

Markus Bertheau

twanger@bluetwanger.de

almost 22 years ago

In reply to: David E. Wheeler (#5)

Re: UTF-8 and LIKE vs =

В Пнд, 23.08.2004, в 23:04, David Wheeler пишет:

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$

Make sure that you have initdb'd the database under the right locale.
There's not much PostgreSQL can do if strcoll() says that the strings
are equal.

--
Markus Bertheau <twanger@bluetwanger.de>

#13

david@kineticode.com

almost 22 years ago

In reply to: Tom Lane (#11)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 3:44 PM, Tom Lane wrote:

Yes, it means that = is doing the wrong thing!!

I have seen this happen in situations where the strings contained
character sequences that were illegal according to the encoding that
the
locale thought was in force. (It seems that strcoll() will return more
or less random results in such cases...) In particular, given that you
have

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

you are at risk if the data is not legal UTF-8 strings.

But is it possible to store non-UTF-8 data in a UNICODE database?

The real question therefore is whether you have the database encoding
set correctly --- ie, is it UNICODE (== UTF8)? If not then it may well
be that Postgres is presenting strings to strcoll() that the latter
will
choke on.

The database is UNICODE.

I plan to dump it, run initdb with LC_COLLATE and LC_CTYPE both set to
"C", and restore the database and see if that helps.

Thanks,

David

#14

david@kineticode.com

almost 22 years ago

In reply to: Markus Bertheau (#12)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 3:46 PM, Markus Bertheau wrote:

The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$

Interesting.

Make sure that you have initdb'd the database under the right locale.
There's not much PostgreSQL can do if strcoll() says that the strings
are equal.

Well, I have data from a number of different locales in the same
database. I'm hoping that setting the locale to "C" will do the trick.
It seems to work properly on my Mac:

sharky=# select * from keyword where name = '국방비';
id | name | screen_name | sort_name | active
----+--------+-------------+-----------+--------
0 | 국방비 | 국방비 | 국방비 | 1
(1 row)

sharky=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

sharky=# select * from keyword where name like '북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

sharky=# select * from keyword where lower(name) like '국방비';
id | name | screen_name | sort_name | active
----+--------+-------------+-----------+--------
0 | 국방비 | 국방비 | 국방비 | 1
(1 row)

Regards,

David

#15

tgl@sss.pgh.pa.us

almost 22 years ago

In reply to: David E. Wheeler (#13)

Re: UTF-8 and LIKE vs =

David Wheeler <david@kineticode.com> writes:

But is it possible to store non-UTF-8 data in a UNICODE database?

In theory not ... but I think there was a discussion earlier that
concluded that our check for encoding validity is not airtight ...

regards, tom lane

#16

david@kineticode.com

almost 22 years ago

In reply to: Tom Lane (#15)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 3:59 PM, Tom Lane wrote:

But is it possible to store non-UTF-8 data in a UNICODE database?

In theory not ... but I think there was a discussion earlier that
concluded that our check for encoding validity is not airtight ...

Well, it it was mostly right, I wouldn't expect it to be a problem as
much as this issue is coming up for me. If, OTOH, the encoding validity
check leaks like a sieve, then I might indeed have a bigger problem.

Is the encoding check fixed in 8.0beta1?

Thanks,

David

#17

tgl@sss.pgh.pa.us

almost 22 years ago

In reply to: David E. Wheeler (#16)

Re: UTF-8 and LIKE vs =

David Wheeler <david@kineticode.com> writes:

Is the encoding check fixed in 8.0beta1?

[ looks back at discussion... ] Actually I misremembered --- the
discussion was about how we would *reject* legal UTF-8 codes that are
more than 2 bytes long. So the code is broken, but not in the direction
that would cause your problem. Time for another theory.

Is the problem query using an index? If so, does REINDEX help?

regards, tom lane

#18

david@kineticode.com

almost 22 years ago

In reply to: Tom Lane (#17)

Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 4:08 PM, Tom Lane wrote:

[ looks back at discussion... ] Actually I misremembered --- the
discussion was about how we would *reject* legal UTF-8 codes that are
more than 2 bytes long. So the code is broken, but not in the
direction
that would cause your problem. Time for another theory.

Whew!

Is the problem query using an index? If so, does REINDEX help?

Doesn't look like it:

bric=# reindex index udx_keyword__name;
REINDEX
bric=# select * from keyword where name ='북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)

That's still giving me an invalid row for the value I passed to it
(note that the value of the "name" column is different than the value I
queried for).

Regards,

David

#19

barwick@gmail.com

almost 22 years ago

In reply to: David E. Wheeler (#1)

Re: UTF-8 and LIKE vs =

On Tue, 24 Aug 2004 00:46:50 +0200, Markus Bertheau
<twanger@bluetwanger.de> wrote:

В Пнд, 23.08.2004, в 23:04, David Wheeler пишет:

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$

wild speculation in need of a Korean speaker, but:

ian@linux:~/tmp> cat j.txt
テスト
환경설
전검색
웹문서
국방비
북한의
てすと
ian@linux:~/tmp> uniq j.txt
テスト
환경설
てすと

All but the first and last lines are random Korean (Hangul)
characters. Evidently our respective locales think all Hangul strings
of the same length are identical, which is very probably not the
case...

Ian Barwick

Import Notes

Reply to msg id not found: 9221266.1093301455156.JavaMail.root@hercules

#20

tgl@sss.pgh.pa.us

almost 22 years ago

In reply to: David E. Wheeler (#18)

Re: UTF-8 and LIKE vs =

David Wheeler <david@kineticode.com> writes:

Is the problem query using an index? If so, does REINDEX help?

Doesn't look like it:

bric=3D# reindex index udx_keyword__name;
REINDEX
bric=3D# select * from keyword where name =3D'=BA=CF=C7=D1=C0=C7';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1=
| 1
(1 row)

Hmm. I tried putting your string into a UNICODE database and I got
ERROR: invalid byte sequence for encoding "UNICODE": 0xc7

So there's something funny happening here. What is your client_encoding
setting?

regards, tom lane

#21

david@kineticode.com

almost 22 years ago

In reply to: Tom Lane (#20)

#22

david@kineticode.com

almost 22 years ago

In reply to: Ian Lawrence Barwick (#19)

#23

david@kineticode.com

almost 22 years ago

In reply to: David E. Wheeler (#21)

#24

barwick@gmail.com

almost 22 years ago

In reply to: David E. Wheeler (#22)

#25

david@kineticode.com

almost 22 years ago

In reply to: Ian Lawrence Barwick (#24)

#26

ishii@postgresql.org

almost 22 years ago

In reply to: Ian Lawrence Barwick (#19)

#27

david@kineticode.com

almost 22 years ago

In reply to: David E. Wheeler (#5)

#28

Tim Allen

tim@proximity.com.au

almost 22 years ago

In reply to: Tom Lane (#20)

#29

david@kineticode.com

almost 22 years ago

In reply to: Tim Allen (#28)

#30

rees@ddcom.co.jp

almost 22 years ago

In reply to: Ian Lawrence Barwick (#19)

#31

tgl@sss.pgh.pa.us

almost 22 years ago

In reply to: David E. Wheeler (#23)

#32

david@kineticode.com

almost 22 years ago

In reply to: Tom Lane (#31)

#33

david@kineticode.com

almost 22 years ago

In reply to: Ian Lawrence Barwick (#19)

#34

rees@ddcom.co.jp

almost 22 years ago

In reply to: David E. Wheeler (#33)

#35

ishii@postgresql.org

almost 22 years ago

In reply to: Joel (#34)

#36

rees@ddcom.co.jp

almost 22 years ago

In reply to: Tatsuo Ishii (#35)

#37

ishii@postgresql.org

almost 22 years ago

In reply to: Joel (#36)

#38

ishii@postgresql.org

almost 22 years ago

In reply to: Ian Lawrence Barwick (#19)

#39

peter_e@gmx.net

almost 22 years ago

In reply to: David E. Wheeler (#27)

#40

david@kineticode.com

almost 22 years ago

In reply to: Joel (#34)

#41

david@kineticode.com

almost 22 years ago

In reply to: Tatsuo Ishii (#35)

#42

david@kineticode.com

almost 22 years ago

In reply to: Peter Eisentraut (#39)

#43

peter_e@gmx.net

almost 22 years ago

In reply to: David E. Wheeler (#42)

#44

david@kineticode.com

almost 22 years ago

In reply to: Peter Eisentraut (#43)

#45

peter_e@gmx.net

almost 22 years ago

In reply to: David E. Wheeler (#44)

#46

david@kineticode.com

almost 22 years ago

In reply to: Peter Eisentraut (#45)

#47

ishii@postgresql.org

almost 22 years ago

In reply to: Peter Eisentraut (#45)

#48

rees@ddcom.co.jp

almost 22 years ago

In reply to: David E. Wheeler (#40)

#49

rees@ddcom.co.jp

almost 22 years ago

In reply to: Peter Eisentraut (#45)

#50

rees@ddcom.co.jp

almost 22 years ago

In reply to: Tatsuo Ishii (#37)

#51

ishii@postgresql.org

almost 22 years ago

In reply to: Joel (#48)

#52

Lincoln Yeoh

lyeoh@pop.jaring.my

almost 22 years ago

In reply to: Peter Eisentraut (#45)

#53

david@kineticode.com

almost 22 years ago

In reply to: Joel (#48)

#54

david@kineticode.com

almost 22 years ago

In reply to: Tatsuo Ishii (#51)

#55

david@kineticode.com

almost 22 years ago

In reply to: Lincoln Yeoh (#52)

#56

Lincoln Yeoh

lyeoh@pop.jaring.my

almost 22 years ago

In reply to: David E. Wheeler (#55)

#57

rees@ddcom.co.jp

almost 22 years ago

In reply to: Joel (#48)

#58

david@kineticode.com

almost 22 years ago

In reply to: Joel (#57)

#59

rees@ddcom.co.jp

almost 22 years ago

In reply to: David E. Wheeler (#58)

#60

david@kineticode.com

almost 22 years ago

In reply to: Joel (#57)

#61

Michael Glaesemann

grzm@seespotcode.net

almost 22 years ago

In reply to: David E. Wheeler (#60)

#62

david@kineticode.com

almost 22 years ago

In reply to: Michael Glaesemann (#61)

#63