UTF-8 and LIKE vs =

Started by David E. Wheelerover 21 years ago63 messagesgeneral
Jump to latest
#1David E. Wheeler
david@kineticode.com

Hi All,

I'm having some trouble with multibyte characters and LIKE. We've been
using LIKE instead of = for string queries for a long time, as it gives
us flexibility to use wildcards such as "%" when we need to and get the
same results as with = by not using them. But I've just found that it
sometimes doesn't work properly:

bric=# select version();
version
------------------------------------------------------------------------
---------------------------------
PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
20030222 (Red Hat Linux 3.2.2-5)
(1 row)

bric=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)

bric=# select * from keyword where name LIKE '북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

bric=# select * from keyword where name ~ '^북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

Any idea why = works here and LIKE and ~ wouldn't?

TIA,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#2Peter Eisentraut
peter_e@gmx.net
In reply to: David E. Wheeler (#1)
Re: UTF-8 and LIKE vs =

David Wheeler wrote:

Any idea why = works here and LIKE and ~ wouldn't?

Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#3David E. Wheeler
david@kineticode.com
In reply to: Peter Eisentraut (#2)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote:

Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.

Thanks. So I need to set the locale to C and then LIKE will work
properly? How do I go about doing that? I can see these options:

LC_COLLATE
String sort order

LC_CTYPE
Character classification (What is a letter? The upper-case equivalent?)

LC_MESSAGES
Language of messages

LC_MONETARY
Formatting of currency amounts

LC_NUMERIC
Formatting of numbers

LC_TIME
Formatting of dates and times

Is one of these the one I need to set?

Thanks,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#4Ian Lawrence Barwick
barwick@gmail.com
In reply to: David E. Wheeler (#1)
Re: UTF-8 and LIKE vs =

On Mon, 23 Aug 2004 12:41:30 -0700, David Wheeler <david@kineticode.com> wrote:
(...)

bric=# select version();
version
------------------------------------------------------------------------
---------------------------------
PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2
20030222 (Red Hat Linux 3.2.2-5)
(1 row)

bric=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

FWIW (on 7.4.3):

test=# select * from t1 ;
id | value
----+--------
1 | 日本
2 | 日本語
3 | 北海道
(3 rows)

test=# select * from t1 where value ~ '日';
id | value
----+--------
1 | 日本
2 | 日本語
(2 rows)

test=# select * from t1 where value like '日%';
id | value
----+--------
1 | 日本
2 | 日本語

test=# select * from t1 where value like '北海%';
id | value
----+--------
3 | 北海道
(1 row)

Ian Barwick
barwick@gmail.net

#5David E. Wheeler
david@kineticode.com
In reply to: Ian Lawrence Barwick (#4)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

I noticed this because I had a query that was looking in the keyword
table for an existing record using LIKE. If it didn't find it, it
inserted it. But the inserts were giving me an error because the name
column has a UNIQUE index on it. Could it be that the index and the =
operator are comparing bytes, and that '국방비' and '북한의' have the same
bytes but different characters??

If so, this is a pretty serious problem. How can I get = and the
indices to use character semantics rather than byte semantics? I also
need to be able to store data in different languages in the database
(and in the same column!), but all in Unicode.

TIA,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#6Ian Lawrence Barwick
barwick@gmail.com
In reply to: David E. Wheeler (#5)
Re: UTF-8 and LIKE vs =

On Mon, 23 Aug 2004 14:04:05 -0700, David Wheeler <david@kineticode.com> wrote:

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

I noticed this because I had a query that was looking in the keyword
table for an existing record using LIKE. If it didn't find it, it
inserted it. But the inserts were giving me an error because the name
column has a UNIQUE index on it. Could it be that the index and the =
operator are comparing bytes, and that '국방비' and '북한의' have the same
bytes but different characters??

If so, this is a pretty serious problem. How can I get = and the
indices to use character semantics rather than byte semantics? I also
need to be able to store data in different languages in the database
(and in the same column!), but all in Unicode.

I don't know what the problem is, but you might want to check the
client encoding settings, and the encoding your characters are
arriving in (remembering all the time, in Postgres "UNICODE" really
means UTF-8).

If you're using Perl (I'm guessing this is Bricolage-related) the
"_utf8_on"-ness of strings might be worth checking too, and also the
"pg_enable_utf8" flag in DBD::Pg.

Ian Barwick
barwick@gmail.net

#7Robert Treat
xzilla@users.sourceforge.net
In reply to: David E. Wheeler (#3)
Re: UTF-8 and LIKE vs =

On Mon, 2004-08-23 at 16:43, David Wheeler wrote:

On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote:

Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.

Thanks. So I need to set the locale to C and then LIKE will work
properly? How do I go about doing that? I can see these options:

LC_COLLATE
String sort order

LC_CTYPE
Character classification (What is a letter? The upper-case equivalent?)

LC_MESSAGES
Language of messages

LC_MONETARY
Formatting of currency amounts

LC_NUMERIC
Formatting of numbers

LC_TIME
Formatting of dates and times

Is one of these the one I need to set?

initdb is your friend. (well, not really, but that's where your headed)

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

#8David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#1)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 2:25 PM, Ian Barwick wrote:

I don't know what the problem is, but you might want to check the
client encoding settings, and the encoding your characters are
arriving in (remembering all the time, in Postgres "UNICODE" really
means UTF-8).

If you're using Perl (I'm guessing this is Bricolage-related) the
"_utf8_on"-ness of strings might be worth checking too, and also the
"pg_enable_utf8" flag in DBD::Pg.

Bricolage is getting all its content at UTF-8. It has been working
beautifully for some time. I tried setting the utf8 flag on the
variable passed to the query, but it made no difference.

I think that LIKE is doing the right thing, and = is not. And I need to
find out how to get = to do the right thing. If I need to dump my
database and run initdb to use C for LC_COLLATE, the, feh, I will.
Right now I have:

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

Regards,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#9David E. Wheeler
david@kineticode.com
In reply to: Robert Treat (#7)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 2:31 PM, Robert Treat wrote:

initdb is your friend. (well, not really, but that's where your headed)

Yes, that's what I'm beginning to suspect.

Cheers,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#10David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#8)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 2:46 PM, David Wheeler wrote:

I think that LIKE is doing the right thing, and = is not. And I need
to find out how to get = to do the right thing. If I need to dump my
database and run initdb to use C for LC_COLLATE, the, feh, I will.
Right now I have:

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

And testing on another box with these set to "C", it seems to work
properly.

*sigh* Nothing like dumping and restoring a 2.7 GB database to keep me
engaged in what I'm doing, eh?

Regards,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#5)
Re: UTF-8 and LIKE vs =

David Wheeler <david@kineticode.com> writes:

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '=B1=B9=B9=E6=BA=F1' vs. '=BA=CF=C7=D1=C0=C7' - does that have an=

y bearing?

Yes, it means that = is doing the wrong thing!!

I have seen this happen in situations where the strings contained
character sequences that were illegal according to the encoding that the
locale thought was in force. (It seems that strcoll() will return more
or less random results in such cases...) In particular, given that you
have

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

you are at risk if the data is not legal UTF-8 strings.

The real question therefore is whether you have the database encoding
set correctly --- ie, is it UNICODE (== UTF8)? If not then it may well
be that Postgres is presenting strings to strcoll() that the latter will
choke on.

regards, tom lane

#12Markus Bertheau
twanger@bluetwanger.de
In reply to: David E. Wheeler (#5)
Re: UTF-8 and LIKE vs =

В Пнд, 23.08.2004, в 23:04, David Wheeler пишет:

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$

Make sure that you have initdb'd the database under the right locale.
There's not much PostgreSQL can do if strcoll() says that the strings
are equal.

--
Markus Bertheau <twanger@bluetwanger.de>

#13David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#11)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 3:44 PM, Tom Lane wrote:

Yes, it means that = is doing the wrong thing!!

I have seen this happen in situations where the strings contained
character sequences that were illegal according to the encoding that
the
locale thought was in force. (It seems that strcoll() will return more
or less random results in such cases...) In particular, given that you
have

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

you are at risk if the data is not legal UTF-8 strings.

But is it possible to store non-UTF-8 data in a UNICODE database?

The real question therefore is whether you have the database encoding
set correctly --- ie, is it UNICODE (== UTF8)? If not then it may well
be that Postgres is presenting strings to strcoll() that the latter
will
choke on.

The database is UNICODE.

$ psql -U postgres -l
List of databases
Name | Owner | Encoding
-----------+----------+-----------
bric | postgres | UNICODE
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII
(3 rows)

I plan to dump it, run initdb with LC_COLLATE and LC_CTYPE both set to
"C", and restore the database and see if that helps.

Thanks,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#14David E. Wheeler
david@kineticode.com
In reply to: Markus Bertheau (#12)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 3:46 PM, Markus Bertheau wrote:

The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$

Interesting.

Make sure that you have initdb'd the database under the right locale.
There's not much PostgreSQL can do if strcoll() says that the strings
are equal.

Well, I have data from a number of different locales in the same
database. I'm hoping that setting the locale to "C" will do the trick.
It seems to work properly on my Mac:

sharky=# select * from keyword where name = '국방비';
id | name | screen_name | sort_name | active
----+--------+-------------+-----------+--------
0 | 국방비 | 국방비 | 국방비 | 1
(1 row)

sharky=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

sharky=# select * from keyword where name like '북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)

sharky=# select * from keyword where lower(name) like '국방비';
id | name | screen_name | sort_name | active
----+--------+-------------+-----------+--------
0 | 국방비 | 국방비 | 국방비 | 1
(1 row)

Regards,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#13)
Re: UTF-8 and LIKE vs =

David Wheeler <david@kineticode.com> writes:

But is it possible to store non-UTF-8 data in a UNICODE database?

In theory not ... but I think there was a discussion earlier that
concluded that our check for encoding validity is not airtight ...

regards, tom lane

#16David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#15)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 3:59 PM, Tom Lane wrote:

But is it possible to store non-UTF-8 data in a UNICODE database?

In theory not ... but I think there was a discussion earlier that
concluded that our check for encoding validity is not airtight ...

Well, it it was mostly right, I wouldn't expect it to be a problem as
much as this issue is coming up for me. If, OTOH, the encoding validity
check leaks like a sieve, then I might indeed have a bigger problem.

Is the encoding check fixed in 8.0beta1?

Thanks,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#16)
Re: UTF-8 and LIKE vs =

David Wheeler <david@kineticode.com> writes:

Is the encoding check fixed in 8.0beta1?

[ looks back at discussion... ] Actually I misremembered --- the
discussion was about how we would *reject* legal UTF-8 codes that are
more than 2 bytes long. So the code is broken, but not in the direction
that would cause your problem. Time for another theory.

Is the problem query using an index? If so, does REINDEX help?

regards, tom lane

#18David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#17)
Re: UTF-8 and LIKE vs =

On Aug 23, 2004, at 4:08 PM, Tom Lane wrote:

[ looks back at discussion... ] Actually I misremembered --- the
discussion was about how we would *reject* legal UTF-8 codes that are
more than 2 bytes long. So the code is broken, but not in the
direction
that would cause your problem. Time for another theory.

Whew!

Is the problem query using an index? If so, does REINDEX help?

Doesn't look like it:

bric=# reindex index udx_keyword__name;
REINDEX
bric=# select * from keyword where name ='북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)

That's still giving me an invalid row for the value I passed to it
(note that the value of the "name" column is different than the value I
queried for).

Regards,

David

Attachments:

smime.p7sapplication/pkcs7-signature; name=smime.p7sDownload
#19Ian Lawrence Barwick
barwick@gmail.com
In reply to: David E. Wheeler (#1)
Re: UTF-8 and LIKE vs =

On Tue, 24 Aug 2004 00:46:50 +0200, Markus Bertheau
<twanger@bluetwanger.de> wrote:

В Пнд, 23.08.2004, в 23:04, David Wheeler пишет:

On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:

er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!

The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$

wild speculation in need of a Korean speaker, but:

ian@linux:~/tmp> cat j.txt
テスト
환경설
전검색
웹문서
국방비
북한의
てすと
ian@linux:~/tmp> uniq j.txt
テスト
환경설
てすと

All but the first and last lines are random Korean (Hangul)
characters. Evidently our respective locales think all Hangul strings
of the same length are identical, which is very probably not the
case...

Ian Barwick

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#18)
Re: UTF-8 and LIKE vs =

David Wheeler <david@kineticode.com> writes:

Is the problem query using an index? If so, does REINDEX help?

Doesn't look like it:

bric=3D# reindex index udx_keyword__name;
REINDEX
bric=3D# select * from keyword where name =3D'=BA=CF=C7=D1=C0=C7';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1 | =B1=B9=B9=E6=BA=F1=
| 1
(1 row)

Hmm. I tried putting your string into a UNICODE database and I got
ERROR: invalid byte sequence for encoding "UNICODE": 0xc7

So there's something funny happening here. What is your client_encoding
setting?

regards, tom lane

#21David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#20)
#22David E. Wheeler
david@kineticode.com
In reply to: Ian Lawrence Barwick (#19)
#23David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#21)
#24Ian Lawrence Barwick
barwick@gmail.com
In reply to: David E. Wheeler (#22)
#25David E. Wheeler
david@kineticode.com
In reply to: Ian Lawrence Barwick (#24)
#26Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Ian Lawrence Barwick (#19)
#27David E. Wheeler
david@kineticode.com
In reply to: David E. Wheeler (#5)
#28Tim Allen
tim@proximity.com.au
In reply to: Tom Lane (#20)
#29David E. Wheeler
david@kineticode.com
In reply to: Tim Allen (#28)
#30Joel
rees@ddcom.co.jp
In reply to: Ian Lawrence Barwick (#19)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: David E. Wheeler (#23)
#32David E. Wheeler
david@kineticode.com
In reply to: Tom Lane (#31)
#33David E. Wheeler
david@kineticode.com
In reply to: Ian Lawrence Barwick (#19)
#34Joel
rees@ddcom.co.jp
In reply to: David E. Wheeler (#33)
#35Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Joel (#34)
#36Joel
rees@ddcom.co.jp
In reply to: Tatsuo Ishii (#35)
#37Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Joel (#36)
#38Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Ian Lawrence Barwick (#19)
#39Peter Eisentraut
peter_e@gmx.net
In reply to: David E. Wheeler (#27)
#40David E. Wheeler
david@kineticode.com
In reply to: Joel (#34)
#41David E. Wheeler
david@kineticode.com
In reply to: Tatsuo Ishii (#35)
#42David E. Wheeler
david@kineticode.com
In reply to: Peter Eisentraut (#39)
#43Peter Eisentraut
peter_e@gmx.net
In reply to: David E. Wheeler (#42)
#44David E. Wheeler
david@kineticode.com
In reply to: Peter Eisentraut (#43)
#45Peter Eisentraut
peter_e@gmx.net
In reply to: David E. Wheeler (#44)
#46David E. Wheeler
david@kineticode.com
In reply to: Peter Eisentraut (#45)
#47Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#45)
#48Joel
rees@ddcom.co.jp
In reply to: David E. Wheeler (#40)
#49Joel
rees@ddcom.co.jp
In reply to: Peter Eisentraut (#45)
#50Joel
rees@ddcom.co.jp
In reply to: Tatsuo Ishii (#37)
#51Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Joel (#48)
#52Lincoln Yeoh
lyeoh@pop.jaring.my
In reply to: Peter Eisentraut (#45)
#53David E. Wheeler
david@kineticode.com
In reply to: Joel (#48)
#54David E. Wheeler
david@kineticode.com
In reply to: Tatsuo Ishii (#51)
#55David E. Wheeler
david@kineticode.com
In reply to: Lincoln Yeoh (#52)
#56Lincoln Yeoh
lyeoh@pop.jaring.my
In reply to: David E. Wheeler (#55)
#57Joel
rees@ddcom.co.jp
In reply to: Joel (#48)
#58David E. Wheeler
david@kineticode.com
In reply to: Joel (#57)
#59Joel
rees@ddcom.co.jp
In reply to: David E. Wheeler (#58)
#60David E. Wheeler
david@kineticode.com
In reply to: Joel (#57)
#61Michael Glaesemann
grzm@seespotcode.net
In reply to: David E. Wheeler (#60)
#62David E. Wheeler
david@kineticode.com
In reply to: Michael Glaesemann (#61)
#63Joel
rees@ddcom.co.jp
In reply to: David E. Wheeler (#62)