UTF-8 and LIKE vs =
Hi All,
I'm having some trouble with multibyte characters and LIKE. We've been
using LIKE instead of = for string queries for a long time, as it gives
us flexibility to use wildcards such as "%" when we need to and get the
same results as with = by not using them. But I've just found that it
sometimes doesn't work properly:
bric=# select version();
version
---------------------------------------------------------------------------------
PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
(1 row)
bric=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)
bric=# select * from keyword where name LIKE '북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)
bric=# select * from keyword where name ~ '^북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)
Any idea why = works here and LIKE and ~ wouldn't?
TIA,
David
David Wheeler wrote:
Any idea why = works here and LIKE and ~ wouldn't?
Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
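Peter's distinction can be checked outside the database. Here is a minimal Python sketch (Python is used purely for illustration; it is not part of the original setup) showing that the two keywords from the thread are distinct UTF-8 byte sequences, so any byte-wise comparison, which is what you get under the C locale, must tell them apart, whatever the locale's strcoll() says:

```python
# The two keywords from the thread, as Unicode strings.
a = '국방비'
b = '북한의'

# Their UTF-8 encodings are clearly different byte sequences:
print(a.encode('utf-8').hex())  # → eab5adebb0a9ebb984
print(b.encode('utf-8').hex())  # → ebb681ed959cec9d98
print(a.encode('utf-8') == b.encode('utf-8'))  # → False
```

Since LIKE without wildcards effectively compares the pattern character by character, it sees these as different strings, while = defers to whatever equivalence the locale's collation defines.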
On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote:
Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.
Thanks. So I need to set the locale to C and then LIKE will work
properly? How do I go about doing that? I can see these options:
LC_COLLATE
String sort order
LC_CTYPE
Character classification (What is a letter? The upper-case equivalent?)
LC_MESSAGES
Language of messages
LC_MONETARY
Formatting of currency amounts
LC_NUMERIC
Formatting of numbers
LC_TIME
Formatting of dates and times
Is one of these the one I need to set?
Thanks,
David
On Mon, 23 Aug 2004 12:41:30 -0700, David Wheeler <david@kineticode.com> wrote:
(...)
bric=# select version();
version
---------------------------------------------------------------------------------
PostgreSQL 7.4.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
(1 row)

bric=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)
er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?
FWIW (on 7.4.3):
test=# select * from t1 ;
id | value
----+--------
1 | 日本
2 | 日本語
3 | 北海道
(3 rows)
test=# select * from t1 where value ~ '日';
id | value
----+--------
1 | 日本
2 | 日本語
(2 rows)
test=# select * from t1 where value like '日%';
id | value
----+--------
1 | 日本
2 | 日本語
(2 rows)
test=# select * from t1 where value like '北海%';
id | value
----+--------
3 | 北海道
(1 row)
Ian Barwick
barwick@gmail.net
On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?
Yes, it means that = is doing the wrong thing!!
I noticed this because I had a query that was looking in the keyword
table for an existing record using LIKE. If it didn't find it, it
inserted it. But the inserts were giving me an error because the name
column has a UNIQUE index on it. Could it be that the index and the =
operator are comparing bytes, and that '국방비' and '북한의' have the same
bytes but different characters??
If so, this is a pretty serious problem. How can I get = and the
indices to use character semantics rather than byte semantics? I also
need to be able to store data in different languages in the database
(and in the same column!), but all in Unicode.
TIA,
David
On Mon, 23 Aug 2004 14:04:05 -0700, David Wheeler <david@kineticode.com> wrote:
On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!
I noticed this because I had a query that was looking in the keyword
table for an existing record using LIKE. If it didn't find it, it
inserted it. But the inserts were giving me an error because the name
column has a UNIQUE index on it. Could it be that the index and the =
operator are comparing bytes, and that '국방비' and '북한의' have the same
bytes but different characters??

If so, this is a pretty serious problem. How can I get = and the
indices to use character semantics rather than byte semantics? I also
need to be able to store data in different languages in the database
(and in the same column!), but all in Unicode.
I don't know what the problem is, but you might want to check the
client encoding settings, and the encoding your characters are
arriving in (remembering all the time, in Postgres "UNICODE" really
means UTF-8).
If you're using Perl (I'm guessing this is Bricolage-related) the
"_utf8_on"-ness of strings might be worth checking too, and also the
"pg_enable_utf8" flag in DBD::Pg.
Ian Barwick
barwick@gmail.net
On Mon, 2004-08-23 at 16:43, David Wheeler wrote:
On Aug 23, 2004, at 1:22 PM, Peter Eisentraut wrote:
Because LIKE does a character-by-character matching and = uses the
operating system locale, which could do anything. If you set the
locale to C, you should get matching results. Which one is "better"
depends on the semantics of the language, which I cannot judge here.

Thanks. So I need to set the locale to C and then LIKE will work
properly? How do I go about doing that? I can see these options:

LC_COLLATE
String sort order
LC_CTYPE
Character classification (What is a letter? The upper-case equivalent?)
LC_MESSAGES
Language of messages
LC_MONETARY
Formatting of currency amounts
LC_NUMERIC
Formatting of numbers
LC_TIME
Formatting of dates and times

Is one of these the one I need to set?
initdb is your friend. (well, not really, but that's where you're headed)
Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
On Aug 23, 2004, at 2:25 PM, Ian Barwick wrote:
I don't know what the problem is, but you might want to check the
client encoding settings, and the encoding your characters are
arriving in (remembering all the time, in Postgres "UNICODE" really
means UTF-8).

If you're using Perl (I'm guessing this is Bricolage-related) the
"_utf8_on"-ness of strings might be worth checking too, and also the
"pg_enable_utf8" flag in DBD::Pg.
Bricolage is getting all its content at UTF-8. It has been working
beautifully for some time. I tried setting the utf8 flag on the
variable passed to the query, but it made no difference.
I think that LIKE is doing the right thing, and = is not. And I need to
find out how to get = to do the right thing. If I need to dump my
database and run initdb to use C for LC_COLLATE, then, feh, I will.
Right now I have:
LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8
Regards,
David
On Aug 23, 2004, at 2:31 PM, Robert Treat wrote:
initdb is your friend. (well, not really, but that's where your headed)
Yes, that's what I'm beginning to suspect.
Cheers,
David
On Aug 23, 2004, at 2:46 PM, David Wheeler wrote:
I think that LIKE is doing the right thing, and = is not. And I need
to find out how to get = to do the right thing. If I need to dump my
database and run initdb to use C for LC_COLLATE, then, feh, I will.
Right now I have:

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8
And testing on another box with these set to "C", it seems to work
properly.
*sigh* Nothing like dumping and restoring a 2.7 GB database to keep me
engaged in what I'm doing, eh?
Regards,
David
David Wheeler <david@kineticode.com> writes:
On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?
Yes, it means that = is doing the wrong thing!!
I have seen this happen in situations where the strings contained
character sequences that were illegal according to the encoding that the
locale thought was in force. (It seems that strcoll() will return more
or less random results in such cases...) In particular, given that you
have
LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8
you are at risk if the data is not legal UTF-8 strings.
The real question therefore is whether you have the database encoding
set correctly --- ie, is it UNICODE (== UTF8)? If not then it may well
be that Postgres is presenting strings to strcoll() that the latter will
choke on.
regards, tom lane
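Tom's failure mode is easy to reproduce outside Postgres. The raw bytes preserved (quoted-printable) in the archived copies of this thread are EUC-KR, not UTF-8, and a UTF-8 decoder rejects them outright, which is exactly the kind of input that makes strcoll() under a UTF-8 locale behave unpredictably. A Python sketch (illustrative only; the EUC-KR reading is inferred from the byte sequences in the archived messages):

```python
# Bytes from the archived message: EUC-KR for 국방비, not valid UTF-8.
raw = b'\xb1\xb9\xb9\xe6\xba\xf1'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('invalid UTF-8:', e.reason)

# Decoded with the encoding it was actually written in:
print(raw.decode('euc-kr'))  # → 국방비
```

If bytes like these ever made it into a column, an en_US.UTF-8 strcoll() would be comparing sequences it considers malformed, with effectively undefined results.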
On Mon, Aug 23, 2004, at 23:04, David Wheeler wrote:
On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!
The collation rules of your (and my) locale say that these strings are
the same:
[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$
Make sure that you have initdb'd the database under the right locale.
There's not much PostgreSQL can do if strcoll() says that the strings
are equal.
--
Markus Bertheau <twanger@bluetwanger.de>
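Markus's uniq demonstration shows what strcoll() does under his locale. For contrast, here is a sketch (in Python, purely for illustration) of what the C locale does: comparison is plain byte order over the encoded form, so a uniq-style pass that drops adjacent duplicates keeps both keywords:

```python
lines = ['국방비', '북한의']

# C-locale ordering is byte-wise comparison of the UTF-8 encoding:
c_sorted = sorted(lines, key=lambda s: s.encode('utf-8'))
print(c_sorted)  # → ['국방비', '북한의']

# A uniq-style pass over C-locale-sorted input keeps both lines,
# because the byte sequences are unequal:
deduped = [s for i, s in enumerate(c_sorted)
           if i == 0 or s != c_sorted[i - 1]]
print(len(deduped))  # → 2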
On Aug 23, 2004, at 3:44 PM, Tom Lane wrote:
Yes, it means that = is doing the wrong thing!!
I have seen this happen in situations where the strings contained
character sequences that were illegal according to the encoding that
the
locale thought was in force. (It seems that strcoll() will return more
or less random results in such cases...) In particular, given that you
have

LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

you are at risk if the data is not legal UTF-8 strings.
But is it possible to store non-UTF-8 data in a UNICODE database?
The real question therefore is whether you have the database encoding
set correctly --- ie, is it UNICODE (== UTF8)? If not then it may well
be that Postgres is presenting strings to strcoll() that the latter
will
choke on.
The database is UNICODE.
$ psql -U postgres -l
List of databases
Name | Owner | Encoding
-----------+----------+-----------
bric | postgres | UNICODE
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII
(3 rows)
I plan to dump it, run initdb with LC_COLLATE and LC_CTYPE both set to
"C", and restore the database and see if that helps.
Thanks,
David
On Aug 23, 2004, at 3:46 PM, Markus Bertheau wrote:
The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$
Interesting.
Make sure that you have initdb'd the database under the right locale.
There's not much PostgreSQL can do if strcoll() says that the strings
are equal.
Well, I have data from a number of different locales in the same
database. I'm hoping that setting the locale to "C" will do the trick.
It seems to work properly on my Mac:
sharky=# select * from keyword where name = '국방비';
id | name | screen_name | sort_name | active
----+--------+-------------+-----------+--------
0 | 국방비 | 국방비 | 국방비 | 1
(1 row)
sharky=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)
sharky=# select * from keyword where name like '북한의';
id | name | screen_name | sort_name | active
----+------+-------------+-----------+--------
(0 rows)
sharky=# select * from keyword where lower(name) like '국방비';
id | name | screen_name | sort_name | active
----+--------+-------------+-----------+--------
0 | 국방비 | 국방비 | 국방비 | 1
(1 row)
Regards,
David
David Wheeler <david@kineticode.com> writes:
But is it possible to store non-UTF-8 data in a UNICODE database?
In theory not ... but I think there was a discussion earlier that
concluded that our check for encoding validity is not airtight ...
regards, tom lane
On Aug 23, 2004, at 3:59 PM, Tom Lane wrote:
But is it possible to store non-UTF-8 data in a UNICODE database?
In theory not ... but I think there was a discussion earlier that
concluded that our check for encoding validity is not airtight ...
Well, if it were mostly right, I wouldn't expect this issue to come up
for me as often as it does. If, OTOH, the encoding validity check leaks
like a sieve, then I might indeed have a bigger problem.
Is the encoding check fixed in 8.0beta1?
Thanks,
David
David Wheeler <david@kineticode.com> writes:
Is the encoding check fixed in 8.0beta1?
[ looks back at discussion... ] Actually I misremembered --- the
discussion was about how we would *reject* legal UTF-8 codes that are
more than 2 bytes long. So the code is broken, but not in the direction
that would cause your problem. Time for another theory.
Is the problem query using an index? If so, does REINDEX help?
regards, tom lane
On Aug 23, 2004, at 4:08 PM, Tom Lane wrote:
[ looks back at discussion... ] Actually I misremembered --- the
discussion was about how we would *reject* legal UTF-8 codes that are
more than 2 bytes long. So the code is broken, but not in the
direction
that would cause your problem. Time for another theory.
Whew!
Is the problem query using an index? If so, does REINDEX help?
Doesn't look like it:
bric=# reindex index udx_keyword__name;
REINDEX
bric=# select * from keyword where name ='북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)
That's still giving me an invalid row for the value I passed to it
(note that the value of the "name" column is different than the value I
queried for).
Regards,
David
On Tue, 24 Aug 2004 00:46:50 +0200, Markus Bertheau
<twanger@bluetwanger.de> wrote:
On Mon, Aug 23, 2004, at 23:04, David Wheeler wrote:
On Aug 23, 2004, at 1:58 PM, Ian Barwick wrote:
er, the characters in "name" don't seem to match the characters in the
query - '국방비' vs. '북한의' - does that have any bearing?

Yes, it means that = is doing the wrong thing!!
The collation rules of your (and my) locale say that these strings are
the same:

[markus@teetnang markus]$ cat > t
국방비
북한의
[markus@teetnang markus]$ uniq t
국방비
[markus@teetnang markus]$
wild speculation in need of a Korean speaker, but:
ian@linux:~/tmp> cat j.txt
テスト
환경설
전검색
웹문서
국방비
북한의
てすと
ian@linux:~/tmp> uniq j.txt
テスト
환경설
てすと
All but the first and last lines are random Korean (Hangul)
characters. Evidently our respective locales think all Hangul strings
of the same length are identical, which is very probably not the
case...
Ian Barwick
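Ian's suspicion that the locale is collapsing distinct Hangul syllables can be checked against the Unicode character database. A small Python sketch (illustrative, not part of the thread's setup) showing that the first syllables of the two keywords are distinct code points with distinct canonical names, so no correct collation should treat arbitrary equal-length Hangul strings as identical:

```python
import unicodedata

# First syllable of each keyword: 국 (of 국방비) and 북 (of 북한의).
for ch in '국북':
    print(f'U+{ord(ch):04X} {unicodedata.name(ch)}')
# → U+AD6D HANGUL SYLLABLE GUG
# → U+BD81 HANGUL SYLLABLE BUG
```

The glibc en_US.UTF-8 collation of that era simply had no weights for Hangul, which is consistent with uniq merging them.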
David Wheeler <david@kineticode.com> writes:
Is the problem query using an index? If so, does REINDEX help?
Doesn't look like it:
bric=# reindex index udx_keyword__name;
REINDEX
bric=# select * from keyword where name = '북한의';
id | name | screen_name | sort_name | active
------+--------+-------------+-----------+--------
1218 | 국방비 | 국방비 | 국방비 | 1
(1 row)
Hmm. I tried putting your string into a UNICODE database and I got
ERROR: invalid byte sequence for encoding "UNICODE": 0xc7
So there's something funny happening here. What is your client_encoding
setting?
regards, tom lane
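Tom's client_encoding question points at the likely mismatch: if a client sends EUC-KR bytes while claiming UTF-8, the server's validator chokes partway through the string. A hedged Python sketch of the fix (illustrative; the EUC-KR reading of these bytes is inferred from the archived messages): decode with the real source encoding before handing the data to a UTF-8 database, or equivalently set client_encoding so the server does the conversion.

```python
# Bytes from the archived message: EUC-KR for 북한의.
raw = b'\xba\xcf\xc7\xd1\xc0\xc7'

# Sent as-is to a UTF-8 database, these bytes are invalid input.
# The fix is to transcode with the real source encoding first:
text = raw.decode('euc-kr')
utf8 = text.encode('utf-8')
print(utf8.hex())  # → ebb681ed959cec9d98
```

Once every client either transcodes like this or declares its true encoding, the strings stored in the UNICODE database are legal UTF-8 and strcoll() gets well-formed input.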