unexpected query behavior with UTF text

Started by Indra Heckenbachover 22 years ago4 messagesgeneral
Jump to latest
#1Indra Heckenbach
indra@macnica.com

I have recently come across an unusual behavior with Postgres 7.3.4 on a
Linux RH 9 system. My database has encoding set to "UNICODE", and the
table includes Japanese text. I'm trying to issue a query like this:

SELECT * FROM sales WHERE name='ja-text';

This query ignores all japanese characters in the comparison text. It
matches properly on ascii chars, but skips right over ja chars.

I tried using "LIKE" instead of "=", and this works fine! I would
expect "=" to do a character by character comparison, but it must not be.

I found a related issue on the mailing list, where locale setting was
causing something similar. However, my locale is set to "en_US.UTF-8",
which is the solution proposed to the other problem.

Is this a bug, or expected behavior? How can I correct this without
resorting to LIKE and escaping wildcard characters?

thanks!

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Indra Heckenbach (#1)
Re: unexpected query behavior with UTF text

Indra Heckenbach <indra@macnica.com> writes:

I have recently come across an unusual behavior with Postgres 7.3.4 on a
Linux RH 9 system. My database has encoding set to "UNICODE", and the
table includes Japanese text. I'm trying to issue a query like this:

SELECT * FROM sales WHERE name='ja-text';

This query ignores all japanese characters in the comparison text. It
matches properly on ascii chars, but skips right over ja chars.

Text = depends on strcoll(), which is locale-sensitive. It sure appears
that your locale is designed to ignore japanese characters :-(

I found a related issue on the mailing list, where locale setting was
causing something similar. However, my locale is set to "en_US.UTF-8",
which is the solution proposed to the other problem.

We have heard before that RH9's default locale setting is seriously
broken. This seems to be additional evidence for that opinion. I'd
recommend re-initdb'ing in locale C.

Also, you say "your locale", but how certain are you that that is the
database's locale, and not just the one in your own user environment?
It'd be a good idea to use pg_controldata to check the database settings.

regards, tom lane

#3Indra Heckenbach
indra@macnica.com
In reply to: Tom Lane (#2)
Re: unexpected query behavior with UTF text

Hi,

Thanks for the response. I actually have tried to re initdb with
locale=C and I got the same results. If my locale is ignoring Ja chars,
how can I change that? I expected that UTF-8 would consider all
characters, being a universal encoding. A colleague also tried the
same test with a different server (locale also set to utf8).

Text = depends on strcoll(), which is locale-sensitive. It sure appears
that your locale is designed to ignore japanese characters :-(

I found a related issue on the mailing list, where locale setting was
causing something similar. However, my locale is set to "en_US.UTF-8",
which is the solution proposed to the other problem.

We have heard before that RH9's default locale setting is seriously
broken. This seems to be additional evidence for that opinion. I'd
recommend re-initdb'ing in locale C.

Also, you say "your locale", but how certain are you that that is the
database's locale, and not just the oe in your own user environment?
It'd be a good idea to use pg_controldata to check the database settings.

I tried to mannually initialize it, and my initdb.i18n exports
LANG="en_US.UTF-8". Is there any other way I can check?

thanks,
Indra

Show quoted text

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

#4Indra Heckenbach
indra@macnica.com
In reply to: Tom Lane (#2)
Re: unexpected query behavior with UTF text

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
<title></title>
</head>
<body text="#000000" bgcolor="#ffffff">
Hi Tom,<br>
<br>
I solved the problem by doing<br>
<br>
initdb --locale=ja_JP.utf8<br>
<br>
Unfortunately,<br>
<br>
initdb --locale=en_US.utf8<br>
<br>
does not work. Do you have any idea why?&nbsp; I would think we should be
able to test for equality in any locale.<br>
<br>
thanks,<br>
Indra<br>
<br>
<br>
<br>
Tom Lane wrote:<br>
<blockquote type="cite" cite="mid11056.1066831136@sss.pgh.pa.us">
<pre wrap="">Indra Heckenbach <a class="moz-txt-link-rfc2396E" href="mailto:indra@macnica.com">&lt;indra@macnica.com&gt;</a> writes:
</pre>
<blockquote type="cite">
<pre wrap="">I have recently come across an unusual behavior with Postgres 7.3.4 on a
Linux RH 9 system. My database has encoding set to "UNICODE", and the
table includes Japanese text. I'm trying to issue a query like this:
</pre>
</blockquote>
<pre wrap=""><!---->
</pre>
<blockquote type="cite">
<pre wrap="">SELECT * FROM sales WHERE name='ja-text';
</pre>
</blockquote>
<pre wrap=""><!---->
</pre>
<blockquote type="cite">
<pre wrap="">This query ignores all japanese characters in the comparison text. It
matches properly on ascii chars, but skips right over ja chars.
</pre>
</blockquote>
<pre wrap=""><!---->
Text = depends on strcoll(), which is locale-sensitive. It sure appears
that your locale is designed to ignore japanese characters :-(

</pre>
<blockquote type="cite">
<pre wrap="">I found a related issue on the mailing list, where locale setting was
causing something similar. However, my locale is set to "en_US.UTF-8",
which is the solution proposed to the other problem.
</pre>
</blockquote>
<pre wrap=""><!---->
We have heard before that RH9's default locale setting is seriously
broken. This seems to be additional evidence for that opinion. I'd
recommend re-initdb'ing in locale C.

Also, you say "your locale", but how certain are you that that is the
database's locale, and not just the one in your own user environment?
It'd be a good idea to use pg_controldata to check the database settings.

regards, tom lane

</pre>
</blockquote>
<br>
</body>
</html>