BUG #1268: Two different Unicode chars are treated as equal in a query

Started by PostgreSQL Bugs Listover 21 years ago5 messagesbugs
Jump to latest
#1PostgreSQL Bugs List
pgsql-bugs@postgresql.org

The following bug has been logged online:

Bug reference: 1268
Logged by: Kent Tong

Email address: kent@cpttm.org.mo

PostgreSQL version: 7.4.5

Operating system: RedHat 9

Description: Two different Unicode chars are treated as equal in a
query

Details:

Steps:
1. Create a test database: "createdb -E Unicode -U postgres testdb".
2. Create a test table: "create table testtable (id varchar(100) primary
key);".
3. With JDBC, insert a record whose id contains unicode: "insert into
testtable values(<a unicode char whose code is 0x4e8c>);".
4. With JDBC, try to retrieve a record whose id contains a different unicde:
"select from testtable where id=<a unicode char whose code is 0x4e94>;". It
should not find any record but it finds the record created in step 3.

Here is the JUnit test case:

public class PgSQLTest extends TestCase {
private Connection conn;
protected void setUp() throws Exception {
conn = makeConnection();
}
protected void tearDown() throws Exception {
conn.close();
}
public void testChinese() throws Exception {
deleteAll();
insertRow();
PreparedStatement st =
conn.prepareStatement("select * from testtable where id=?");
try {
st.setString(1, "\u4e94");
ResultSet rs = st.executeQuery();
assertFalse(rs.next());
} finally {
st.close();
}
}

private void insertRow() throws SQLException {
PreparedStatement st =
conn.prepareStatement("insert into testtable values(?)");
st.setString(1, "\u4e8c");
st.executeUpdate();
st.close();
}
private void deleteAll() throws SQLException {
PreparedStatement st = conn.prepareStatement("delete from testtable");
st.executeUpdate();
st.close();
}
private Connection makeConnection()
throws ClassNotFoundException, SQLException {
Class.forName("org.postgresql.Driver");
Properties properties = new Properties();
properties.put("user", "postgres");
properties.put("password", "");
return DriverManager.getConnection(
"jdbc:postgresql://localhost/testdb",
properties);
}
}

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: PostgreSQL Bugs List (#1)
Re: BUG #1268: Two different Unicode chars are treated as equal in a query

"PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes:

Description: Two different Unicode chars are treated as equal in a
query

This would be a matter to take up with the maintainer of your locale
(which you didn't mention, but in any case it's a locale bug). We
just do what strcoll() tells us.

Note that it's possible this is a configuration error and not an
outright bug. Check to make sure that the locale you initdb'd
under is actually designed to work with UTF-8 data.

regards, tom lane

#3Kent Tong
kent@cpttm.org.mo
In reply to: Tom Lane (#2)
Re: BUG #1268: Two different Unicode chars are treated as

Tom Lane wrote:

"PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes:

Description: Two different Unicode chars are treated as equal in a
query

This would be a matter to take up with the maintainer of your locale
(which you didn't mention, but in any case it's a locale bug). We
just do what strcoll() tells us.

Thanks for the quick reply. The system locale is zh_TW.Big5. However,
I've tried setting it to "C" but the test case still fails.

In order to check if it's a locale bug, I've written a C program:

#include <locale.h>
#include <stdio.h>
#include <string.h>

int main() {
char *s1 = "\xe4\xba\x8c";
char *s2 = "\xe4\xba\x94";
setlocale(LC_ALL, "en.UTF-8");
//setlocale(LC_ALL, "zh.Big5"); //doesn't make any difference
printf("%d\n", strcoll(s1, s2));
return 0;
}

and compiled it and run it on that computer. It prints -1.
It means that strcoll is working.

Note that it's possible this is a configuration error and not an
outright bug. Check to make sure that the locale you initdb'd
under is actually designed to work with UTF-8 data.

Does it matter? The encoding provided to initdb is just
a default for the databases to be created in the future.
When I used createdb, I did specify "-E unicode".

--
Kent Tong, Msc, MCSE, SCJP, CCSA, Delphi Certified
Manager of IT Dept, CPTTM
Authorized training for Borland, Cisco, Microsoft, Oracle, RedFlag & RedHat

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kent Tong (#3)
Re: BUG #1268: Two different Unicode chars are treated as

Kent Tong <kent@cpttm.org.mo> writes:

Does it matter? The encoding provided to initdb is just
a default for the databases to be created in the future.

Yes it does, and you missed the point. I said *locale*, not *encoding*.
The LC_COLLATE and LC_CTYPE settings that prevail during initdb are
fixed and not alterable without re-initdb. (I agree that this sucks,
but that's how it is for now...)

Your test program doesn't prove a lot unless you are sure it's executing
under the same locale settings as the postmaster is running in.

regards, tom lane

#5Kent Tong
kent@cpttm.org.mo
In reply to: Tom Lane (#4)
Re: BUG #1268: Two different Unicode chars are treated as

Tom Lane wrote:

Yes it does, and you missed the point. I said *locale*, not *encoding*.
The LC_COLLATE and LC_CTYPE settings that prevail during initdb are
fixed and not alterable without re-initdb. (I agree that this sucks,
but that's how it is for now...)

You're right. After using:

initdb --locale zh_TW.utf8 /var/lib/pgsql/data

then it works fine!

Thanks again and sorry about any inconvenience.

--
Kent Tong, Msc, MCSE, SCJP, CCSA, Delphi Certified
Manager of IT Dept, CPTTM
Authorized training for Borland, Cisco, Microsoft, Oracle, RedFlag & RedHat