Multi-byte character bug
Two bugs have been found in the SQL parser and multibyte character support:
1. "Problem connecting to database: java.sql.SQLException: ERROR:
Invalid EUC_TW character sequence found (0xb27a)" was reported when using
the JDBC driver to insert a record, and a similar error was reported when
using the ODBC driver and psql. Since the automatic client-to-server
conversion should convert the character to a valid EUC_TW character, this
is a bug.
2. When inserting a record containing the Chinese character "餐", the SQL
parser reports something like 'Problem connecting to database:
java.sql.SQLException: ERROR: parser: parse error at or near "4567891"'
(similar in JDBC and ODBC), and the error "unterminated string" is
reported when using psql.
I've found the problem exists from 7.1.x through 7.2.*.
Two bugs have been found in the SQL parser and multibyte character support:
1. "Problem connecting to database: java.sql.SQLException: ERROR:
Invalid EUC_TW character sequence found (0xb27a)" was reported when using
the JDBC driver to insert a record, and a similar error was reported when
using the ODBC driver and psql. Since the automatic client-to-server
conversion should convert the character to a valid EUC_TW character, this
is a bug.
How did you set the auto-conversion settings for psql? I suspect you
did something wrong with it.
2. When inserting a record containing the Chinese character "餐", the SQL
parser reports something like 'Problem connecting to database:
java.sql.SQLException: ERROR: parser: parse error at or near "4567891"'
(similar in JDBC and ODBC), and the error "unterminated string" is
reported when using psql.
I've found the problem exists from 7.1.x through 7.2.*.
What is the encoding for "chinese char"? You need to give us more
info.
--
Tatsuo Ishii
P.S. Please don't post with non-ASCII characters. If you need to show
non-ASCII characters, please give them in hex form.
Two bugs have been found in the SQL parser and multibyte character support:
What is the encoding for "chinese char"? You need to give us more
info.
By Chinese here, I mean BIG5-encoded characters; BIG5 is a widely used
encoding in HK and Taiwan.
My setup:
Db encoding: EUC_TW
Client (JDBC / ODBC) Encoding: BIG5
JDBC: I supplied the parameter 'charSet=Big5' to the
connection string
ODBC: my locale (Chinese Win2000 machine) is Chinese
Taiwan
Client application: Tomcat4 jsp page (see the attached)
App / DB Server: Red Hat 7.3 Linux + PostgreSQL RPM set 7.2.1-2PGDG
(downloaded binary RPMs) + Tomcat 4
App / DB Server locale: zh_TW.Big5
JDBC driver: pgjdbc2.jar
Client Machine: Win2000 Chinese (Taiwan) Version with SP2 + I.E.
(jsp) + Delphi SQL Explorer (ODBC)
Client Machine locale: Chinese (Taiwan)
1. 'Problem connecting to database: java.sql.SQLException: ERROR:
Invalid EUC_TW character sequence found (0xb27a)' was reported when using
the JDBC driver to insert a record, and a similar error was reported when
using the ODBC driver and psql. Since the automatic client-to-server
conversion should convert the character to a valid EUC_TW character, this
is a bug.
How did you set the auto-conversion settings for psql? I suspect you
did something wrong with it.
I've done a new check on it; I found the JDBC and ODBC drivers still report
the error message but psql does not (maybe, as you said, I followed a wrong
procedure). However, the problem is still there: why do JDBC and ODBC still
report the error?
I just tried some Chinese words, but there may be other characters that
will also cause the problem.
I know Tomcat 4 by default returns the request parameters in ISO-8859-1,
and therefore I've added the code
<%@ page contentType="text/html; charset=Big5"%>
<%
request.setCharacterEncoding("BIG5");
%>
to the JSP page and dumped the actual SQL posted to the PostgreSQL server
to make sure the SQL is correct; it's attached (please see the attached
file: offence1.zip).
2. When inserting a record containing the Chinese character "餐", the SQL
parser reports something like 'Problem connecting to database:
java.sql.SQLException: ERROR: parser: parse error at or near "4567891"'
(similar in JDBC and ODBC), and the error 'unterminated string' is
reported when using psql.
The character code is 0xc05c, in which the second byte is actually a "\"
(backslash)
(please see the attached file: offence2.zip)
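A quick byte-level check illustrates why this particular character trips the parser. The following standalone Java sketch is a hypothetical illustration (not code from the driver or the server) showing that the trailing byte of Big5 0xc05c is the ASCII backslash:

```java
// Big5 0xc05c: the trailing byte 0x5c collides with the ASCII
// backslash, which SQL string literals use as an escape character.
public class Big5Backslash {
    public static void main(String[] args) {
        byte[] big5Char = { (byte) 0xc0, (byte) 0x5c };
        char trail = (char) (big5Char[1] & 0xff);
        System.out.println(trail == '\\'); // prints true

        // Inside a SQL literal the raw bytes are 0x27 0xc0 0x5c 0x27.
        // A byte-at-a-time lexer that honours backslash escapes reads
        // 0x5c 0x27 as an escaped quote (\'), never sees the literal
        // close, and reports "unterminated string" -- the psql error
        // described above.
    }
}
```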
I've found the problem exists from 7.1.x through 7.2.*.
By Chinese here, I mean BIG5-encoded characters; BIG5 is a widely used
encoding in HK and Taiwan.
OK. PostgreSQL does support BIG5 on the *frontend* side.
I've done a new check on it; I found the JDBC and ODBC drivers still report
the error message but psql does not (maybe, as you said, I followed a wrong
procedure). However, the problem is still there: why do JDBC and ODBC still
report the error?
psql works but JDBC and ODBC do not? The fact that psql is working
tells us that at least the BIG5 <--> EUC_TW conversion works fine. It seems
something is wrong with the JDBC and ODBC settings. Unfortunately I'm not
a Java or ODBC expert at all. Sorry...
The character code is 0xc05c, in which the second byte is actually a "\"
(backslash)
(please see the attached file: offence2.zip)
There's no character code in EUC_TW (CNS 11643-1992) corresponding to
Big5 0xc05c. That's why PostgreSQL complains.
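The missing mapping can be checked outside PostgreSQL. Here is a hedged Java sketch, assuming Java's charset names ("Big5", "x-EUC-TW"); x-EUC-TW availability depends on the JRE, and this is an illustration, not the server's conversion code:

```java
import java.nio.charset.Charset;

// Decode the offending Big5 bytes, then ask whether the resulting
// character can be represented in EUC-TW at all.
public class RoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] big5 = { (byte) 0xc0, (byte) 0x5c };
        String s = new String(big5, "Big5");
        System.out.println("decoded to " + s.length() + " char(s)");
        if (Charset.isSupported("x-EUC-TW")) {
            boolean ok = Charset.forName("x-EUC-TW").newEncoder().canEncode(s);
            System.out.println("representable in EUC-TW: " + ok);
        } else {
            System.out.println("x-EUC-TW charset not available in this JRE");
        }
    }
}
```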
--
Tatsuo Ishii
-----Original Message-----
From: pgsql-bugs-owner@postgresql.org
[mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Tatsuo Ishii
Sent: Wednesday, July 31, 2002 1:18 PM
To: richso@i-cable.com
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] Multi-byte character bug
By Chinese here, I mean BIG5-encoded characters; BIG5 is a widely used
encoding in HK and Taiwan.
OK. PostgreSQL does support BIG5 on the *frontend* side.
I've done a new check on it; I found the JDBC and ODBC drivers still report
the error message but psql does not (maybe, as you said, I followed a wrong
procedure). However, the problem is still there: why do JDBC and ODBC still
report the error?
psql works but JDBC and ODBC do not? The fact that psql is working
tells us that at least the BIG5 <--> EUC_TW conversion works fine. It seems
something is wrong with the JDBC and ODBC settings. Unfortunately I'm not
a Java or ODBC expert at all. Sorry...
OK! I will post to the JDBC and ODBC lists for help!
The character code is 0xc05c, in which the second byte is actually a "\"
(backslash)
(please see the attached file: offence2.zip)
There's no character code in EUC_TW (CNS 11643-1992) corresponding to
Big5 0xc05c. That's why PostgreSQL complains.
But I've created another database using the MULE_INTERNAL encoding, and the
same error is reported. Why?
Why doesn't Postgres directly support BIG5 on the server side? BIG5 is the
main encoding used by the Traditional Chinese communities, i.e. HK &
Taiwan. As EUC_TW does not have corresponding characters for every BIG5
character, this will seriously deter the Traditional Chinese communities
from using PostgreSQL!
--
Tatsuo Ishii
There's no character code in EUC_TW (CNS 11643-1992) corresponding to
Big5 0xc05c. That's why PostgreSQL complains.
But I've created another database using the MULE_INTERNAL encoding, and the
same error is reported. Why?
Because the MULE_INTERNAL representation of such text is actually a
"leading character" + the EUC_TW code; i.e. converting from Big5 still goes
through the same Big5-to-EUC_TW mapping, so the same unmappable character
fails.
Why doesn't Postgres directly support BIG5 on the server side?
It's for purely technical reasons. Handling encodings that contain bytes
< 0x80 in the second (or third) byte of a character confuses our SQL
parser. I think it's not impossible for the parser to handle Big5, but if
we made such a change, the parser would no longer be able to handle other
encodings. If you have a good idea to overcome these problems, you are
welcome to share it.
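The constraint can be sketched as a simple predicate (an illustrative sketch of the reasoning above, not PostgreSQL's actual lexer): in EUC_TW every byte of a multibyte character is >= 0x80, so ASCII metacharacters such as the quote or the backslash can never appear inside one, while Big5 trailing bytes also span 0x40-0x7e:

```java
// Can a byte-oriented ASCII lexer skip a two-byte character without
// mistaking one of its bytes for an ASCII metacharacter? Only if
// neither byte falls in the ASCII range (below 0x80).
public class TrailingBytes {
    static boolean safeForAsciiLexer(int leadByte, int trailByte) {
        return leadByte >= 0x80 && trailByte >= 0x80;
    }

    public static void main(String[] args) {
        // An EUC_TW character: both bytes lie in 0xa1-0xfe.
        System.out.println(safeForAsciiLexer(0xa4, 0xa1)); // prints true
        // Big5 0xc05c: the trailing byte 0x5c is ASCII '\'.
        System.out.println(safeForAsciiLexer(0xc0, 0x5c)); // prints false
    }
}
```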
as BIG5 is the main encoding used by the Traditional Chinese communities,
i.e. HK & Taiwan? As EUC_TW does not have corresponding characters for
every BIG5 character, this will seriously deter the Traditional Chinese
communities from using PostgreSQL!
Just curious: why do people living in those areas prefer Big5 over
EUC_TW? I thought EUC_TW (or CNS 11643-1992) was defined by the
government in Taiwan. Is there any technical superiority in Big5?
Or maybe it's "don't know why, but many people just use Big5" :-)
--
Tatsuo Ishii