Invalid EUC_JP char seq bug?

Started by Jean-Christian Imbeaultalmost 23 years ago4 messagesbugs
Jump to latest
#1Jean-Christian Imbeault
jc@mega-bucks.co.jp

I am using PHP with postgreSQL and I have been getting a few rare errors
while trying to do selects on a table containing EUC_JP text.

I thought it was a bug with PHP not recognizing a string as invalid
EUC_JP characters and wrote up a bug report but the PHP developers
assure me that the string that is generating the error is a valid EUC_JP
string (I don't know anything about character encodings so I am taking
them at their word and the fact that the string displays fine in my
browser as EUC_JP lends me to suspect they might be right).

The offending string is url encoded as such:

words=%8f%ac%90%ec%96%be%93%fa%8d%81

When I try and do a SELECT I get the following error:

select id from products where name like '??????'
Query failed: ERROR: Invalid EUC_JP character sequence found (0x8100)

(Where did the 0x00 come from??)

Can someone let me know if this truly is a bug in postgres?

Thanks,

Jean-Christian Imbeault

PS I have also had the error pop up with this string:

search_words=%B7%F6%BA%7E
select id from products where name like '??~'
Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)

#2Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Jean-Christian Imbeault (#1)
Re: Invalid EUC_JP char seq bug?

I am using PHP with postgreSQL and I have been getting a few rare errors
while trying to do selects on a table containing EUC_JP text.

I thought it was a bug with PHP not recognizing a string as invalid
EUC_JP characters and wrote up a bug report but the PHP developers
assure me that the string that is generating the error is a valid EUC_JP
string (I don't know anything about character encodings so I am taking
them at their word and the fact that the string displays fine in my
browser as EUC_JP lends me to suspect they might be right).

The offending string is url encoded as such:

words=%8f%ac%90%ec%96%be%93%fa%8d%81

When I try and do a SELECT I get the following error:

select id from products where name like '??????'
Query failed: ERROR: Invalid EUC_JP character sequence found (0x8100)

Since you did not show us exact query you send to PostgreSQL, I assume
the query passed to PostgreSQL is:

select id from products where name like 'string';

where string is "0x8fac90ec96be93fa8d81".

If the string is supposed to be an EUC_JP, it would be parsed as follows:

8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character)
ac90: a JIS 0212 character
ec96: a JIS 0208 character
be93: a JIS 0208 character
fa8d: a JIS 0208 character
81: ???

The last 0x81 is invalid if the string is assumed as EUC_JP.

(Where did the 0x00 come from??)

trailing '\0'.

Can someone let me know if this truly is a bug in postgres?

No.

Thanks,

Jean-Christian Imbeault

PS I have also had the error pop up with this string:

search_words=%B7%F6%BA%7E
select id from products where name like '??~'
Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)

This is definitly a bad EUC_JP.
--
Tatsuo Ishii

#3Jean-Christian Imbeault
jc@mega-bucks.co.jp
In reply to: Jean-Christian Imbeault (#1)
Re: Invalid EUC_JP char seq bug?

Tatsuo Ishii wrote:

Since you did not show us exact query you send to PostgreSQL

I can't show the exact query because it is generated by PHP. I can
however show you the code that generates the query:

$words = $_GET["words"];
$sql = "select id from products where name like '$words'";
$conn = pg_connect("host=$DB_IP port=5432 dbname=$DB_NAME user=postgres");
$res = pg_query($conn, $sql);

The GET query string was:

words=%8f%ac%90%ec%96%be%93%fa%8d%81

I think that PHP does some internal translation of this before passing
it on though.

I assume the query passed to PostgreSQL is:

select id from products where name like 'string';

Yes.

where string is "0x8fac90ec96be93fa8d81".

That I don't know.

If the string is supposed to be an EUC_JP, it would be parsed as follows:

8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character

[snip ...]

Ah ... so it is not an EUC-JP string but an SJIS string. Postgres was
right. That answers my question. Thanks!

PS I have also had the error pop up with this string:

search_words=%B7%F6%BA%7E
select id from products where name like '??~'
Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)

This is definitly a bad EUC_JP.

According to a PHP developer in my bug report
(http://bugs.php.net/bug.php?id=24309&edit=2):

"URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is
B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
believe encoding detection of mbstring works fine in this case.
B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
correct EUC-JP byte sequence."

I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
bug report to PHP and pointed this out. Hopefully the developer will see
that this sequence is incorrect EUC-JP and that PHP failed to detect this :)

I *knew* there was nothing wrong with Postgres ;)

Thanks!

Jean-Christian Imbeault

PS I posted to HACKERS a few weeks ago about another bug (a real one :)
in the EUC-JP translation having to do with the WAVE DASH. I'll repost
here on the BUGS list, could you let me know the status of that BUG? Thanks!

#4Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Jean-Christian Imbeault (#3)
Re: Invalid EUC_JP char seq bug?

search_words=%B7%F6%BA%7E
select id from products where name like '??~'
Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)

This is definitly a bad EUC_JP.

According to a PHP developer in my bug report
(http://bugs.php.net/bug.php?id=24309&edit=2):

"URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is
B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
believe encoding detection of mbstring works fine in this case.
B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
correct EUC-JP byte sequence."

I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
bug report to PHP and pointed this out. Hopefully the developer will see
that this sequence is incorrect EUC-JP and that PHP failed to detect this :)

In the EUC_JP encoding there are some rules:

1) if the first byte is 0x8e then second byte is a JIS 0201 character
and should be greater than 0x7f

2) else if the first byte is 0x8f then second and third byte is a JIS
0212 character and they should be greater than 0x7f

3) else if the first byte is greater than 0x7f then second and third
byte is a JIS 0208 character and they should be greater than 0x7f

4) else the byte is ASII and should be eqaul to or less than 0x7f

Apparently:

B7F6: this is ok. we can apply rule #3
BA7E: this is not good, since it satisfies non of rule #1 to #4

Thanks!

Jean-Christian Imbeault

PS I posted to HACKERS a few weeks ago about another bug (a real one :)
in the EUC-JP translation having to do with the WAVE DASH. I'll repost
here on the BUGS list, could you let me know the status of that BUG? Thanks!

Sorry for the delay. In EUC-JP <--> Unicode translation, WAVE DASH is
always a problem since there are several different mappings among
different vendors/standards. I think I need more time to solve this.
--
Tatsuo Ishii