control character check in JSON type seems broken

Started by Shigeru Hanadaalmost 14 years ago4 messagesbugs
Jump to latest
#1Shigeru Hanada
shigeru.hanada@gmail.com

Hi,

A Japanese user found strange behavior in JSON type, so I'd like to
share the issue here. He simply tested casting a string literal to json
type, and got an unexpected error when he used a Japanese word as name
and/or value of JSON object.

In the example below, "キー" is a Japanese word which means "key", and
its first letter has byte sequence "0xe3 0x82 0xad" in UTF-8.

postgres=# select '{"キー":"value"}'::json;
ERROR: invalid input syntax for type json
LINE 1: select '{"キー":"value"}'::json;
^
DETAIL: line 1: Character " ・ must be escaped.

With some debugging, I found that the problem is in json_lex_string().

json_lex_string() misjudges that the token "キー" contains naked
(not-escaped) control character, because its first byte is 0xe3 and it's
-29 for signed char interpreting, and it's less than 32. We need to
cast to unsigned char (or use unsigned variable) here.

In addition, error message above seems corrupted in my environment.
Here we check not-escaped control character, so printing it with %c
formatting might break log files. How about using decimal or hex dump
in such cases?

Please see attached patch which contains workaround for this issue.

Regards,
--
Shigeru HANADA

Attachments:

fix_json_check.patchtext/plain; charset=Shift_JIS; name=fix_json_check.patchDownload+4-5
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shigeru Hanada (#1)
Re: control character check in JSON type seems broken

Shigeru Hanada <shigeru.hanada@gmail.com> writes:

json_lex_string() misjudges that the token "キー" contains naked
(not-escaped) control character, because its first byte is 0xe3 and it's
-29 for signed char interpreting, and it's less than 32. We need to
cast to unsigned char (or use unsigned variable) here.

Yeah, that's wrong.

In addition, error message above seems corrupted in my environment.
Here we check not-escaped control character, so printing it with %c
formatting might break log files. How about using decimal or hex dump
in such cases?

And so is that. IMO the error reporting in this module could stand to
be reviewed altogether for compliance with our message guidelines.
(For starters, why is it using errdetail_internal?) I refrained from
editorializing on-the-fly, but I'm not too pleased with what I saw.

Patch applied, thanks!

regards, tom lane

#3Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#2)
Re: control character check in JSON type seems broken

On Mon, Jun 4, 2012 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In addition, error message above seems corrupted in my environment.
Here we check not-escaped control character, so printing it with %c
formatting might break log files.  How about using decimal or hex dump
in such cases?

And so is that.  IMO the error reporting in this module could stand to
be reviewed altogether for compliance with our message guidelines.
(For starters, why is it using errdetail_internal?)  I refrained from
editorializing on-the-fly, but I'm not too pleased with what I saw.

Huh. I have no idea why I thought errdetail_internal was a good idea.
Should we just change all those to errdetail?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#3)
Re: control character check in JSON type seems broken

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jun 4, 2012 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

And so is that. �IMO the error reporting in this module could stand to
be reviewed altogether for compliance with our message guidelines.
(For starters, why is it using errdetail_internal?) �I refrained from
editorializing on-the-fly, but I'm not too pleased with what I saw.

Huh. I have no idea why I thought errdetail_internal was a good idea.
Should we just change all those to errdetail?

Yeah, they're clearly user-facing so I see no reason why they shouldn't
be translatable.

regards, tom lane