BUG #5532: Valid UTF8 sequence errors as invalid

Started by Mike Lewisalmost 16 years ago5 messagesbugs
Jump to latest
#1Mike Lewis
mikelikespie@gmail.com

The following bug has been logged online:

Bug reference: 5532
Logged by: Michael Lewis
Email address: mikelikespie@gmail.com
PostgreSQL version: 9.0 trunk
Operating system: OS X
Description: Valid UTF8 sequence errors as invalid
Details:

I'm using Python to sanitize my logs from invalid UTF8 characters before
COPYing them into postgres. I came across this one sequence that seems to
be valid UTF8 (in the extended range I believe).

It goes through both pythons encoding as well as iconv without an error and
is valid as far as my understanding of UTF8 goes so I am assuming it is a
bug.

Test case:

create table t (v varchar);
insert into t values (E'\xed\xbc\xad');

In bash you can do:

echo -e "\xed\xbc\xad" | iconv -f UTF-8 ; echo $?

to validate it

Thanks,
Mike

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mike Lewis (#1)
Re: BUG #5532: Valid UTF8 sequence errors as invalid

"Michael Lewis" <mikelikespie@gmail.com> writes:

I'm using Python to sanitize my logs from invalid UTF8 characters before
COPYing them into postgres. I came across this one sequence that seems to
be valid UTF8 (in the extended range I believe).

It is not valid.  See http://tools.ietf.org/html/rfc3629 --- a sequence
beginning with ED must have a second byte in the range 80-9F to be
legal, and this doesn't.  The example you give would decode as U+DF2D,
ie part of a surrogate pair, which is specifically disallowed in UTF8
--- you're supposed to code the original character directly, not via a
surrogate pair.  The primary reason for this rule is that otherwise
there are multiple ways to encode the same character, which can be a
security hazard.

It goes through both pythons encoding as well as iconv without an error

You should file bugs against those tools.

regards, tom lane

#3Mike Lewis
mikelikespie@gmail.com
In reply to: Tom Lane (#2)
Re: BUG #5532: Valid UTF8 sequence errors as invalid
It is not valid.  See http://tools.ietf.org/html/rfc3629 --- a sequence
beginning with ED must have a second byte in the range 80-9F to be
legal, and this doesn't.  The example you give would decode as U+DF2D,
ie part of a surrogate pair, which is specifically disallowed in UTF8
--- you're supposed to code the original character directly, not via a
surrogate pair.  The primary reason for this rule is that otherwise
there are multiple ways to encode the same character, which can be a
security hazard.

Thanks for the explanation. Unicode has always given me a hard time.

You should file bugs against those tools.

I certainly will. I apologize for filing the bug against postgres (I

suppose the "voting" method of figuring out which piece software is the
buggy one has failed me).

I've run into a fair amount of unicode errors when trying to copy in log
files. Would you recommend using bytea or another data type instead of text
or varchar... or at least copying to a staging table with bytea's and
filtering out invalid rows when moving it to the main table?

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mike Lewis (#3)
Re: BUG #5532: Valid UTF8 sequence errors as invalid

Mike Lewis <mikelikespie@gmail.com> writes:

I've run into a fair amount of unicode errors when trying to copy in log
files. Would you recommend using bytea or another data type instead of text
or varchar... or at least copying to a staging table with bytea's and
filtering out invalid rows when moving it to the main table?

My guess is that you're working with data that was originally
represented in UTF16, and you've used a tool that doesn't really know
what it's doing to convert to UTF8. A correct conversion has to reunite
surrogate pairs into wider-than-16-bit Unicode characters and then
encode those as single UTF8 sequences. Dunno if you can easily identify
the culprit, but fixing that conversion is the long-term solution.

(BTW, I should think that iconv or some related tool would have a
solution for fixing this miscoding; it's not an uncommon problem.)

regards, tom lane

#5Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Tom Lane (#4)
Re: BUG #5532: Valid UTF8 sequence errors as invalid

Tom Lane <tgl@sss.pgh.pa.us> writes:

(BTW, I should think that iconv or some related tool would have a
solution for fixing this miscoding; it's not an uncommon problem.)

I guess recode is handling that.

http://recode.progiciels-bpi.ca/manual/Universal.html#Universal

Regards,
--
dim