Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

Started by Ivan Zolotukhinover 18 years ago16 messagesgeneral
Jump to latest
#1Ivan Zolotukhin
ivan.zolotukhin@gmail.com

Hello,

Imagine a web application that process text search queries from
clients. If one types a text search query in a browser it then sends
proper UTF-8 characters and application after all needed processing
(escaping, checks, etc) passes it to database. But if one modifies URL
of the query adding some trash non-UTF-8 characters, database raises
an error: invalid byte sequence for encoding "UTF8".

What is the best practice to process such a broken strings before
passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
characters?

--
Regards,
Ivan

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Ivan Zolotukhin (#1)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On Wed, Aug 15, 2007 at 03:41:30PM +0400, Ivan Zolotukhin wrote:

Hello,

Imagine a web application that process text search queries from
clients. If one types a text search query in a browser it then sends
proper UTF-8 characters and application after all needed processing
(escaping, checks, etc) passes it to database. But if one modifies URL
of the query adding some trash non-UTF-8 characters, database raises
an error: invalid byte sequence for encoding "UTF8".

What is the best practice to process such a broken strings before
passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
characters?

Well, the query as given by the user is invalid, so returning an error
message complaining about the invalid byte sequence seems entirely
reasonable.

I don't see any reason to try and be smart. There's no way you can
"fix" the query.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.

#3Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Ivan Zolotukhin (#1)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote:

Hello,

Imagine a web application that process text search queries from
clients. If one types a text search query in a browser it then sends
proper UTF-8 characters and application after all needed processing
(escaping, checks, etc) passes it to database. But if one modifies URL
of the query adding some trash non-UTF-8 characters, database raises
an error: invalid byte sequence for encoding "UTF8".

What is the best practice to process such a broken strings before
passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
characters?

Could you do not do this from the application itself, before sending
it to the DB server? mb_convert_encoding() in PHP for instance?

#4Ivan Zolotukhin
ivan.zolotukhin@gmail.com
In reply to: Martijn van Oosterhout (#2)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

Hello,

Well, PostgreSQL is correct entirely, I would post this message to the
-hackers list otherwise :) The question was rather about application
processing of user input not about change of database reaction on
broken UTF-8 string. But I am 100% sure one should fix the input in
this case since web site user can see some bad error (even if
application caught this SQL exception for instance) otherwise.

--
Regards,
Ivan

Show quoted text

On 8/15/07, Martijn van Oosterhout <kleptog@svana.org> wrote:

On Wed, Aug 15, 2007 at 03:41:30PM +0400, Ivan Zolotukhin wrote:

Hello,

Imagine a web application that process text search queries from
clients. If one types a text search query in a browser it then sends
proper UTF-8 characters and application after all needed processing
(escaping, checks, etc) passes it to database. But if one modifies URL
of the query adding some trash non-UTF-8 characters, database raises
an error: invalid byte sequence for encoding "UTF8".

What is the best practice to process such a broken strings before
passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
characters?

Well, the query as given by the user is invalid, so returning an error
message complaining about the invalid byte sequence seems entirely
reasonable.

I don't see any reason to try and be smart. There's no way you can
"fix" the query.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFGwuvTIB7bNG8LQkwRAudJAJ9c8gvUQ25/S54gtJAPdqMOd81pNwCfUeLi
JoWU92WJKZ1qM3UMRG5Zn0Y=
=dPLv
-----END PGP SIGNATURE-----

#5Ivan Zolotukhin
ivan.zolotukhin@gmail.com
In reply to: Phoenix Kiula (#3)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

Hello,

Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE",
$str); when preparing string for SQL query and it worked. There's
probably a better way in PHP to achieve this: simply change default
values in php.ini for these parameters:

mbstring.encoding_translation = On
mbstring.substitute_character = none

and broken symbols will be automatically stripped off from the input
and output. But I am interested in general solution and better
practices anyway...

--
Regards,
Ivan

Show quoted text

On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote:

Hello,

Imagine a web application that process text search queries from
clients. If one types a text search query in a browser it then sends
proper UTF-8 characters and application after all needed processing
(escaping, checks, etc) passes it to database. But if one modifies URL
of the query adding some trash non-UTF-8 characters, database raises
an error: invalid byte sequence for encoding "UTF8".

What is the best practice to process such a broken strings before
passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
characters?

Could you do not do this from the application itself, before sending
it to the DB server? mb_convert_encoding() in PHP for instance?

#6Vick Khera
vivek@khera.org
In reply to: Ivan Zolotukhin (#1)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On Aug 15, 2007, at 7:41 AM, Ivan Zolotukhin wrote:

What is the best practice to process such a broken strings before
passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
characters?

This rings of GIGO... if your user enters garbage, how do you know
what they wanted? You don't. You tell them they entered garbage,
else the result will be garbage. Data validation... learn from
microsoft's mistakes :-)

#7Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Ivan Zolotukhin (#5)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote:

Hello,

Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE",
$str); when preparing string for SQL query and it worked. There's
probably a better way in PHP to achieve this: simply change default
values in php.ini for these parameters:

mbstring.encoding_translation = On
mbstring.substitute_character = none

and broken symbols will be automatically stripped off from the input
and output.

Sadly, they don't always do that, not with Asian scripts.

And I do not completely agree, like the other poster suggested, with
the concept of GIGO. Sometimes you want the end-user's experience to
be seamless. For example, in one of our web sites, we allow users to
submit text through a bookmarklet, where the title of the webpage
comes in rawurlencoded format. We try to rawurldecode() it on our end
but most of the times the Asian interpretation is wrong. We have all
the usual mbstring settings in php.ini. In this scenario, the user did
not enter any garbage. Our application should have the ability to
recognize the text. We do what we can with mb_convert...etc, but the
database just throws an error.

PGSQL really needs to get with the program when it comes to utf-8 input.

#8Scott Marlowe
scott.marlowe@gmail.com
In reply to: Phoenix Kiula (#7)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote:

Hello,

Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE",
$str); when preparing string for SQL query and it worked. There's
probably a better way in PHP to achieve this: simply change default
values in php.ini for these parameters:

mbstring.encoding_translation = On
mbstring.substitute_character = none

and broken symbols will be automatically stripped off from the input
and output.

Sadly, they don't always do that, not with Asian scripts.

And I do not completely agree, like the other poster suggested, with
the concept of GIGO. Sometimes you want the end-user's experience to
be seamless. For example, in one of our web sites, we allow users to
submit text through a bookmarklet, where the title of the webpage
comes in rawurlencoded format. We try to rawurldecode() it on our end
but most of the times the Asian interpretation is wrong. We have all
the usual mbstring settings in php.ini. In this scenario, the user did
not enter any garbage. Our application should have the ability to
recognize the text. We do what we can with mb_convert...etc, but the
database just throws an error.

PGSQL really needs to get with the program when it comes to utf-8 input.

What, exactly, does that mean?

That PostgreSQL should take things in invalid utf-8 format and just store them?
Or that PostgreSQL should autoconvert from invalid utf-8 to valid
utf-8, guessing the proper codes?

Seriously, what do you want pgsql to do with these invalid inputs?

#9Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Scott Marlowe (#8)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

What, exactly, does that mean?

That PostgreSQL should take things in invalid utf-8 format and just store them?
Or that PostgreSQL should autoconvert from invalid utf-8 to valid
utf-8, guessing the proper codes?

Seriously, what do you want pgsql to do with these invalid inputs?

PG should let me, as the administrator of the database, decide whether
I mind my DB to have an option to:

1. Either allow the "invalid" input

2. Or to continue storing the other information in the table even if
an exception was thrown for the utf-8 column (which may be an
unrequired column, for instance, so I may want it not to block the
storage of other valid input which is more important)

I am not advocating what others should do. But I know what I need my
DB to do. If I want it to store data that does not match puritanical
standards of textual storage, then it should allow me to...

It's just a wishlist item from me, but I realize my voice may be
inconsequential.

#10Scott Marlowe
scott.marlowe@gmail.com
In reply to: Phoenix Kiula (#9)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

What, exactly, does that mean?

That PostgreSQL should take things in invalid utf-8 format and just store them?
Or that PostgreSQL should autoconvert from invalid utf-8 to valid
utf-8, guessing the proper codes?

Seriously, what do you want pgsql to do with these invalid inputs?

PG should let me, as the administrator of the database, decide whether
I mind my DB to have an option to:

1. Either allow the "invalid" input

2. Or to continue storing the other information in the table even if
an exception was thrown for the utf-8 column (which may be an
unrequired column, for instance, so I may want it not to block the
storage of other valid input which is more important)

I am not advocating what others should do. But I know what I need my
DB to do. If I want it to store data that does not match puritanical
standards of textual storage, then it should allow me to...

It's just a wishlist item from me, but I realize my voice may be
inconsequential.

Can't you just create your db as SQL_ASCII and let it store anything
then? I mean if you create a database in UTF-8 you're asking it to
carefully check the input. If you set it to SQL_ASCII, you're telling
it to store anything.

#11Ben
bench@silentmedia.com
In reply to: Phoenix Kiula (#9)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On Thu, 16 Aug 2007, Phoenix Kiula wrote:

I am not advocating what others should do. But I know what I need my
DB to do. If I want it to store data that does not match puritanical
standards of textual storage, then it should allow me to...

It does allow that: store it as a BLOB, and then treat it as text in your
app.

Me personally, if I expect something to be text, I expect it to be valid
text.

#12Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Ben (#11)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On 16/08/07, Ben <bench@silentmedia.com> wrote:

On Thu, 16 Aug 2007, Phoenix Kiula wrote:

I am not advocating what others should do. But I know what I need my
DB to do. If I want it to store data that does not match puritanical
standards of textual storage, then it should allow me to...

It does allow that: store it as a BLOB, and then treat it as text in your
app.

Me personally, if I expect something to be text, I expect it to be valid
text.

This is very useful, thanks. This would be "bytea"? Quick questions:

1. Even if it were bytea, would it work with regular SQL operators
such as regexp and LIKE?

2. Would tsearch2 work with bytea in the future as long as the stuff
in it was text?

#13Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Phoenix Kiula (#12)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On 16/08/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

On 16/08/07, Ben <bench@silentmedia.com> wrote:

On Thu, 16 Aug 2007, Phoenix Kiula wrote:

I am not advocating what others should do. But I know what I need my
DB to do. If I want it to store data that does not match puritanical
standards of textual storage, then it should allow me to...

It does allow that: store it as a BLOB, and then treat it as text in your
app.

Me personally, if I expect something to be text, I expect it to be valid
text.

This is very useful, thanks. This would be "bytea"? Quick questions:

1. Even if it were bytea, would it work with regular SQL operators
such as regexp and LIKE?

2. Would tsearch2 work with bytea in the future as long as the stuff
in it was text?

I get the following error when I try to change a column that was
hitherto "varchar(500)" to bytea--

SQL error:
ERROR: column "title" cannot be cast to type "bytea"

In statement:
ALTER TABLE "links" ALTER COLUMN "title" TYPE bytea

Any thoughts on how I can overcome this limitation?

#14Ben
bench@silentmedia.com
In reply to: Phoenix Kiula (#12)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On Thu, 16 Aug 2007, Phoenix Kiula wrote:

1. Even if it were bytea, would it work with regular SQL operators
such as regexp and LIKE?

2. Would tsearch2 work with bytea in the future as long as the stuff
in it was text?

As far as I know, regexp, [i]like, tsearch2, etc. all require valid text
encodings, and so only work with text, not blobs. But I might be wrong.
I'm sure the manual would be enlightening. :)

#15Martijn van Oosterhout
kleptog@svana.org
In reply to: Phoenix Kiula (#12)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

On Thu, Aug 16, 2007 at 01:56:52AM +0800, Phoenix Kiula wrote:

This is very useful, thanks. This would be "bytea"? Quick questions:

1. Even if it were bytea, would it work with regular SQL operators
such as regexp and LIKE?

bytea is specifically designed for binary data, as such it has all
sorts of quoting rules for dealing with embedded nulls and such. It's
not quite a drop in replacement.

The earlier suggestion of SQL_ASCII is probably closer to what you
want. It does to regexes and LIKE, however postgres will treat all your
data as bytes. If you want you regexes to match Unicode character
classes that's too bad; you can't have it both ways. Sorting it goes in
byte order, you don't have a lot of choice there either.

2. Would tsearch2 work with bytea in the future as long as the stuff
in it was text?

Doubt it, SQL_ASCII would work though.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.

#16Lew
lew@lewscanon.com
In reply to: Phoenix Kiula (#9)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

Phoenix Kiula wrote:

I am not advocating what others should do. But I know what I need my
DB to do. If I want it to store data that does not match puritanical
standards of textual storage, then it should allow me to...

Instead you want it to store tyrannically-chosen alternatives to the user's
invalid input without regard for their intent.

Calling rejection of invalid input "puritanical" is not an engineering evaluation.

--
Lew