character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

Started by Andreas Kalsch over 16 years ago - 12 messages - general
#1 Andreas Kalsch
andreaskalsch@gmx.de

The function "convert_to(string text, dest_encoding name)" throws an
error, and so breaks my application, when the Unicode string contains
characters that the target encoding does not support.
So what can I do
- to filter out characters that have no counterpart in the Latin codesets
- or to simply ignore the wrong characters?

Problem: Users will enter _any_ characters in my application and an
error really doesn't help in this case.

What I am searching for is a function that strips the diacritics from
special letters, reducing them to simple ones.

An example is provided at
http://www.postgres.cz/index.php/PostgreSQL_SQL_Tricks#Diacritic_removing
- but it does not work, because of that error, as soon as I put _any_
valid UTF8 character into the functions.

Best,

Andi

#2 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andreas Kalsch (#1)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"


Some people use iconv for this task:

http://markmail.org/message/2jpp7p26ohreqnsh?q=plperlu+iconv+postgresql&page=1&refer=2jpp7p26ohreqnsh
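[The semantics of the iconv approach can be imitated with Python's codecs, if only to see what it does: errors="ignore" silently drops characters that have no LATIN2 (ISO 8859-2) equivalent, much like iconv's //IGNORE suffix. A sketch, not the linked plperlu code:]

```python
# The Polish letters exist in LATIN2; U+2646 (the UTF-8 bytes
# 0xe2 0x99 0x86 from the subject line) does not.
text = "Żółw ♆"
latin2_bytes = text.encode("iso-8859-2", errors="ignore")
print(latin2_bytes.decode("iso-8859-2"))  # the unmappable character is dropped
```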

Regards
Pavel Stehule


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andreas Kalsch (#1)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

Andreas Kalsch wrote:

The function "convert_to(string text, dest_encoding name)" throws an
error, and so breaks my application, when the Unicode string contains
characters that the target encoding does not support.
So what can I do
- to filter out characters that have no counterpart in the Latin codesets
- or to simply ignore the wrong characters?

Perhaps this is useful:

http://wiki.postgresql.org/wiki/Strip_accents_from_strings

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#4 Andreas Kalsch
andreaskalsch@gmx.de
In reply to: Pavel Stehule (#2)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

So there is definitely no way to do this natively? That would be better,
because this is an easy task which should be part of the main distribution.

Which option is faster - has anyone run a benchmark?

1) Perl:
http://markmail.org/message/2jpp7p26ohreqnsh?q=plperlu+iconv+postgresql&page=1&refer=2jpp7p26ohreqnsh
2) Python: http://wiki.postgresql.org/wiki/Strip_accents_from_strings
... Other languages

Thank you,

Andi

#5 Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Andreas Kalsch (#1)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

On 3 Aug 2009, at 20:32, Andreas Kalsch wrote:

Problem: Users will enter _any_ characters in my application and an
error really doesn't help in this case.

I think the real problem is: Where do you lose the original encoding
the users input their data with? If you specify that encoding on the
connection and send it to a database that can handle UTF-8 then you
shouldn't be getting any conversion problems in the first place.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


#6 Andreas Kalsch
andreaskalsch@gmx.de
In reply to: Alban Hertroys (#5)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

Alban Hertroys schrieb:

On 3 Aug 2009, at 20:32, Andreas Kalsch wrote:

Problem: Users will enter _any_ characters in my application and an
error really doesn't help in this case.

I think the real problem is: Where do you lose the original encoding
the users input their data with? If you specify that encoding on the
connection and send it to a database that can handle UTF-8 then you
shouldn't be getting any conversion problems in the first place.

Nowhere - I will validate input data on the client side (PHP or Python)
and send it to the server. Of course the only encoding I will use on any
side is UTF8. I just wanted to use this Latin thing to simplify
characters. But it seems that there is no real solution in Postgres.
MySQL does it automagically: you can search for "Hôtel" and get "hotel",
too.

So I want to use the simplification for indexing.

My question again: is there a native Postgres solution to simplify
characters consistently? That is, to completely remove all diacritical
marks from Unicode characters.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

Oh yes - we need to care about the forest and not about every single tree ;)

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andreas Kalsch (#6)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

Andreas Kalsch wrote:

My question again: Is there a native Postgres solution to simplify
characters consistently? It means to completely remove all
diacriticals from Unicode characters.

There's a to_ascii() function, but it supports only a subset of charsets,
and IIRC UTF8 is not one of them. Patches welcome.

I will validate input data on the client side (PHP or Python) and send
it to the server. Of course the only encoding I will use on any side
is UTF8. I just wanted to use this Latin thing to simplify
characters.

Hmm, it seems you're using the wrong tool for that purpose. Converting to
a different encoding does not remove any diacritical marks; it only
changes the underlying byte representation.
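[Alvaro's point can be seen in a few lines of Python: re-encoding changes only the byte representation, and the accent itself survives:]

```python
s = "é"
assert s.encode("utf-8") == b"\xc3\xa9"              # two bytes in UTF8
assert s.encode("latin-1") == b"\xe9"                # one byte in LATIN1
assert s.encode("latin-1").decode("latin-1") == "é"  # still 'é', never 'e'
```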

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#8 Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Andreas Kalsch (#6)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

On 4 Aug 2009, at 24:57, Andreas Kalsch wrote:

I think the real problem is: Where do you lose the original
encoding the users input their data with? If you specify that
encoding on the connection and send it to a database that can
handle UTF-8 then you shouldn't be getting any conversion problems
in the first place.

Nowhere - I will validate input data on the client side (PHP or
Python) and send it to the server. Of course the only encoding I
will use on any side is UTF8. I just wnated to use this Latin thing
for simplification of characters.

Yes you are. How could your users input invalid characters in the
first place if that were not the case? You're not suggesting they
managed to enter characters in an encoding for which they weren't
valid on their own systems, do you? [1]

You say your client is using PHP or Python, which suggests it's a
website. That means the input goes like this: web browser -> website ->
database. All three of those steps use some encoding and you can
take them into account. That should prevent this problem altogether.

You have control over which encoding your client and the database use,
and the web browser tells what encoding it used in the POST request so
you can pass that along to the database when storing data or convert
it in your client.
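[Alban's "convert it in your client" amounts to decoding with the charset the request declared before anything else touches the data. A sketch in Python - normalize_post_body is a hypothetical helper, not part of any framework:]

```python
def normalize_post_body(raw: bytes, declared_charset: str = "utf-8") -> str:
    # Decode with the charset the browser declared in the POST request,
    # so everything downstream handles one canonical form (Unicode).
    return raw.decode(declared_charset)

body = "café".encode("cp1252")               # what a cp1252 browser would send
print(normalize_post_body(body, "cp1252"))   # café

# Treating the same bytes as UTF-8 gives exactly the failure described above:
try:
    body.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)
```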

[1] There exists of course a small group of people who enjoy posting
raw byte data to a web-form, but would it matter whether they'd get an
error about their encoding or not? They do not intend to enter valid
data after all ;)

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


#9 Andreas Kalsch
andreaskalsch@gmx.de
In reply to: Alban Hertroys (#8)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

Alban,

what I do to simplify the data chain:

HTTP encoding > PHP string encoding > client connection > server - all
is UTF8. Plus an invalid-byte check in PHP (or on the server).
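[The invalid-byte check Andreas mentions is, in Python terms, just an attempted strict decode - is_valid_utf8 here is an illustrative helper, not an existing API:]

```python
def is_valid_utf8(raw: bytes) -> bool:
    # A strict UTF-8 decode either succeeds or raises; that is the whole check.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("♆".encode("utf-8")))  # True  - the character from the subject line
print(is_valid_utf8(b"\xe2\x99"))          # False - truncated 3-byte sequence
```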

What I have tested inside Postgres is passing a 3-byte UTF8 character
to this function - and I got an error. This is a character I will not
filter out if some Unicode artists enter it. It is an international
website and the simplification is just for indexing.

But I think that this will not solve the problem, and I will have to use
Python or Perl to get it done.


#10 Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Andreas Kalsch (#9)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

On 4 Aug 2009, at 15:02, Andreas Kalsch wrote:

Alban,

what I do to simplify the data chain:

HTTP encoding > PHP string encoding > client connection > server -
all is UTF8. Plus invalid byte check in PHP (or server).

You're missing my point. You start dealing with the encoding of the
data you receive too late. Here's what happens:

1). User enters data in your form using his web browser. This user has
set his OS to some locale and encoding, say it's WIN1252-JP.

2). User presses "submit" and the web-browser generates a POST request
that CONTAINS the encoding the user used to enter his data.

3). Your scripts process this data as if it were UTF-8, but it ISN'T
UTF-8, it's WIN1252-JP.

4). You try to store this (now) mixedly encoded data into an UTF-8
encoded database and it fails for obvious reasons.

To solve this you can try to filter out "wrong" characters, but they
are not wrong. They're just differently encoded than you think they are.

What you should be doing is look at the encoding specified in the POST
request and either:

a.) convert the data to UTF-8 before you start processing it (both PHP
and Python have functions to do that) or,

b.) pass the encoding on to your database queries so that the database
knows how to convert them to UTF-8.

Of course it can't hurt to verify that the encoding specified in the
POST request does indeed match the posted data, but that's a different
story.

P.S. Please don't top post and keep some context of what you're
replying to. Your messages are a bit confusing the way you write them.


Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


#11 Andreas Kalsch
andreaskalsch@gmx.de
In reply to: Alban Hertroys (#10)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

I know what you are talking about, but I am not sure how many websites
really check the incoming encoding. Usually you trust that the client
uses the same encoding for sending data as the server has sent.
(This is what I mean by my simplified chain.)

It's some extra work to convert from exotic character sets, and it's the
way to go for a site with high traffic. I would prefer to do it on the
PHP/Python side and send a single encoding to the database server,
because you may additionally send strings of your own.


#12 Greg Stark
In reply to: Alban Hertroys (#10)
Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

On Tue, Aug 4, 2009 at 4:02 PM, Alban Hertroys
<dalroi@solfertje.student.utwente.nl> wrote:

P.S. Please don't top post and keep some context of what you're replying to.
Your messages are a bit confusing the way you write them.

These arguments are more convincing if you don't leave the remainder
of the OP's message quoted right below them...

--
greg
http://mit.edu/~gsstark/resume.pdf