encoding advice requested

Started by Rick Schumeyerover 19 years ago9 messagesgeneral
Jump to latest
#1Rick Schumeyer
rschumeyer@ieee.org

My database locale is en_US, and by default my databases are UTF8.

My application code allows the user to paste text into a box and submit
it to the database. Sometimes the pasted text contains non UTF8
characters, typically the "fancy" forms of quotes and apostrophes. The
database does not appreciate it when the application attempts to store
these characters.

What is the best option to deal with this problem?

a) I think I could re-create the database with a LATIN1 encoding. I'm
not real experienced with different encodings, are there any issues with
combining en_US and LATIN1?

b) I can issue a SET CLIENT_ENCODING TO 'LATIN1'; statement every time I
open a connection. A brief test indicates this will work.

c) I can figure out how to filter the text in the application
program...but this seems like wasted work considering I could use either
option A or B.

Opinions?

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Rick Schumeyer (#1)
Re: encoding advice requested

On Sat, Nov 11, 2006 at 02:45:00PM -0500, Rick Schumeyer wrote:

My database locale is en_US, and by default my databases are UTF8.

My application code allows the user to paste text into a box and submit
it to the database. Sometimes the pasted text contains non UTF8
characters, typically the "fancy" forms of quotes and apostrophes. The
database does not appreciate it when the application attempts to store
these characters.

What encoding does your application use? Whatever that is, that's what
you should be using in your SET client_encoding statement. Note you can
set that variable per user also.

b) I can issue a SET CLIENT_ENCODING TO 'LATIN1'; statement every time I
open a connection. A brief test indicates this will work.

This is the solution, assuming ofcourse your application is in Latin1.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.

#3Shoaib Mir
shoaibmir@gmail.com
In reply to: Martijn van Oosterhout (#2)
Re: encoding advice requested

Using the convert function might be of help here as well:

convert(string using conversion_name)

Change encoding using specified conversion name. Conversions can be defined
by CREATE CONVERSION. Also there are some pre-defined conversion names (
http://www.postgresql.org/docs/current/static/functions-string.html#CONVERSION-NAMES
).

select convert('PostgreSQL' using iso_8859_1_to_utf8)

Regards,
--------
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)

Show quoted text

On 11/12/06, Martijn van Oosterhout <kleptog@svana.org> wrote:

On Sat, Nov 11, 2006 at 02:45:00PM -0500, Rick Schumeyer wrote:

My database locale is en_US, and by default my databases are UTF8.

My application code allows the user to paste text into a box and submit
it to the database. Sometimes the pasted text contains non UTF8
characters, typically the "fancy" forms of quotes and apostrophes. The
database does not appreciate it when the application attempts to store
these characters.

What encoding does your application use? Whatever that is, that's what
you should be using in your SET client_encoding statement. Note you can
set that variable per user also.

b) I can issue a SET CLIENT_ENCODING TO 'LATIN1'; statement every time I
open a connection. A brief test indicates this will work.

This is the solution, assuming ofcourse your application is in Latin1.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to

litigate.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFFVkxXIB7bNG8LQkwRAtGbAJ0ZQN9xD+JtTDFHL50gIDjf6EpGBACfQi2b
Fsz8TS1Cnnh7tt6J345H8uo=
=QMqu
-----END PGP SIGNATURE-----

#4Daniel Verite
daniel@manitou-mail.org
In reply to: Rick Schumeyer (#1)
Re: encoding advice requested

Rick Schumeyer wrote:

My database locale is en_US, and by default my databases are UTF8.

My application code allows the user to paste text into a box and submit
it to the database. Sometimes the pasted text contains non UTF8
characters, typically the "fancy" forms of quotes and apostrophes. The
database does not appreciate it when the application attempts to store
these characters.

What is the best option to deal with this problem?

a) I think I could re-create the database with a LATIN1 encoding. I'm
not real experienced with different encodings, are there any issues with
combining en_US and LATIN1?
b) I can issue a SET CLIENT_ENCODING TO 'LATIN1'; statement every time I
open a connection. A brief test indicates this will work.

Be aware that "fancy" quotes and apostrophes are not representable in
LATIN1, the closest character set in which they are is probably
WIN1252. See http://en.wikipedia.org/wiki/Windows-1252, especially
characters in the 0x91-0x94 range.
Maybe your application implicitly uses this encoding, especially
if it runs under Windows, in which case the more appropriate
solution to your problem would be to set the client_encoding to
WIN1252 while keeping your database in UTF8.

--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

#5Albe Laurenz
all@adv.magwien.gv.at
In reply to: Daniel Verite (#4)
Re: encoding advice requested

My database locale is en_US, and by default my databases are UTF8.

My application code allows the user to paste text into a box and

submit

it to the database. Sometimes the pasted text contains non UTF8
characters, typically the "fancy" forms of quotes and apostrophes.

The

database does not appreciate it when the application attempts to

store

these characters.

What is the best option to deal with this problem?

a) I think I could re-create the database with a LATIN1 encoding.

I'm

not real experienced with different encodings, are there any issues

with

combining en_US and LATIN1?
b) I can issue a SET CLIENT_ENCODING TO 'LATIN1'; statement every

time I

open a connection. A brief test indicates this will work.

Be aware that "fancy" quotes and apostrophes are not representable in
LATIN1, the closest character set in which they are is probably
WIN1252. See http://en.wikipedia.org/wiki/Windows-1252, especially
characters in the 0x91-0x94 range.
Maybe your application implicitly uses this encoding, especially
if it runs under Windows, in which case the more appropriate
solution to your problem would be to set the client_encoding to
WIN1252 while keeping your database in UTF8.

This is good advice!

To add an answer to your second question:

You can
ALTER ROLE username SET client_encoding = WIN1252
to make this encoding the default for this user.

If you want to change the setting for all users connecting
to this database, you can also
ALTER DATABASE mydb SET client_encoding = WIN1252

Yours,
Laurenz Albe

#6Rick Schumeyer
rschumeyer@ieee.org
In reply to: Albe Laurenz (#5)
Re: encoding advice requested

Albe Laurenz wrote:

My database locale is en_US, and by default my databases are UTF8.

My application code allows the user to paste text into a box and

submit

it to the database. Sometimes the pasted text contains non UTF8
characters, typically the "fancy" forms of quotes and apostrophes.

The

database does not appreciate it when the application attempts to

store

these characters.

What is the best option to deal with this problem?

a) I think I could re-create the database with a LATIN1 encoding.

I'm

not real experienced with different encodings, are there any issues

with

combining en_US and LATIN1?
b) I can issue a SET CLIENT_ENCODING TO 'LATIN1'; statement every

time I

open a connection. A brief test indicates this will work.

Be aware that "fancy" quotes and apostrophes are not representable in
LATIN1, the closest character set in which they are is probably
WIN1252. See http://en.wikipedia.org/wiki/Windows-1252, especially
characters in the 0x91-0x94 range.
Maybe your application implicitly uses this encoding, especially
if it runs under Windows, in which case the more appropriate
solution to your problem would be to set the client_encoding to
WIN1252 while keeping your database in UTF8.

This is good advice!

To add an answer to your second question:

You can
ALTER ROLE username SET client_encoding = WIN1252
to make this encoding the default for this user.

If you want to change the setting for all users connecting
to this database, you can also
ALTER DATABASE mydb SET client_encoding = WIN1252

Yours,
Laurenz Albe

I will have to try the WIN1252 encoding.

On the client side, my application is a web browser. On the server
side, it is php scripts on a linux box. The data comes from copying
data from a browser window (pointing to another web site) and pasting it
into an html textarea, which is then submitted.

Given this, would you still suggest the WIN1252 encoding?

#7Marcus Engene
mengpg@engene.se
In reply to: Rick Schumeyer (#6)
Re: encoding advice requested

Rick Schumeyer skrev:

I will have to try the WIN1252 encoding.

On the client side, my application is a web browser. On the server
side, it is php scripts on a linux box. The data comes from copying
data from a browser window (pointing to another web site) and pasting it
into an html textarea, which is then submitted.
Given this, would you still suggest the WIN1252 encoding?

In my setup I compiled php with
--enable-zend-multibyte
...which makes all strings unicode internally (I suppose they use
wchar_t instead of char or something). Thus mb_*() are [from what I can
tell] not necessary [for me] anymore. Do use a fairly recent php, not
only for bind variables in the pg api.

In php.ini i've got
default_charset = "utf-8"
mbstring.internal_encoding = UTF-8;

in the html head:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The db is in utf-8.

Flawlessly it has saved everything I've tossed at it, including all
sorts of apostrophes. I've copy & pasted chinese, hebrew, swedish,
arabic... texts into <textarea> with no other problem that hebrew and
arabic makes most sense written from right to left ;-)

Best regards,
Marcus

#8Daniel Verite
daniel@manitou-mail.org
In reply to: Rick Schumeyer (#6)
Re: encoding advice requested

Rick Schumeyer wrote:

I will have to try the WIN1252 encoding.

On the client side, my application is a web browser. On the server
side, it is php scripts on a linux box. The data comes from copying
data from a browser window (pointing to another web site) and pasting it
into an html textarea, which is then submitted.

Given this, would you still suggest the WIN1252 encoding?

No, sticking to utf-8 is safer. Because in the context you describe, it's the
browser that decides the character set and encoding of the textarea data it has
to submit to the HTTP server. There's a problem when the page that contains the
textarea is US-ASCII for example, but the user pastes some non US-ASCII
characters. Then the browser has to choose a non US-ASCII encoding for the
data, possibly one that the server-side script doesn't expect. I assume this is
what happens in your case and the reason of the error you're getting. An easy
solution is to use utf-8 for the webpage, so the browser won't have to switch
to another encoding since every character is supposed to have a representation
in utf-8, "fancy quotes" and everything else.
Also, you'll find this extensively and better explained in this article, for
example:
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

#9Martijn van Oosterhout
kleptog@svana.org
In reply to: Daniel Verite (#8)
Re: encoding advice requested

On Tue, Nov 14, 2006 at 12:01:44AM +0100, Daniel Verite wrote:

Also, you'll find this extensively and better explained in this article, for
example:
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

This is a *really* good article about character sets and form
submission. Especially the tip about how to get the browser to tell you
what encoding it used.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.