Strange UTF-8 behaviour

Started by Marco Ferretti, over 21 years ago, 6 messages, general
#1Marco Ferretti
marco.ferretti@jrc.it

Hi there all.
I am quite new to Postgres, so forgive me if this question seems
obvious.

I have created a database with the UTF-8 encoding (createdb cassa
--encoding=UTF-8).
Then I have made the following tests:

cassa=> create table test(id varchar(5));
cassa=> insert into test values ('12345');
INSERT 178725 1
cassa=> insert into test values ('123è');
INSERT 178726 1
cassa=> insert into test values ('1234è');
ERROR:  value too long for type character varying(5)

but if I try

cassa=> select '#' || id || '#' from test;
 ?column?
----------
 #12345#
 #123è#
(2 rows)

so, apparently the chars are stored the right way (#123è#) but when
trying the query the è char is parsed as 2 chars ....

The database server version is 7.3.4 on a RedHat 9 machine ...

Any clue ?

Tia
    Marco

--
Ever noticed how fast windows run ? neither did I

#2Dennis Gearon
gearond@fireserve.net
In reply to: Marco Ferretti (#1)
Re: Strange UTF-8 behaviour

My guess is that something in the chain of getting the data into the
database is measuring:

BYTES
not
CHARACTERS.
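
To illustrate the bytes-versus-characters distinction, here is a minimal Python sketch. Python is used only for illustration and is not part of the thread; the helper name char_and_byte_lengths is hypothetical. It shows the arithmetic only, not how PostgreSQL enforces varchar(5).

```python
# Compare the number of characters in a string with the number of
# bytes it occupies once encoded as UTF-8.
E_GRAVE = "\u00e8"  # e with grave accent, written as an escape to avoid editor encoding issues


def char_and_byte_lengths(text, encoding="utf-8"):
    """Return (character count, byte count in the given encoding)."""
    return len(text), len(text.encode(encoding))


for value in ["12345", "123" + E_GRAVE, "1234" + E_GRAVE]:
    chars, octets = char_and_byte_lengths(value)
    print(f"{value!r}: {chars} characters, {octets} bytes")
```

The third value has 5 characters but 6 UTF-8 bytes, so any step that counts bytes (or treats each byte as a character) would see it as too long for a limit of 5.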

"Marco Ferretti" <marco.ferretti@jrc.it> wrote:
</quote--------------------------------------->
<snip>
I have created a database with the UTF-8 encoding (createdb cassa
--encoding=UTF-8) .
Then I have made the following tests :

cassa=> create table test(id varchar(5));
cassa=> insert into test values ('12345');
INSERT 178725 1
cassa=> insert into test values ('123è');
INSERT 178726 1
cassa=> insert into test values ('1234è');
ERROR: value too long for type character varying(5)
<snip>
so, apparently the chars are stored the right way ( #123è#) but when
trying the query the è char is parsed as 2 chars ....

The database server version is 7.3.4 on a RedHat 9 machine ...

Any clue ?
</quote--------------------------------------->

#3Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Marco Ferretti (#1)
Re: Strange UTF-8 behaviour

On Thu, Sep 16, 2004 at 06:10:13PM +0200, Marco Ferretti wrote:

I am quite new to Postgres, so forgive me if this question seems
obvious.

I have created a database with the UTF-8 encoding (createdb cassa
--encoding=UTF-8) .
Then I have made the following tests :

FWIW, I can't reproduce this using 7.3.6. Is there anything special
about your 'e' character, or is it a plain 'e'?

$ createdb test --encoding=UTF-8
CREATE DATABASE
COMMENT

$ psql test
Welcome to psql 7.3.6, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit

test=# create table test (id char(5));
CREATE TABLE
test=# insert into test values ('1234e');
INSERT 16993 1
test=# create table test2 (id varchar(5));
CREATE TABLE
test=# insert into test2 values ('1234e');
INSERT 16996 1
test=# insert into test2 values ('123e');
INSERT 16997 1
test=# select '#' || id || '#', length(id) from test2;
 ?column? | length
----------+--------
 #1234e#  |      5
 #123e#   |      4
(2 rows)

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Escucha y olvidarás; ve y recordarás; haz y entenderás" (Confucio)

#4Matteo Beccati
php@beccati.com
In reply to: Alvaro Herrera (#3)
Re: Strange UTF-8 behaviour

Hi Alvaro,

FWIW, I can't reproduce this using 7.3.6. Is there anything special
about your 'e' character, or is it a plain 'e'?

Maybe you didn't get the email correctly. It was an e with grave
accent, just like this:

è (UTF-8 encoded)

I just checked on PG 7.4.3 / NetBSD, with these results:

egrave=# CREATE TABLE test (data varchar(5));
CREATE
egrave=# show server_encoding ;
 server_encoding
-----------------
 UNICODE
(1 row)

egrave=# show client_encoding ; -- don't know why it is set to unicode
client_encoding
-----------------
UNICODE
(1 row)

egrave=# INSERT INTO test VALUES ('1234è');
egrave'# '\r
Query buffer reset (cleared).
egrave=# set client_encoding = 'ISO8859-1';
SET
egrave=# show client_encoding ;
client_encoding
-----------------
ISO8859-1
(1 row)

egrave=# INSERT INTO test VALUES ('1234è');
INSERT 25340 1
egrave=# SELECT * FROM test;
 data
-------
 1234è
(1 row)

It seems everything works when the client encoding is set up correctly.
Try checking your client and server encodings.
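
One plausible way a client/server encoding mismatch could produce the original symptom can be sketched in Python. This is an assumption about what happened, not something confirmed in the thread: it supposes the terminal sent UTF-8 bytes while the declared client encoding was ISO-8859-1. The function name misread is hypothetical, and Python stands in for the conversion the server would perform.

```python
# Simulate text typed in a UTF-8 terminal but interpreted as ISO-8859-1.
E_GRAVE = "\u00e8"  # e with grave accent


def misread(text, actual="utf-8", declared="iso-8859-1"):
    """Encode text with the real terminal encoding, then decode it
    with the encoding the client claims to be using."""
    return text.encode(actual).decode(declared)


typed = "1234" + E_GRAVE
seen = misread(typed)

print(len(typed))  # 5 characters as typed
print(len(seen))   # 6 characters after the misinterpretation

# Converting back for display undoes the mistake, so a SELECT would
# still look correct on the same terminal.
echoed = seen.encode("iso-8859-1").decode("utf-8")
print(echoed == typed)
```

Under that assumption the two UTF-8 bytes of è are each taken for a separate character, so '1234è' arrives as 6 characters and is rejected by varchar(5), while '123è' arrives as 5 and is accepted.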

I've also double checked with:

egrave=# SET client_encoding = 'ISO8859-2';
SET
egrave=# SELECT * FROM test;
WARNING: ignoring unconvertible UTF-8 character 0xc3a8
data
------
1234
(1 row)
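
The warning above can be cross-checked with a short Python sketch (an illustration only; can_encode is a hypothetical helper and Python's codecs stand in for the server's conversion tables). It confirms that 0xc3a8 is the UTF-8 form of è and that ISO-8859-2 has no code point for it.

```python
# Check which single-byte encodings can represent e with grave accent.
E_GRAVE = "\u00e8"


def can_encode(text, encoding):
    """Return True if text is representable in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(E_GRAVE.encode("utf-8").hex())      # c3a8
print(can_encode(E_GRAVE, "iso-8859-1"))  # True
print(can_encode(E_GRAVE, "iso-8859-2"))  # False
```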

Best regards
--
Matteo Beccati
http://phpadsnew.com/
http://phppgads.com/

#5Matteo Beccati
php@beccati.com
In reply to: Matteo Beccati (#4)
Re: Strange UTF-8 behaviour

Hi,

è (UTF-8 encoded)

Sorry, I actually forgot to switch encoding :)

I just hope the last part of the email was readable.

Ciao ciao
--
Matteo Beccati
http://phpadsnew.com/
http://phppgads.com/

#6Marco Ferretti
marco.ferretti@jrc.it
In reply to: Matteo Beccati (#5)
Re: Strange UTF-8 behaviour

Thanks to all you guys! You really helped.

marco