Unicode confusion

Started by Chris Palmer almost 23 years ago (6 messages, general)
#1 Chris Palmer
chris.palmer@geneed.com

Hello,

As you can see, at:

http://www.nodewarrior.org/chris/unicode.png

I am very confused. My database is set to Unicode encoding:

psql -p 9000 -l

               List of databases
   Name    |         Owner          | Encoding
-----------+------------------------+-----------
 japanese  | GENEEDINC+chris.palmer | EUC_JP
 template0 | GENEEDINC+chris.palmer | SQL_ASCII
 template1 | GENEEDINC+chris.palmer | SQL_ASCII
 test      | GENEEDINC+chris.palmer | SQL_ASCII
 unicode   | GENEEDINC+chris.palmer | UNICODE

I'm using Pg 7.3.2 and pg73jdbc3.jar.

According to *The Java Programming Language, Third Edition* (p. 138), "...you can use the escape sequence \uxxxx to encode Unicode characters, where each x is a hexadecimal digit...". Therefore, shouldn't I see "262f 0b87" in the hex editor? It seems I'm not getting the same stuff out that I am putting in. psql is not much help; it just shows wacky characters (4 of them: "â¯à®").

Am I doing something wrong? Does something need to be set in the database or in the JDBC Connection object? Or am I just a confused monkey?

Thanks in advance for any clues!

--
Chris Palmer Systems Programmer GeneEd

#2 Ian Lawrence Barwick
barwick@gmail.com
In reply to: Chris Palmer (#1)
Re: Unicode confusion

On Saturday 10 May 2003 01:47, Chris Palmer wrote:

Hello,

(...)

According to *The Java Programming Language, Third Edition* (p. 138),
"...you can use the escape sequence \uxxxx to encode Unicode characters,
where each x is a hexadecimal digit...". Therefore, shouldn't I see "262f
0b87" in the hex editor? It seems I'm not getting the same stuff out that I
am putting in. psql is not much help; it just shows wacky characters (4 of
them: "â¯à®").

Am I doing something wrong? Does something need to be set in the database
or in the JDBC Connection object? Or am I just a confused monkey?

If it's any help, your code should work as expected. The hex data you see
(3F3F0A) is two question marks and an \n; I would guess Java is not able to
display the unicode characters in your environment and is replacing them with
'?'.

PostgreSQL stores Unicode internally as UTF-8, so if you view the
data with psql in a non-unicode-environment, you will probably be
seeing the UTF-8 byte values expressed in whatever 8 bit characters
your terminal uses.
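
To make the byte-level picture concrete, here is a minimal sketch (not part of the original exchange). The class name Utf8Bytes and the helper toHex are hypothetical; only standard java.nio.charset calls are used. It encodes the two characters from the question in three charsets:

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Render a byte array as space-separated lowercase hex, e.g. "e2 98 af".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u262f\u0b87"; // the two characters from the question

        // UTF-8: three bytes per character here
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));    // e2 98 af e0 ae 87
        // UTF-16BE: the code units written out directly
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE))); // 26 2f 0b 87
        // US-ASCII cannot represent either character, so each becomes '?'
        System.out.println(toHex(s.getBytes(StandardCharsets.US_ASCII))); // 3f 3f
    }
}
```

The UTF-16BE row is the "262f 0b87" the question expected. The UTF-8 row is what one would expect a UNICODE database to hand to psql. The 3f 3f row matches the two question marks in the hex dump (the trailing 0A being the newline).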

Ian Barwick
barwick@gmx.net

#3 Chris Palmer
chris.palmer@geneed.com
In reply to: Ian Lawrence Barwick (#2)
Re: Unicode confusion

Ian Barwick writes:

If it's any help, your code should work as expected. The hex
data you see (3F3F0A) is two question marks and an \n; I would
guess Java is not able to display the unicode characters in your
environment and is replacing them with '?'.

What part of my environment are you referring to? It's not the terminal emulator (which Java has no knowledge of). Java does Unicode (in fact, there is no other choice). Is there some locale setting I can use? Is there a parameter I can use with the Pg JDBC driver or Connection object?

Thanks for your response.

--
Chris Palmer Systems Programmer GeneEd

#4 Ian Lawrence Barwick
barwick@gmail.com
In reply to: Chris Palmer (#3)
Re: Unicode confusion

On Monday 12 May 2003 20:49, Chris Palmer wrote:

Ian Barwick writes:

If it's any help, your code should work as expected. The hex
data you see (3F3F0A) is two question marks and an \n; I would
guess Java is not able to display the unicode characters in your
environment and is replacing them with '?'.

What part of my environment are you referring to? It's not the terminal
emulator (which Java has no knowledge of). Java does Unicode (in fact,
there is no other choice). Is there some locale setting I can use? Is there
a parameter I can use with the Pg JDBC driver or Connection object?

OK, put it another way: Java is not able to or does not want to print
the specified Unicode characters (the Yin / Yang symbol and something
squiggly IIRC) to your STDOUT. This is nothing to do with the JDBC connection.
I presume Java looks at your locale setting, maybe Google knows the
answer ;-).
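
As a rough illustration of the "environment" in question (a sketch, not from the original exchange): a PrintStream turns characters into bytes using a charset, and System.out on JVMs of that era picked the platform default, which is derived from the locale. The class StdoutEncodingDemo and the method bytesWritten are hypothetical names. The US-ASCII stream stands in for a non-Unicode default locale, which is an assumption about the poster's machine.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class StdoutEncodingDemo {
    // Push text through a PrintStream that encodes with the named charset
    // and return the bytes it emitted as space-separated lowercase hex.
    static String bytesWritten(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream ps = new PrintStream(sink, true, charsetName);
            ps.print(text);
            ps.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // What this JVM would use by default; varies by machine and Java version.
        System.out.println("default charset: " + Charset.defaultCharset());

        String text = "\u262f\u0b87\n";
        // Unmappable characters are replaced with '?' (0x3f).
        System.out.println("US-ASCII: " + bytesWritten(text, "US-ASCII")); // 3f 3f 0a
        // An explicit UTF-8 stream keeps both characters.
        System.out.println("UTF-8:    " + bytesWritten(text, "UTF-8"));    // e2 98 af e0 ae 87 0a
    }
}
```

The replacement happens inside the stream's encoder, so it does not matter whether stdout is a terminal or a file.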

Using a UTF-8 capable terminal (I use mlterm or konsole, no idea what options
there are in Windows) the characters retrieved from Postgres and which are now
in Java's internal Unicode encoding (which one I don't recall) can be
displayed by converting them into UTF-8.

Ian Barwick
barwick@gmx.net

#5 Chris Palmer
chris.palmer@geneed.com
In reply to: Ian Lawrence Barwick (#4)
Re: Unicode confusion

Ian Barwick writes:

OK, put it another way: Java is not able to or does not want to print
the specified Unicode characters (the Yin / Yang symbol and something
squiggly IIRC) to your STDOUT.

In the example I gave in my first post, stdout was a file. Shouldn't Java just write out bytes without trying to get smart? Especially if stdout is not a tty? But see below.

This is nothing to do with the JDBC connection.
I presume Java looks at your locale setting, maybe Google knows the
answer ;-).

Yes, I'm searching now.

Using a UTF-8 capable terminal (I use mlterm or konsole, no idea what options
there are in Windows) the characters retrieved from Postgres and which are now
in Java's internal Unicode encoding (which one I don't recall) can be
displayed by converting them into UTF-8.

===
ps = new PrintStream(System.out, true, "UTF-8");
...
// this line might look strange to you if your mailer shows it differently than mine does:
s.executeUpdate("INSERT INTO test (chug) VALUES ('¤ä´©¬O¬°¤FÅý')");
s.executeUpdate("INSERT INTO test (chug) VALUES ('testing')");
s.executeUpdate("INSERT INTO test (chug) VALUES ('\u262f\u0b87')");
...
ps.println(rs.getString("chug"));
===

I'm no Java expert, so if that's not a good way to get UTF-8-encoded output, please let me know. When I try it, I get:

===

java Noodle > goo
cat goo

¤ä´©¬O¬°¤FÃ
ý
testing
â¯à®
===

I installed KDE on our Linux machine (the one running Java and Pg) and got similar results using konsole. (Fwiw, I am using PuTTY on Windows to connect to Linux.)

===
¤ä´©¬O¬°¤FÃý
testing
â¯à®
===

Note the lack of the newline in the middle of the first result.

In either case, konsole or PuTTY, I am not getting back what I put in (the first s.executeUpdate(...), above).

In psql, the result of "select * from test" looks the same as it does when output by the Noodle Java program.

Fwiw, I do have the encoding of this database set to UNICODE:

===

psql -p 9000 -l

               List of databases
   Name    |         Owner          | Encoding
-----------+------------------------+-----------
 japanese  | GENEEDINC+chris.palmer | EUC_JP
 template0 | GENEEDINC+chris.palmer | SQL_ASCII
 template1 | GENEEDINC+chris.palmer | SQL_ASCII
 test      | GENEEDINC+chris.palmer | SQL_ASCII
 unicode   | GENEEDINC+chris.palmer | UNICODE
===

I am much more confused now than I ever have been. :)

--
Chris Palmer Systems Programmer GeneEd

#6 Ian Lawrence Barwick
barwick@gmail.com
In reply to: Chris Palmer (#5)
Re: Unicode confusion

On Tuesday 13 May 2003 00:35, Chris Palmer wrote:
(...)

===
ps = new PrintStream(System.out, true, "UTF-8");
...
// this line might look strange to you if your mailer shows it differently than mine does:
s.executeUpdate("INSERT INTO test (chug) VALUES ('¤ä´©¬O¬°¤FÅý')");
s.executeUpdate("INSERT INTO test (chug) VALUES ('testing')");
s.executeUpdate("INSERT INTO test (chug) VALUES ('\u262f\u0b87')");
...
ps.println(rs.getString("chug"));
===

I'm no Java expert, so if that's not a good way to get UTF-8-encoded
output, please let me know. When I try it, I get:

===

java Noodle > goo
cat goo

¤ä´©¬O¬°¤FÃ
ý
testing
â¯à®
===

I installed KDE on our Linux machine (the one running Java and Pg) and got
similar results using konsole. (Fwiw, I am using PuTTY on Windows to
connect to Linux.)

===
¤ä´©¬O¬°¤FÃý
testing
â¯à®
===

Note the lack of the newline in the middle of the first result.

In either case, konsole or PuTTY, I am not getting back what I put in (the
first s.executeUpdate(...), above).

Err, yes you are. Just encoded differently (UTF-8 vs. whatever Java
uses, I would guess UCS2 or UTF16). The bytes are now getting dumped to the
display, just the display does not know that they are UTF-8. Before starting
konsole you may need to set your locale. (No idea whether putty is Unicode
capable).
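
A small sketch of that "same bytes, wrong interpretation" effect (not from the original exchange). It assumes the display treats incoming bytes as ISO-8859-1, which is a guess about the terminal in question. The class MisreadUtf8 and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class MisreadUtf8 {
    // Encode as UTF-8, then decode those bytes as if they were ISO-8859-1.
    // This mimics a display that does not know the bytes are UTF-8.
    static String utf8ReadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo the misreading: recover the original bytes and decode them as UTF-8.
    static String latin1ReadAsUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Drop control characters, which a terminal typically does not draw.
    static String printableOnly(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (!Character.isISOControl(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u262f\u0b87";
        String misread = utf8ReadAsLatin1(original);

        System.out.println(misread.length());                           // 6
        System.out.println(printableOnly(misread).length());            // 4
        System.out.println(latin1ReadAsUtf8(misread).equals(original)); // true
    }
}
```

Six UTF-8 bytes become six Latin-1 characters, two of which are control codes, leaving the four visible ones ("â¯à®") reported earlier in the thread. Because the round trip restores the original string, nothing was lost; only the interpretation differs.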

In psql, the result of "select * from test" looks the same as it does when
output by the Noodle Java program.

Fwiw, I do have the encoding of this database set to UNICODE:

This is expected behaviour. Have you looked to see what encoding
Postgres uses to store Unicode?

Anyway, the obvious question is: have you tried printing the strings
you are currently passing through Postgres directly?
( ps.println("\u262f\u0b87"); ?) Do they appear any differently?

Ian Barwick
barwick@gmx.net