Unicode problem ???
Hi everyone,
I have a PostgreSQL 7.4 database running, which was initdb-ed using the standard
(SQL_ASCII) encoding, with the --lc-collate=C option set.
Everything is running fine, but I just discovered something funny. If text
containing characters like ë is inserted into the database (text field), the
inserting goes fine. When I view this data using phpPgAdmin (v3.2.1) it
looks fine, including these special characters.
HOWEVER, when I get this data from within a Delphi application, using
psqlODBC (07.03.0200), the special characters get corrupted (ë becomes
something like Ã«). It seems to be a Unicode problem.
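That corruption pattern is the classic signature of an encoding mismatch: UTF-8 bytes being read as Latin-1. A minimal sketch of the likely mechanism (plain Python, nothing PostgreSQL-specific; that this is what the ODBC layer is doing is an assumption):

```python
# 'ë' encoded as UTF-8 is two bytes: 0xC3 0xAB.
utf8_bytes = "ë".encode("utf-8")
assert utf8_bytes == b"\xc3\xab"

# A client that assumes Latin-1 decodes those two bytes as two
# separate characters, 'Ã' and '«' -- the reported mojibake.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # Ã«

# The reverse mistake fails outright: a lone Latin-1 0xEB byte
# is not a valid UTF-8 sequence.
latin1_bytes = "ë".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```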
What do you think, should I re-create the database using UNICODE encoding?
I'd like to know beforehand if this will solve the problem, since I do not
want to recreate the database just to find out that it does not fix things.
Or does the problem involve psqlODBC? I looked into some of its settings,
but I can't find anything that involves character encoding...
I hope someone can give me a pointer here.
Thanks in advance,
Alexander Priem
The Netherlands.
I'm a Delphi (7 Enterprise) user myself. I've run into that sort of encoding problem too. On the Postgres backend I use UNICODE.
I'll tell you what I know; I hope it helps a bit.
First off: there seems to be a difference between the datatypes string and widestring in Delphi. widestring is recommended for use with 'wide byte strings' (or something like that), for example UNICODE. string variables, however, can't be trusted for the full 100% in that case. I don't know exactly how it works, but I'm sure some browsing in the Delphi Help and other resources will explain things.
Second: I know that in my case I have to set the CLIENT_ENCODING to 'LATIN1' in my ODBC connect string to make sure that all the characters are translated correctly.
I'm sorry that I can't be more specific about these things since I don't remember exactly WHY they are done, but I do know they help :-) This was all set up and figured out by one of my colleagues a while ago. I wasn't exactly into the database itself, but restricted my work to Delphi coding and general development.
Therefore I hope that someone else can shed more light on the matter so I can learn a bit too :-)
Kind Regards,
Stijn Vanroye.
On Wednesday, 21 April 2004, 14:37, Priem, Alexander wrote:
I have a PostgreSQL 7.4 database running, which was initdb-ed using the
standard (SQL_ASCII) encoding, with the --lc-collate=C option set.
Everything is running fine, but I just discovered something funny. If text
containing characters like ë is inserted into the database (text field),
the inserting goes fine. When I view this data using phpPgAdmin (v3.2.1) it
looks fine, including these special characters.
You can try to update pg_database to change the database encoding to LATIN1.
SQL_ASCII is only totally supported for, well, ASCII characters.
I just tried something else. If I use SQL Explorer (a Borland tool for
manipulating/viewing databases through DSNs) to look at this database,
using the same (psqlODBC) DSN, I also get the wrong characters. So it seems
Delphi doesn't have anything to do with it.
On the other hand, SQL Explorer is also a Borland product...
But I also just tried to view this field as a widestring. I queried the
database and used ShowMessage(VarToWideStr(rs.Fields[1].Value)) to view the
text. Didn't help...
I also tried to change the client_encoding via SQL. I just gave a "set
client_encoding to 'unicode'" command, followed by the command to retrieve
the text. Also didn't help. "set client_encoding to 'latin1'" didn't help
either.
Is this the correct way to change the encoding or do I need to rebuild the
database? Maybe I am not really changing the encoding this way???
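One likely reason changing client_encoding has no visible effect here: under SQL_ASCII the server treats stored text as opaque bytes, so it has no way to convert it on the way out. A hedged illustration of why bytes without a declared encoding can't be reliably converted (plain Python; modeling the server's situation is an assumption, this is not PostgreSQL code):

```python
# Suppose the client originally sent 'ë' as a single Latin-1 byte
# and the server stored it verbatim (SQL_ASCII = no interpretation).
stored = b"\xeb"

# To re-encode for a client, the server would first need to know
# what the byte *means* -- and under SQL_ASCII it doesn't.
# The same byte is 'ë' in Latin-1 but invalid as UTF-8:
assert stored.decode("latin-1") == "ë"
try:
    stored.decode("utf-8")
except UnicodeDecodeError:
    print("0xEB alone is not valid UTF-8")
```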
Alexander Priem.
Hi Stijn,
I tried adding "set client_encoding to 'LATIN1'" to the Connect Settings of
psqlODBC. It didn't help. I still get the same weird characters when I look
at the data using SQL explorer.
Could this be due to the fact that the database was CREATED using SQL_ASCII
encoding? Maybe your solution only works when the database was created using
LATIN1 or UNICODE encoding.
The strange thing is that everything is FINE when I use phpPgAdmin to look
at the data, so there must be a conversion problem somewhere in the ODBC
layer...
Maybe I'll just try recreating the database using UNICODE or LATIN1 encoding
tomorrow. UNICODE seems the best, right? Or does LATIN1 have more
possibilities than UNICODE?
Greetings from Veenendaal,
Alexander.
Hi Stijn,
Hi Alexander,
I tried adding "set client_encoding to 'LATIN1'" to the Connect Settings of
psqlODBC. It didn't help. I still get the same weird characters when I look
at the data using SQL Explorer.
Damn (pardon me). I really hoped that would do the trick.
Could this be due to the fact that the database was CREATED using SQL_ASCII
encoding? Maybe your solution only works when the database was created using
LATIN1 or UNICODE encoding.
Could be; as I said, I use UNICODE. According to the email from Peter Eisentraut on the same topic:
//Quote
SQL_ASCII is only totally supported for, well, ASCII characters.
//End Quote
Sounds to me like SQL_ASCII is more limited than UNICODE.
The strange thing is that everything is FINE when I use phpPgAdmin to look
at the data, so there must be a conversion problem somewhere in the ODBC
layer...
I guess so too. I think that's a question for the developers.
In any case there's a difference between what the ODBC driver does and what phpPgAdmin does.
I've always used pgAdmin (currently III). No problems there either.
Maybe I'll just try recreating the database using UNICODE or LATIN1 encoding
tomorrow. UNICODE seems the best, right? Or does LATIN1 have more
possibilities than UNICODE?
From what I hear, UNICODE indeed seems the best option. But then again, that encoding stuff is still a bit of a mystery to me.
What I personally don't understand is: if all my databases are UNICODE, why do I have to set the client encoding to LATIN1 to get a correct result?
Stijn.
Priem, Alexander said:
Could this be due to the fact that the database was CREATED using SQL_ASCII
encoding? Maybe your solution only works when the database was created using
LATIN1 or UNICODE encoding.
Yes, I suspect so.
Maybe I'll just try recreating the database using UNICODE or LATIN1 encoding
tomorrow. UNICODE seems the best, right? Or does LATIN1 have more
possibilities than UNICODE?
Unicode has a larger character set than Latin-1. But if you only need to
support Latin-1 then use that... Of course, if you have to upgrade to
support Unicode later, you'll wish you had started off using Unicode!
I'm no expert on this, so I hope this info is correct.
John Sidney-Woollett
I am also using a Postgres database with Delphi. However, I don't use ODBC;
I use the Zeos database components for Delphi.
I had similar problems though. My database is in iso8859-7 and I needed
Greek chars to appear correctly in my application. What I did is run an SQL
statement when my application is starting up (after my datasource connection
to the db etc.) that sets my client's encoding to iso8859-7. No problems
since then... Greek chars show OK, they are stored OK, and the search on the
data works fine.
Hope I helped a little
Alexander Antonakakis
In article <200404211522.39036.peter_e@gmx.net>,
Peter Eisentraut <peter_e@gmx.net> writes:
You can try to update pg_database to change the database encoding to LATIN1.
SQL_ASCII is only totally supported for, well, ASCII characters.
Or even better to LATIN9. This also includes the Euro symbol and the
French OE ligature forgotten in LATIN1.
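The difference is easy to check: the Euro sign and the oe ligature have code points in ISO 8859-15 (LATIN9) but not in ISO 8859-1 (LATIN1). A quick sketch in Python (the codec names are Python's, not PostgreSQL encoding names):

```python
# LATIN9 (ISO 8859-15) added € and œ/Œ, which LATIN1 lacks.
for ch in "€œŒ":
    encoded = ch.encode("iso8859-15")   # works: one byte each
    assert len(encoded) == 1
    try:
        ch.encode("iso8859-1")
    except UnicodeEncodeError:
        print(f"{ch!r} has no LATIN1 code point")
```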
Stijn Vanroye wrote:
From what I hear, UNICODE indeed seems the best option. But then again, that
encoding stuff is still a bit of a mystery to me.
The following is cut from the documentation of XML Ada
(xmlada-1.0/docs/xml_2.html#SEC6)
( which is available at http://libre.act-europe.fr/xmlada/)
<quote>
We now know how each encoded character can be represented by an integer
value (code point) depending on the character set.
Character encoding schemes deal with the representation of a sequence of
integers to a sequence of code units. A code unit is a sequence of bytes on
a computer architecture.
There exists a number of possible encoding schemes. Some of them encode all
integers on the same number of bytes. They are called fixed-width encoding
forms, and include the standard encoding for Internet emails (7bits, but it
can't encode all characters), as well as the simple 8bits scheme, or the
EBCDIC scheme. Among them is also the UTF-32 scheme which is defined in the
Unicode standard.
Another set of encoding schemes encode integers on a variable number of
bytes. These include two schemes that are also defined in the Unicode
standard, namely Utf-8 and Utf-16.
Unicode doesn't impose any specific encoding. However, it is most often
associated with one of the Utf encodings. They each have their own
properties and advantages:
Utf32
This is the simplest of all these encodings. It simply encodes all the
characters on 32 bits (4 bytes). This encodes all the possible characters
in Unicode, and is obviously straightforward to manipulate. However, given
that the first 65535 characters in Unicode are enough to encode all known
languages currently in use, Utf32 is also a waste of space in most cases.
Utf16
For the above reason, Utf16 was defined. Most characters are only encoded on
two bytes (which is enough for the first 65535 and most current
characters). In addition, a number of special code points have been
defined, known as surrogate pairs, that make the encoding of integers
greater than 65535 possible. The integers are then encoded on four bytes.
As a result, Utf16 is thus much more memory-efficient and requires less
space than Utf32 to encode sequences of characters. However, it is also
more complex to decode.
Utf8
This is an even more space-efficient encoding, but is also more complex to
decode. More important, it is compatible with the most currently used
simple 8bit encoding.
Utf8 has the following properties:
Characters 0 to 127 (ASCII) are encoded simply as a single byte. This means
that files and strings which contain only 7-bit ASCII characters have the
same encoding under both ASCII and UTF-8.
Characters greater than 127 are encoded as a sequence of several bytes, each
of which has the most significant bit set. Therefore, no ASCII byte can
appear as part of any other character.
The first byte of a multibyte sequence that represents a non-ASCII character
is always in the range 0xC0 to 0xFD and it indicates how many bytes follow
for this character. All further bytes in a multibyte sequence are in the
range 0x80 to 0xBF. This allows easy resynchronization and makes the
encoding stateless and robust against missing bytes.
UTF-8 encoded characters may theoretically be up to six bytes long, however
the first 16-bit characters are only up to three bytes long.
Note that the encodings above, except for Utf8, have two versions, depending
on the chosen byte order on the machine.
</quote>
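The UTF-8 properties quoted above are easy to verify directly. A small check in Python (the byte ranges come from the quote; note that since Unicode was later capped at U+10FFFF, lead bytes in practice stop at 0xF4, which the quote's 0xFD bound predates):

```python
# Property 1: pure ASCII encodes to identical bytes under UTF-8.
assert "hello".encode("utf-8") == "hello".encode("ascii")

# Property 2: every byte of a non-ASCII character has the high bit
# set, so no ASCII byte can appear inside a multibyte sequence.
for ch in "ëé€":
    assert all(b >= 0x80 for b in ch.encode("utf-8"))

# Property 3: the lead byte is >= 0xC0 and all continuation bytes
# fall in 0x80..0xBF, which allows mid-stream resynchronization.
seq = "€".encode("utf-8")          # three bytes for U+20AC
assert seq[0] >= 0xC0
assert all(0x80 <= b <= 0xBF for b in seq[1:])

# Characters up to U+FFFF need at most three bytes, as the quote says.
assert len("\uffff".encode("utf-8")) == 3
```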
So yes, Unicode in Utf8 is tricky to handle
/Björn
What I personally don't understand is: if all my databases are UNICODE, why
do I have to set the client encoding to LATIN1 to get a correct result?
Because LATIN1 isn't just a subset of UNICODE. If the data
coming out of the database *is* UNICODE *and* your client
*does* handle UNICODE directly you might get away with not
setting a client_encoding, same goes for DB=latin1 +
client=latin1. If, however, the DB delivers UNICODE but your
client really wants LATIN1 you need to tell the database to
convert the stored UNICODE to LATIN1 before delivery. That's
what the client_encoding is for.
Or so is my understanding of it.
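Karsten's description can be pictured as a transcoding step at delivery time. A rough model of what client_encoding conversion amounts to (plain Python; the function name is made up for illustration and is not a PostgreSQL API):

```python
def deliver(stored_utf8: bytes, client_encoding: str) -> bytes:
    # Hypothetical model of the server's job: it knows the stored
    # data is UTF-8, decodes it, and re-encodes for the client.
    text = stored_utf8.decode("utf-8")
    return text.encode(client_encoding)

stored = "ë".encode("utf-8")                    # b'\xc3\xab' on disk
assert deliver(stored, "latin-1") == b"\xeb"    # one Latin-1 byte
assert deliver(stored, "utf-8") == b"\xc3\xab"  # passthrough
```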
Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
So if I understand correctly, I should try the following:
* Create a new database, using UNICODE encoding, since this supports the
largest character set (incl. Euro sign etc.)
* Then, set the client encoding to LATIN9, so that clients get support for
special Dutch characters. The client encoding can also be changed
on the fly, so it won't matter if I don't get this right the first time.
Most important is the database (cluster) encoding...
Am I right in this?
Thanks for all the help.
Alexander Priem.
Karsten Hilbert wrote:
Thanks, that's a pretty clear explanation.
Only one question:
Wouldn't it be better if I just set my client encoding to UNICODE instead of LATIN1? I suppose the UNICODE encoding is understood by Windows (and Delphi), since I write programs for a Windows environment.
Stijn.
On Thursday, 22 April 2004, 09:48, Stijn Vanroye wrote:
Wouldn't it be better if I just set my client encoding to UNICODE instead
of LATIN1? I suppose the UNICODE encoding is understood by Windows (and
Delphi), since I write programs for a Windows environment.
That really depends exclusively on what your client program/environment can
tolerate. You may have to set your console, widget set or desktop to Unicode
or something like that.
Peter Eisentraut wrote:
That really depends exclusively on what your client program/environment can
tolerate. You may have to set your console, widget set or desktop to Unicode
or something like that.
Ok, thanks. All I've got to do now is find out if Windows and/or Delphi understand UNICODE.
Stijn.
Hi everyone,
I solved the problem with the special characters (thanks to you all).
I recreated my entire database (luckily I keep scripts for table/index/view
creation) and initdb-ed it using --lc-collate=C --encoding=UNICODE. In my
psqlODBC DSN settings I added "set client_encoding='LATIN9';" to the Connect
Settings and that solved all my problems regarding the special characters.
Thanks a lot, everyone!
Alexander Priem.
In article <71E201BE5E881C46811BA160694C5FCB0FA934@fs1000.farcourier.com>,
"Stijn Vanroye" <s.vanroye@farcourier.com> writes:
Ok, thanks. All I've got to do now is find out if Windows and/or Delphi understand UNICODE.
I'd say you should first decide if you really need Unicode. If you're
dealing exclusively with English/French/Spanish/German or so and if
you're pretty sure you'll never touch Polish/Russian/Chinese, you can
stick to Latin-1 or Latin-9 and happily ignore Unicode.
But wouldn't it be better to pick unicode just in case? Or would Latin-1 /
Latin-9 perform better, compared to Unicode?
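For what it's worth, one concrete difference behind that question is storage width: Latin-1/Latin-9 are fixed one-byte encodings, while UTF-8 spends two or more bytes on each non-ASCII character. A plain-Python comparison (actual PostgreSQL performance depends on far more than byte counts, so treat this as a sketch of one factor only):

```python
text = "Crème brûlée à Veenendaal"
latin1_size = len(text.encode("latin-1"))
utf8_size = len(text.encode("utf-8"))
# UTF-8 is larger whenever accented characters appear...
assert utf8_size > latin1_size
# ...but ASCII-only text costs exactly the same in both encodings.
assert len("hello".encode("latin-1")) == len("hello".encode("utf-8"))
```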
Alexander Priem