UTF-8 encoding question regarding PhpPgAdmin development

Started by Jean-Michel POURE about 23 years ago, 6 messages
#1Jean-Michel POURE
jm.poure@freesurf.fr

Dear all,

We are working on PhpPgAdmin UTF-8 support. I would like to be able to view
UTF-8, ASCII and Latin1 databases in PhpPgAdmin without changing HTML header
encodings.

I guess this can be done using:
SET CLIENT_ENCODING='Unicode'
for all PhpPgAdmin connections.

My questions are:

- Are some database encodings not translatable into UTF-8 using SET
CLIENT_ENCODING = 'Unicode'? This used to be the case for Latin1, but it has
been fixed now.

- Some letters, like the euro sign, do not belong to Latin1. Example: let's
say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
input a euro sign, does it get rejected by PostgreSQL?

- More generally, is it safe to convert an encoding (e.g. Latin1 or Chinese
multi-byte) into UTF-8 using SET CLIENT_ENCODING? Can all multi-byte
encodings be converted into/from UTF-8 safely?

Best regards,
Jean-Michel

#2Peter Eisentraut
peter_e@gmx.net
In reply to: Jean-Michel POURE (#1)
Re: UTF-8 encoding question regarding PhpPgAdmin development

Jean-Michel POURE writes:

- Are some database encodings not translatable into UTF-8 using SET
CLIENT_ENCODING = 'Unicode'? This used to be the case for Latin1, but it has
been fixed now.

It should be possible. If not, it's a bug.

- Some letters, like the euro sign, do not belong to Latin1. Example: let's
say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
input a euro sign, does it get rejected by PostgreSQL?

Currently, it gives you a warning and ignores the character. Not sure
that is ideal.

- More generally, is it safe to convert an encoding (e.g. Latin1 or Chinese
multi-byte) into UTF-8 using SET CLIENT_ENCODING? Can all multi-byte
encodings be converted into/from UTF-8 safely?

Some points to keep in mind: Some character sets contain characters that
are not in Unicode, although you might choose to ignore that fact because
it is of relatively minor importance. Round-trip conversion is not safely
possible, so if your tool provides a read/edit/write cycle then you will
have problems. Finally, when you display East Asian characters you will
have a font problem because the Chinese, Japanese, and Korean characters
are mapped to the same range in Unicode but you are supposed to use
country-specific glyphs.
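
The round-trip concern can be probed with a small check. This is only a sketch in Python, using Python's own codec tables rather than PostgreSQL's conversion tables, so results may differ from what the server does; the function name survives_round_trip is made up for illustration.

```python
def survives_round_trip(raw: bytes, db_encoding: str) -> bool:
    """Return True if raw comes back byte-for-byte after
    db_encoding -> UTF-8 -> db_encoding (Python codecs only)."""
    try:
        text = raw.decode(db_encoding)        # database bytes -> Unicode text
        utf8 = text.encode("utf-8")           # what a UTF-8 client would receive
        back = utf8.decode("utf-8").encode(db_encoding)  # edited value written back
    except UnicodeError:
        # Either direction failed: the value cannot make the trip at all.
        return False
    return back == raw


sample = "café".encode("latin-1")
print(survives_round_trip(sample, "latin-1"))
```

A tool could run such a check over sample data before enabling in-place editing; encodings with duplicate or missing mappings are where a False result would be expected.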

In short, I don't think what you are trying to do is easily achievable.

--
Peter Eisentraut peter_e@gmx.net

#3Jean-Michel POURE
jm.poure@freesurf.fr
In reply to: Peter Eisentraut (#2)
Re: UTF-8 encoding question regarding PhpPgAdmin development

Dear Peter,

Thank you very much for your answers. It rings a bell.

Finally, when you display East Asian characters you will
have a font problem because the Chinese, Japanese, and Korean characters
are mapped to the same range in Unicode but you are supposed to use
country-specific glyphs.

Do you mean that glyph hexaX will display differently in UTF-8 and EUC_JP? If
it is really the case, we cannot use UTF-8.

Round-trip conversion is not safely possible, so if your tool provides a
read/edit/write cycle then you will have problems.

Maybe we could use "getdatabaseencoding()" to determine the database encoding
and generate HTML pages with the corresponding headers. Example: Latin1
database <-> ISO-8859-1 headers.
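
A minimal sketch of that idea, in Python for brevity (PhpPgAdmin itself is PHP). The table PG_TO_HTML_CHARSET, the function content_type_header and the UTF-8 fallback are all assumptions made for illustration; only the Latin1 <-> ISO-8859-1 pairing comes from this message.

```python
# Hypothetical mapping from PostgreSQL encoding names (as reported by
# getdatabaseencoding()) to the charset label used in the HTML header.
PG_TO_HTML_CHARSET = {
    "LATIN1": "ISO-8859-1",
    "UNICODE": "UTF-8",
    "EUC_JP": "EUC-JP",
}


def content_type_header(pg_encoding: str, default: str = "UTF-8") -> str:
    """Build a Content-Type header for a page showing data from a
    database with the given encoding; unknown names use the default."""
    charset = PG_TO_HTML_CHARSET.get(pg_encoding.upper(), default)
    return f"Content-Type: text/html; charset={charset}"


print(content_type_header("LATIN1"))
```

Falling back to a default for unknown encodings is a design choice of this sketch; a real tool might prefer to refuse to render the page instead.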

The problem is that the PhpPgAdmin interface needs to be localized in several
languages, independently of the database encoding. Example: EUC_JP interface
and Latin1 databases.

Maybe a solution would be to use the ISO 10646 notation (numeric character
references) for PhpPgAdmin interface localization: "&#xH;", where H is a
hexadecimal number.
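
As an illustration of that notation, here is a rough Python sketch that rewrites every non-ASCII character of an interface string as a hexadecimal character reference. The helper name to_hex_char_refs is hypothetical, and the sketch does not escape "&", "<" or ">", which a real page generator would also need to do.

```python
def to_hex_char_refs(text: str) -> str:
    """Replace each non-ASCII character with an HTML hexadecimal
    numeric character reference; ASCII passes through unchanged."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


print(to_hex_char_refs("Prix : 10 €"))  # Prix : 10 &#x20AC;
```

The resulting strings are pure ASCII, so they can be embedded in a page regardless of which charset the header declares.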

Cheers,
Jean-Michel POURE

#4Adrian 'Dagurashibanipal' von Bidder
In reply to: Peter Eisentraut (#2)
Re: UTF-8 encoding question regarding PhpPgAdmin

On Tue, 2003-01-07 at 21:59, Peter Eisentraut wrote:

- Some letters, like the euro sign, do not belong to Latin1. Example: let's
say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
input a euro sign, does it get rejected by PostgreSQL?

Currently, it gives you a warning and ignores the character. Not sure
that is ideal.

(Yes, I should try this myself...)

Ignored as in 'passed through unchanged'; or ignored as in 'removed from
the string'?

cheers
-- vbi

--
this email is protected by a digital signature: http://fortytwo.ch/gpg

#5Peter Eisentraut
peter_e@gmx.net
In reply to: Jean-Michel POURE (#3)
Re: UTF-8 encoding question regarding PhpPgAdmin development

Jean-Michel POURE writes:

Finally, when you display East Asian characters you will
have a font problem because the Chinese, Japanese, and Korean characters
are mapped to the same range in Unicode but you are supposed to use
country-specific glyphs.

Do you mean that glyph hexaX will display differently in UTF-8 and EUC_JP? If
it is really the case, we cannot use UTF-8.

Well, it's not completely different, but customized to the language. The
Chinese, Japanese, and Korean ideographs are really the same historically
but are displayed slightly differently. If you use a country-specific
character set you probably also get a country-specific font with it, but
if you map it to Unicode then you will get whatever the default look is on
your computer. This is actually not so bad because as I understand it,
for example, a Japanese book that quotes Chinese text uses the
Japanese-look ideographs for the Chinese portions as well. But a database
administration tool is not a Japanese book, so you need to judge it.
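
The unification described above can be seen in a short Python sketch: one ideograph has different byte sequences in the Japanese, Chinese and Korean encodings, yet all of them decode to the same Unicode code point. The choice of character and of Python's codec names (euc_jp, gb2312, euc_kr) is mine; the sketch says nothing about which glyph a browser will pick, since that depends on fonts and language settings.

```python
ideograph = "山"  # "mountain", present in the Japanese, Chinese and Korean sets

for encoding in ("euc_jp", "gb2312", "euc_kr"):
    raw = ideograph.encode(encoding)   # country-specific byte sequence
    decoded = raw.decode(encoding)     # back to a Unicode string
    # The code point printed is the same for every encoding.
    print(encoding, raw.hex(), f"U+{ord(decoded):04X}")
```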

--
Peter Eisentraut peter_e@gmx.net

#6Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Adrian 'Dagurashibanipal' von Bidder (#4)
Re: UTF-8 encoding question regarding PhpPgAdmin

- Some letters, like the euro sign, do not belong to Latin1. Example: let's
say we have a Latin1 database and use SET CLIENT_ENCODING = 'Unicode'. If I
input a euro sign, does it get rejected by PostgreSQL?

Currently, it gives you a warning and ignores the character. Not sure
that is ideal.

(Yes, I should try this myself...)

Ignored as in 'passed through unchanged'; or ignored as in 'removed from
the string'?

"removed from the string". BTW, if I remember correctly, the euro sign is
supported in ISO-8859-16, not in ISO-8859-1.
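
Both points can be mimicked with Python's codecs. This is a sketch only: errors="ignore" is Python's own mechanism and merely resembles the "removed from the string" behaviour described here, and the helper can_encode is a hypothetical name.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


euro = "€"
print(can_encode(euro, "latin-1"))     # False: no euro sign in ISO-8859-1
print(can_encode(euro, "iso8859_16"))  # True: ISO-8859-16 has one

# Dropping unencodable characters, similar in effect to the behaviour above.
stripped = "10 €".encode("latin-1", errors="ignore").decode("latin-1")
print(repr(stripped))                  # '10 '
```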
--
Tatsuo Ishii