Re: BUG #1775: UTF8 to ISO does not convert the german "sharp s" (ß )

Started by Magnus Haganderover 20 years ago14 messagesbugs
Jump to latest
#1Magnus Hagander
magnus@hagander.net

Bug reference: 1775
Logged by: Marcus Raphelt
Email address: marcus@raphelt.de
PostgreSQL version: latest
Operating system: WinXP (dev), Win2000 (server)
Description: UTF8 to ISO does not convert the german "sharp s"
(ß)
Details:

Hi,

I currently have a iso_8859_1 app and included the opengeodb
tables to have a set of german cities and zipcodes.

http://opengeodb.sourceforge.net/index.php?FrontPage_en
I downloaded this dump:

http://prdownloads.sourceforge.net/opengeodb/opengeodb-0.2.3c-
UTF8-postgres.
zip?download

They offer a Postgresql dump that can be imported without
hassle, but is in UTF8. I did several attempts using
convert() to convert the city names to iso. It works, and it
also works fine on german umlauts (äöü), but it seems like
convert() does NOT convert the german "ß" / sharp s / ß...

PostgreSQL 8.0 does not support UTF8 on Win32. You will need to convert the file before you load it into the database - look at for example the GNU iconv program to do this.

//Magnus

#2Peter Eisentraut
peter_e@gmx.net
In reply to: Magnus Hagander (#1)

Magnus Hagander wrote:

PostgreSQL 8.0 does not support UTF8 on Win32. You will need to
convert the file before you load it into the database - look at for
example the GNU iconv program to do this.

PostgreSQL does not support UTF8 locales on Windows, but the character
conversion should work the same on all platforms, shouldn't it?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#3Magnus Hagander
magnus@hagander.net
In reply to: Peter Eisentraut (#2)

PostgreSQL 8.0 does not support UTF8 on Win32. You will need to
convert the file before you load it into the database - look at for
example the GNU iconv program to do this.

PostgreSQL does not support UTF8 locales on Windows, but the
character conversion should work the same on all platforms,
shouldn't it?

I thought that was also handled in the locale code. But now that you say
it, it's probably not - it should be the same.
I'll withdraw my comment then, clearly wrong. Thanks for clearing that
up.

//Magnus

#4Marcus Raphelt
marcus@raphelt.de
In reply to: Peter Eisentraut (#2)

Hi,

PE> PostgreSQL does not support UTF8 locales on Windows, but the character
PE> conversion should work the same on all platforms, shouldn't it?

It works pretty fine even under win32, all umlauts are coverted as
expexted, except for the "ᅵ" (ß).
For example, "Köln" converts to "Kᅵln", but ""Groß Kreutz" results
in an empty string.

--
Mit freundlichen Grᅵssen
Marcus Raphelt mailto:marcus@raphelt.de

#5John Hansen
john@geeknet.com.au
In reply to: Marcus Raphelt (#4)

Magnus Hagander Wrote:

PostgreSQL 8.0 does not support UTF8 on Win32. You will need to
convert the file before you load it into the database -

look at for

example the GNU iconv program to do this.

PostgreSQL does not support UTF8 locales on Windows, but

the character

conversion should work the same on all platforms, shouldn't it?

I thought that was also handled in the locale code. But now
that you say it, it's probably not - it should be the same.
I'll withdraw my comment then, clearly wrong. Thanks for
clearing that up.

Currently, it is very much so handled by the systems locale support,
hence it is handled differently depending on platform.

ICU support should make it to 8.1 I hear, which will make the bahaviour
uniform.

... John

Show quoted text

//Magnus

---------------------------(end of
broadcast)---------------------------
TIP 6: explain analyze is your friend

#6Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#2)
Re: BUG #1775: UTF8 to ISO does not convert the german

Magnus Hagander wrote:

PostgreSQL 8.0 does not support UTF8 on Win32. You will need to
convert the file before you load it into the database - look at for
example the GNU iconv program to do this.

PostgreSQL does not support UTF8 locales on Windows, but the character
conversion should work the same on all platforms, shouldn't it?

Right.

BTW the original bug report sounds interesting to me. The conversion
between UTF-8 and ISO 8859-1 is handled by a program logic, not by
conversion tables. So if german umlauts are converted fine, there's no
reason the conversion for german sharp s does not work.

Marcus,

Can you give me the exact error message from PostgreSQL when the
conversio failed?
--
Tatsuo Ishii

#7Marcus Raphelt
marcus@raphelt.de
In reply to: Tatsuo Ishii (#6)

Hi Tatsuo,

Am Mittwoch, 20. Juli 2005 um 01:00 schrieben Sie:

TI> conversion tables. So if german umlauts are converted fine, there's no
TI> reason the conversion for german sharp s does not work.

TI> Marcus,

TI> Can you give me the exact error message from PostgreSQL when the
TI> conversio failed?

Well - actually, there IS no error message, convert() just returns an
empty sting. In detail, here's what I did:

-Set up PostgreSQL (Ascii)

-Imported the opengeodb
http://sourceforge.net/project/showfiles.php?group_id=132421
As they offer a PostgreSQL dump, I chose this one.

-Imported the dump using pgAdminIII

-Created a view that returns all german cities with ZIP and
opengeodb-locationID:

CREATE OR REPLACE VIEW orte_de AS
SELECT code.text_val AS plz, code.loc_id, town.text_val AS ort
FROM geodb_hierarchies hi, geodb_textdata state, geodb_textdata town, geodb_textdata code
WHERE hi.id_lvl2 = state.loc_id AND state.text_val = 'DE'::text
AND state.text_type = 500100001 AND town.loc_id = hi.loc_id
AND town.text_type = 500100000 AND code.loc_id = town.loc_id
AND code.text_type = 500300000;

So now I've got three columns: "plz" (zip), "ort" (city) and loc_id.
Assuming I want to retrieve cites in the Hamburg / Hannover area...
(perfect for this task, as they have pretty strange city names there
:)

SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as
ort_conv
from orte_de
where plz between 20000 and 30000
order by ort_conv

This query returns empty values for "ort_conv" if "ort" contains a
sharp s.
Btw, it seems like it is the same for "ᅵ" (Ä), have a look at
loc_id 25182.

--
Mit freundlichen Grᅵssen
Marcus Raphelt mailto:marcus@raphelt.de

#8Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Marcus Raphelt (#7)
Re: BUG #1775: UTF8 to ISO does not convert the german

Hi Tatsuo,

Am Mittwoch, 20. Juli 2005 um 01:00 schrieben Sie:

TI> conversion tables. So if german umlauts are converted fine, there's no
TI> reason the conversion for german sharp s does not work.

TI> Marcus,

TI> Can you give me the exact error message from PostgreSQL when the
TI> conversio failed?

Well - actually, there IS no error message, convert() just returns an
empty sting. In detail, here's what I did:

-Set up PostgreSQL (Ascii)

-Imported the opengeodb
http://sourceforge.net/project/showfiles.php?group_id=132421
As they offer a PostgreSQL dump, I chose this one.

-Imported the dump using pgAdminIII

-Created a view that returns all german cities with ZIP and
opengeodb-locationID:

CREATE OR REPLACE VIEW orte_de AS
SELECT code.text_val AS plz, code.loc_id, town.text_val AS ort
FROM geodb_hierarchies hi, geodb_textdata state, geodb_textdata town, geodb_textdata code
WHERE hi.id_lvl2 = state.loc_id AND state.text_val = 'DE'::text
AND state.text_type = 500100001 AND town.loc_id = hi.loc_id
AND town.text_type = 500100000 AND code.loc_id = town.loc_id
AND code.text_type = 500300000;

So now I've got three columns: "plz" (zip), "ort" (city) and loc_id.
Assuming I want to retrieve cites in the Hamburg / Hannover area...
(perfect for this task, as they have pretty strange city names there
:)

SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as
ort_conv
from orte_de
where plz between 20000 and 30000
order by ort_conv

This query returns empty values for "ort_conv" if "ort" contains a
sharp s.
Btw, it seems like it is the same for "���" (Ä), have a look at
loc_id 25182.

I see no problem with encoding conversion itself:

utf8=# \encoding latin1
utf8=# select * from t1;
t
----

(1 row)

SELECT plz, loc_id, ort, convert(ort using utf_8_to_iso_8859_1) as
ort_conv
from orte_de
where plz between 20000 and 30000
order by ort_conv

Problem here is the result of convert(ort using utf_8_to_iso_8859_1)
is ISO-8859-1 but your database encoding is UTF-8, so the terminal
treats ISO-8859-1 chars as UTF-8 which will result in unexpected
characters displayed.

I guess what you want to do is:

SELECT plz, loc_id, ort from orte_de
where plz between 20000 and 30000
order by convert(ort using utf_8_to_iso_8859_1)
--
Tatsuo Ishii

#9Marcus Raphelt
marcus@raphelt.de
In reply to: Tatsuo Ishii (#8)

Hi,

Am Mittwoch, 20. Juli 2005 um 10:27 schrieben Sie:

TI> Problem here is the result of convert(ort using utf_8_to_iso_8859_1)
TI> is ISO-8859-1 but your database encoding is UTF-8,

Not exactly, my database is in iso_8859_1, only the opengeodb-dump is
solely available in utf-8.
The whole thing is no problem so far, as I can convert the values in
the application that uses the data - I just wanted to let you know.

TI> I guess what you want to do is:
TI> SELECT plz, loc_id, ort from orte_de
TI> where plz between 20000 and 30000
TI> order by convert(ort using utf_8_to_iso_8859_1)

Well, that was a fantasy-query :) just to show some "odd" values, I do
not need that much data in the application. Currently, I cannot
retrieve ready-converted iso data from the orte_de view, as converted
names that contain "ß" are not shown...

--
Mit freundlichen Grᅵssen
Marcus Raphelt mailto:marcus@raphelt.de

#10Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Marcus Raphelt (#9)
Re: BUG #1775: UTF8 to ISO does not convert the german

TI> Problem here is the result of convert(ort using utf_8_to_iso_8859_1)
TI> is ISO-8859-1 but your database encoding is UTF-8,

Not exactly, my database is in iso_8859_1, only the opengeodb-dump is
solely available in utf-8.
The whole thing is no problem so far, as I can convert the values in
the application that uses the data - I just wanted to let you know.

That's surprising. Since you try to convert from UTF-8 to ISO 8859-1,
I thought you store data as UTF-8.

TI> I guess what you want to do is:
TI> SELECT plz, loc_id, ort from orte_de
TI> where plz between 20000 and 30000
TI> order by convert(ort using utf_8_to_iso_8859_1)

Well, that was a fantasy-query :) just to show some "odd" values, I do
not need that much data in the application. Currently, I cannot
retrieve ready-converted iso data from the orte_de view, as converted
names that contain "ß" are not shown...

Let me make sure. You store UTF-8 data in ISO 8859-1 DB?
--
Tatsuo Ishii

#11Bruno Wolff III
bruno@wolff.to
In reply to: John Hansen (#5)

On Wed, Jul 20, 2005 at 08:35:33 +1000,
John Hansen <john@geeknet.com.au> wrote:

ICU support should make it to 8.1 I hear, which will make the bahaviour
uniform.

There was some discussion about that, but my impression was that that
wasn't going to happen.

#12Magnus Hagander
magnus@hagander.net
In reply to: Bruno Wolff III (#11)

ICU support should make it to 8.1 I hear, which will make the
bahaviour uniform.

There was some discussion about that, but my impression was
that that wasn't going to happen.

AFAIK the result was that it would go in as an alternative, but not to
*replace* what's there now.

//Magnus

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#12)

"Magnus Hagander" <mha@sollentuna.net> writes:

ICU support should make it to 8.1 I hear, which will make the
bahaviour uniform.

There was some discussion about that, but my impression was
that that wasn't going to happen.

AFAIK the result was that it would go in as an alternative, but not to
*replace* what's there now.

The whole issue is still TBD as far as I know --- there were arguments
on both sides but not consensus, and if any committer has reviewed the
ICU patch in any detail, I didn't hear about it.

It'd probably be easier to get consensus for a "make it a compile-time
option" proposal than for a "wholesale replacement" proposal.

regards, tom lane

#14Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#13)

ICU support should make it to 8.1 I hear, which will make the
bahaviour uniform.

There was some discussion about that, but my impression

was that that

wasn't going to happen.

AFAIK the result was that it would go in as an alternative,

but not to

*replace* what's there now.

The whole issue is still TBD as far as I know --- there were
arguments on both sides but not consensus, and if any
committer has reviewed the ICU patch in any detail, I didn't
hear about it.

Naturally it has to pass the standard requirements for an acceptable
patch. But IIRC the early version was checked over (I think by Bruce).
Not in details, but as an overview. I fully realise that doesn't mean
it's approved, but it's a step ;-)

It'd probably be easier to get consensus for a "make it a
compile-time option" proposal than for a "wholesale
replacement" proposal.

My point exactly. Considering it's vital for e.g. win32 to get unicode,
and apparantly pretty good for freebsd people if I read Palle right,
whereas other platforms have what they need already...

//Magnus