'order by' does "wrong" with unicode-chars (german umlauts)

Started by Peter Pilslover 22 years ago8 messagesgeneral
Jump to latest
#1Peter Pilsl
pilsl@goldfisch.at

postgres 7.3.2

I store unicode-data in postgresql. The data is retrieved via webinterfaces,
processed with perl and then stored in postgresql (and viceversa).

All is going nice with one problem. If performing a "select * order by
field"-query the result is not what I expected.

German umlauts (ie: �) are stored as doublechars ("�" is "Ö") and only the
first char seems to be taken into account when sorting.

So it happens that the order is like:

�sterreich
America
Pakistan

instead of

Amerika
�sterreich
Pakistan

How to deal with this Problem ? Of course converting to latin before storing
would be a solution but we plan to offer support for many non-latin languages
later and the meaning of unicode is to get rid of all this converting-stuff
after all.

thnx,
peter

--
IT-Consulting
mag. peter pilsl
tel:+43-699-1-3574035
fax:+43-699-4-3574035
pilsl@goldfisch.at
http://www.goldfisch.at

#2Richard Huxton
dev@archonet.com
In reply to: Peter Pilsl (#1)
Re: 'order by' does "wrong" with unicode-chars (german umlauts)

On Friday 19 September 2003 13:11, peter pilsl wrote:

postgres 7.3.2

I store unicode-data in postgresql. The data is retrieved via
webinterfaces, processed with perl and then stored in postgresql (and
viceversa).

All is going nice with one problem. If performing a "select * order by
field"-query the result is not what I expected.

German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ö") and only the
first char seems to be taken into account when sorting.

So it happens that the order is like:

Österreich
America
Pakistan

instead of

Amerika
Österreich
Pakistan

I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the postgresql.conf
file for details of what your current settings are.

How to deal with this Problem ? Of course converting to latin before
storing would be a solution but we plan to offer support for many non-latin
languages later and the meaning of unicode is to get rid of all this
converting-stuff after all.

What sorting-order do you want? You probably have options like: C, en_GB, de,
it etc.
If you are storing non-latin1 symbols as well as latin1, I can't think of what
a reasonable sort order would be.

Unfortunately, PG only supports one locale at a time, and gets set during
initdb. See the chapter on Localisation in the manuals for details.

--
Richard Huxton
Archonet Ltd

#3Peter Pilsl
pilsl@goldfisch.at
In reply to: Richard Huxton (#2)
Re: 'order by' does "wrong" with unicode-chars (german umlauts)

I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the postgresql.conf
file for details of what your current settings are.

I dont think that this is my problem.

I get my text from a web-form, process it via perl and store it in postgreSQL
via DBI-Interface. The unicode-text appears as multibyte in perl and I got the
suspect that postgresql simply takes this multibyte-text and doesnt even
reckognize that it could be unicode.

If I store a german-umlaut-O (uppercase) to postgres and then retrieve it using
the lower-function on it I dont get a german-umlaut-o (lowercase) at all.
Only the first byte is converted to lowercase and the second is left untouched,
while in "real" unicode-lowercasing the first byte would stay untouched and the
second would change.
I still dont know how to tell postgres that the data it receives is unicode and
not just "singlebyte".

I'll rethink my problem and post a somehow more precise question to the mainlist
then, but any comments to shorten and improve my rethinking are highly welcome.

thnx,
peter

#4Richard Huxton
dev@archonet.com
In reply to: Peter Pilsl (#3)
Re: 'order by' does "wrong" with unicode-chars (german umlauts)

On Saturday 20 September 2003 13:56, peter pilsl wrote:

I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the
postgresql.conf file for details of what your current settings are.

I dont think that this is my problem.

Sorry - looks like the sorting part of your question threw me off track.

I get my text from a web-form, process it via perl and store it in
postgreSQL via DBI-Interface. The unicode-text appears as multibyte in perl
and I got the suspect that postgresql simply takes this multibyte-text and
doesnt even reckognize that it could be unicode.

Could be the case - try "show client_encoding" in psql to see what encoding
you are using.

If I store a german-umlaut-O (uppercase) to postgres and then retrieve it
using the lower-function on it I dont get a german-umlaut-o (lowercase) at
all. Only the first byte is converted to lowercase and the second is left
untouched, while in "real" unicode-lowercasing the first byte would stay
untouched and the second would change.
I still dont know how to tell postgres that the data it receives is unicode
and not just "singlebyte".

If it turns out you want to change encoding to multibyte, I think you'll need
to dump an initdb again. See the chapter on localization - multi-byte
encodings for details.

--
Richard Huxton
Archonet Ltd

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Pilsl (#3)
Re: 'order by' does "wrong" with unicode-chars (german umlauts)

peter pilsl <pilsl@goldfisch.at> writes:

I get my text from a web-form, process it via perl and store it in postgreSQL
via DBI-Interface. The unicode-text appears as multibyte in perl and I got the
suspect that postgresql simply takes this multibyte-text and doesnt even
reckognize that it could be unicode.

If you have set the database encoding as SQL_ASCII, then that's exactly
what it will (and should) do. You need to make the database encoding
be unicode.

regards, tom lane

#6Peter Pilsl
pilsl@goldfisch.at
In reply to: Tom Lane (#5)
Re: 'order by' does "wrong" with unicode-chars (german umlauts)

It makes no difference if I use a database with encoding unicode:

# \l
List of databases
Name | Owner | Encoding
--------------------+--------------+-----------
<skip>
test | peter | SQL_ASCII
unicode | peter | UNICODE
unicode2 | peter | LATIN1

I tried with all these databases. The problem stays exactely the same. The
german umlaut-O appears as double-byte "ö" and when ordering it appears between
"A" and "B", cause imho only the first byte "�" is taking into account.

I still have no idea, if there is a problem outside postgreSQL (in perl or maybe
in the DBD::Pg-interface) or inside postgreSQL. I still struggle with serious
debugging cause I dont know how to insert "real" unicode into postgres ... Any
debugging-hints welcome also :)

Dealing with the UNICODE-database raises new problems:
unicode=# insert into test values ('�sterreich');
ERROR: Unicode >= 0x10000 is not supported
unicode=# show client_encoding;
client_encoding
-----------------
UNICODE
(1 row)

thnx,
peter

Quoting Tom Lane <tgl@sss.pgh.pa.us>:

Show quoted text

peter pilsl <pilsl@goldfisch.at> writes:

I get my text from a web-form, process it via perl and store it in

postgreSQL

via DBI-Interface. The unicode-text appears as multibyte in perl and I got

the

suspect that postgresql simply takes this multibyte-text and doesnt even
reckognize that it could be unicode.

If you have set the database encoding as SQL_ASCII, then that's exactly
what it will (and should) do. You need to make the database encoding
be unicode.

regards, tom lane

#7Andreas Hinz
news3@winopticc.dk
In reply to: Peter Pilsl (#6)
Re: 'order by' does "wrong" with unicode-chars (german umlauts)

Sat, 20 Sep 2003 18:39:35 +0200
skrev pilsl@goldfisch.at (peter pilsl):

It makes no difference if I use a database with encoding unicode:

ah@ahb:~$ LC_ALL=da_DK initdb
ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah"
ah@ahb:~$ createdb ah
ah@ahb:~$ psql ah

ah=# \l

List of databases
Name | Owner | Encoding
-----------+----------+-----------
ah | ah | SQL_ASCII

ah=# show client_encoding;

client_encoding
-----------------
SQL_ASCII
(1 row)

CREATE TABLE test (f1 varchar);

INSERT INTO test VALUES ('A');
INSERT INTO test VALUES ('B');
INSERT INTO test VALUES ('AA');
INSERT INTO test VALUES ('Æ');
INSERT INTO test VALUES ('Å');
INSERT INTO test VALUES ('Ø');
INSERT INTO test VALUES ('Ä');
INSERT INTO test VALUES ('Ö');
INSERT INTO test VALUES ('Ü');

SELECT * FROM test ORDER BY f1;

t
----
A
B
Ü
Æ
Ä
Ø
Ö
Å
AA
(9 rows)

Looks OK to me ;-)

#8Kathy Zhu
Kathy.Zhu@Sun.COM
In reply to: Andreas Hinz (#7)
Re: 'order by' does "wrong" with unicode-chars (german umlauts)

I think it is an expected result.
As far as I understand it, LC_ALL determines the collation/sorting rule, and
unicode allows you to store and retrieve multibytes character which has nothing
to do with the soring rules.

kathy

X-Original-To: pgsql-general-postgresql.org@localhost.postgresql.org
Date: Sat, 20 Sep 2003 22:53:54 +0200
From: Andreas Hinz <news3@winopticc.dk>
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] 'order by' does "wrong" with unicode-chars (german

umlauts)

X-Virus-Scanned: by amavisd-new at postgresql.org
X-Mailing-List: pgsql-general
X-AntiAbuse: This header was added to track abuse, please include it with any

abuse report

X-AntiAbuse: Primary Hostname - noon.pghoster.com
X-AntiAbuse: Original Domain - sun.com
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - postgresql.org
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by amon.Central.Sun.COM id

h8M2O1c23732

Show quoted text

Sat, 20 Sep 2003 18:39:35 +0200
skrev pilsl@goldfisch.at (peter pilsl):

It makes no difference if I use a database with encoding unicode:

ah@ahb:~$ LC_ALL=da_DK initdb
ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah"
ah@ahb:~$ createdb ah
ah@ahb:~$ psql ah

ah=# \l

List of databases
Name | Owner | Encoding
-----------+----------+-----------
ah | ah | SQL_ASCII

ah=# show client_encoding;

client_encoding
-----------------
SQL_ASCII
(1 row)

CREATE TABLE test (f1 varchar);

INSERT INTO test VALUES ('A');
INSERT INTO test VALUES ('B');
INSERT INTO test VALUES ('AA');
INSERT INTO test VALUES ('Æ');
INSERT INTO test VALUES ('Å');
INSERT INTO test VALUES ('Ø');
INSERT INTO test VALUES ('Ä');
INSERT INTO test VALUES ('Ö');
INSERT INTO test VALUES ('Ü');

SELECT * FROM test ORDER BY f1;

t
----
A
B
Ü
Æ
Ä
Ø
Ö
Å
AA
(9 rows)

Looks OK to me ;-)

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings