Sorting Problem in UNICODE/german

Started by Klaus Itaover 20 years ago6 messagesbugs
Jump to latest
#1Klaus Ita
postgres@stro.at

Hi there!

I have a Problem with a DB that was created in UNICODE

* createdb -E UNICODE

and actually shows that it _is_ in UNICODE.
i was able to input data and can read it and everything is fine.
but when i want to "ORDER BY ..." it does not sort the german Umlauts at the
correct postition.

should be:
m n o ��� p

and is:
��� a b c d

I have tried starting postgres with LC_ALL=de_AT.utf8@euro
locale but that did not help.

what xould i do?

regs,
klaus

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Klaus Ita (#1)
Re: Sorting Problem in UNICODE/german

Klaus Ita <postgres@stro.at> writes:

I have tried starting postgres with LC_ALL=de_AT.utf8@euro
locale but that did not help.

You need to run initdb under that setting. See "Localization" in
the documentation.

regards, tom lane

#3Klaus Ita
postgres@stro.at
In reply to: Tom Lane (#2)
Re: Sorting Problem in UNICODE/german

On Thu, Sep 01, 2005 at 09:30:15AM -0400, Tom Lane wrote:

Klaus Ita <postgres@stro.at> writes:

I have tried starting postgres with LC_ALL=de_AT.utf8@euro
locale but that did not help.

i did read the docs and am still not quite happy with my sorting results.
ok initdb has been rerun

made sure, i had the locale:

locale -a

created new db-cluster with
LC_ALL=de_AT.utf8@euro initdb --locale=de_AT.utf8@euro -E UNICODE -D /dev/shm/pgutf8

and then still the sorting was not right when i restored another UNICODE db.

another "funny" thing is:

ita@aipc54:~/.mutt$ LC_ALL=de_AT.utf8@euro sort /tmp/testfile
Abend
Oma
���terreich
���erflieger
Unter
Zetrix

this is also wrong (There should be 'Unter' and then 'U:berflieger' [���berflieger]). so is this a libc bug?

thank you for your help so far! I more than appreciate it. Support for this
DB is sooo much better than for oracle!

klaus

#4Andreas Seltenreich
seltenreich@gmx.de
In reply to: Klaus Ita (#3)
Re: Sorting Problem in UNICODE/german

Klaus Ita schrob:

On Thu, Sep 01, 2005 at 09:30:15AM -0400, Tom Lane wrote:

Klaus Ita <postgres@stro.at> writes:

I have tried starting postgres with LC_ALL=de_AT.utf8@euro
locale but that did not help.

i did read the docs and am still not quite happy with my sorting results.
ok initdb has been rerun

made sure, i had the locale:

locale -a

created new db-cluster with
LC_ALL=de_AT.utf8@euro initdb --locale=de_AT.utf8@euro -E UNICODE -D /dev/shm/pgutf8

and then still the sorting was not right when i restored another
UNICODE db.

Well, I used the very same command with 8.0.3 to create a database,
and the sort order was correct:

--8<---------------cut here---------------start------------->8---
scratch=# select w from w order by w;
w
-------------
Abend
Oma
�sterreich
�berflieger
Unter
Zetrix
(6 rows)
--8<---------------cut here---------------end--------------->8---

So I guess there was some misconfiguration of your current
client_encoding during import, or maybe the dump of your unicode db
got unexpectedly converted by improper settings during dumping.

another "funny" thing is:

ita@aipc54:~/.mutt$ LC_ALL=de_AT.utf8@euro sort /tmp/testfile
Abend
Oma
�terreich
�erflieger
Unter
Zetrix

this is also wrong (There should be 'Unter' and then 'U:berflieger'
[�berflieger]). so is this a libc bug?

The sort order is correct, so libc did succeed in its part. Maybe your
terminal is having issues with utf-8? If you're using xterm: Did you
run it with -u8 or some utf-8-enabling X-resource? To verify that the
terminal is working properly, typing

echo � > /tmp/foo
file /tmp/foo

on a shell should tell you that you have a utf-8 text file.

HTH
Andreas
--

#5Andreas Seltenreich
seltenreich@gmx.de
In reply to: Andreas Seltenreich (#4)
Re: Sorting Problem in UNICODE/german

Sorry, I just reread your mail: Your MUA is declaring it with

Content-Type: text/plain; charset=unknown-8bit

This makes it even harder to discuss problems with Umlauts :-).

Andreas Seltenreich schrob:

Klaus Ita schrob:

another "funny" thing is:

ita@aipc54:~/.mutt$ LC_ALL=de_AT.utf8@euro sort /tmp/testfile
Abend
Oma
�terreich
�erflieger
Unter
Zetrix

this is also wrong (There should be 'Unter' and then 'U:berflieger'
[�berflieger]). so is this a libc bug?

I think I got your point now. Libc appears to be using iso-14651
sorting for all "de" locales. I'm afraid you will have compile a
customized locale to depart from that.

regards,
Andreas
--

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andreas Seltenreich (#5)
Re: Sorting Problem in UNICODE/german

Andreas Seltenreich <andreas+pg@gate450.dyndns.org> writes:

Klaus Ita schrob:

this is also wrong (There should be 'Unter' and then 'U:berflieger'
[�berflieger]). so is this a libc bug?

I think I got your point now. Libc appears to be using iso-14651
sorting for all "de" locales. I'm afraid you will have compile a
customized locale to depart from that.

I wouldn't call it a libc bug, but a bug in the locale definition.
In any case it doesn't appear to be Postgres' problem --- if we sort
the same way "sort" does under the same locale setting, then we are
doing what we expect.

I think at this point Klaus needs to find some people who know about
hacking locale definitions. I sure don't know enough about them to
help further. Is there a libc mailing list anywhere?

One thing I do know --- if you install a new version of the locale
Postgres is using, you'd better re-initdb, or at least REINDEX all
your indexes on textual columns. Changing sort order is equivalent
to making such indexes corrupt.

regards, tom lane