collation & UTF-8

Started by Tomi NA, over 20 years ago, 4 messages, general

hefest@gmail.com

over 20 years ago

I'm using PostgreSQL 8.1.2 on Linux and want to load UTF-8 encoded varchars.
While I can store and get at stored text correctly, the ORDER BY places all
accented characters (Croatian, in this case - probably marked hr_HR) after
non-accented characters.
This is no showstopper, but it does affect the general perception of
application quality.

Now, I've seen the issue mentioned in a number of places, but often with
fairly old versions of pgsql (<8.0), in different circumstances etc. so my
question is:
is there an official way to set up UTF8 collation so that "SELECT first_name
FROM persons ORDER BY first_name" works as expected?

TIA,
Tomislav

Martijn van Oosterhout

kleptog@svana.org

over 20 years ago

In reply to: Tomi NA (#1)

Re: collation & UTF-8

On Fri, Feb 24, 2006 at 06:23:07PM +0100, Tomi NA wrote:

> I'm using PostgreSQL 8.1.2 on Linux and want to load UTF-8 encoded varchars.
> While I can store and get at stored text correctly, the ORDER BY places all
> accented characters (Croatian, in this case - probably marked hr_HR) after
> non-accented characters.
> This is no showstopper, but it does affect the general perception of
> application quality.

Collation is a function of the OS. Basically, is the locale of your
database set up for UTF-8 collation? It would probably be called
hr_HR.UTF-8.

> is there an official way to set up UTF8 collation so that "SELECT first_name
> FROM persons ORDER BY first_name" works as expected?

Yes, set up the locale correctly. In general, PostgreSQL should give the
same results as sort(1) on the command line. Use that to experiment.

LC_ALL=hr_HR.UTF-8 sort < input > output
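
The experiment suggested above can be sketched as a small shell function that sorts the same input under two locales. The sample names and the function name `sort_names` are made up for illustration; only `sort` and the `LC_ALL` variable come from the message itself.

```shell
# Sort a fixed list of sample names under the locale given as $1.
# The names are invented test data, not taken from the thread.
sort_names() {
    printf '%s\n' 'Zoran' 'Čedomir' 'Ana' 'Šime' 'Davor' | LC_ALL="$1" sort
}

# Byte-order collation: the accented initials land after Z.
# Prints Ana, Davor, Zoran, Čedomir, Šime (one per line).
sort_names C

# Croatian collation, if that locale is installed on the machine.
sort_names hr_HR.UTF-8
```

If both calls print the same order, the Croatian locale is most likely not being applied, which is the symptom described in the original question.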

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

Tomi NA

hefest@gmail.com

over 20 years ago

In reply to: Martijn van Oosterhout (#2)

Re: collation & UTF-8

On 2/24/06, Martijn van Oosterhout <kleptog@svana.org> wrote:

> On Fri, Feb 24, 2006 at 06:23:07PM +0100, Tomi NA wrote:
>
> > I'm using PostgreSQL 8.1.2 on Linux and want to load UTF-8 encoded varchars.
> > While I can store and get at stored text correctly, the ORDER BY places all
> > accented characters (Croatian, in this case - probably marked hr_HR) after
> > non-accented characters.
> > This is no showstopper, but it does affect the general perception of
> > application quality.
>
> Collation is a function of the OS. Basically, is the locale of your
> database set up for UTF-8 collation? It would probably be called
> hr_HR.UTF-8.

You were right about this:
LC_ALL=hr_HR.UTF-8 sort < test.txt
(seemingly) collates the same way that pgsql does. Accented letters at the
end of the alphabet. I've tried hr_HR.UTF8 as well, without results.
Btw, my database is created with
CREATE DATABASE mydb
WITH OWNER = postgres
ENCODING = 'UTF8'
TABLESPACE = pg_default;

> Yes, set up the locale correctly. In general, PostgreSQL should give the
> same results as sort(1) on the command line. Use that to experiment.
>
> LC_ALL=hr_HR.UTF-8 sort < input > output

I'm very sorry to report it does not work. :(
Btw,
set | grep LC_
returns nothing...is this a possible source of the problem?
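
A note on the question above: in PostgreSQL 8.1, the release discussed here, the collation locale is fixed for the whole cluster when `initdb` runs, and `ENCODING` in `CREATE DATABASE` only selects the character encoding. The value worth checking is therefore the server's own setting rather than the shell's `LC_*` variables. A minimal sketch, assuming `psql` is on the PATH and can connect; the function name `show_lc_collate` is hypothetical:

```shell
# Print the collation locale the server was initialised with.
# $1 is the database to connect to; any database in the cluster will do.
show_lc_collate() {
    psql -d "$1" -At -c 'SHOW lc_collate;'
}

# Usage (needs a running server, so it is left commented out):
#   show_lc_collate mydb
```

If this reports `C` or some other non-Croatian locale, the cluster would have to be re-created with something like `initdb --locale=hr_HR.UTF-8` and the data reloaded, since 8.1 has no way to change the collation of an existing cluster.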

Tomislav

Tom Lane

tgl@sss.pgh.pa.us

over 20 years ago

In reply to: Tomi NA (#3)

Re: collation & UTF-8

Tomi NA <hefest@gmail.com> writes:

> You were right about this:
> LC_ALL=hr_HR.UTF-8 sort < test.txt
> (seemingly) collates the same way that pgsql does. Accented letters at the
> end of the alphabet. I've tried hr_HR.UTF8 as well, without results.

If you're not sure what locales are available on your system, run
"locale -a". I don't think "sort" will complain about an unknown locale
setting; it'll probably just fall back to the "C" locale.
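
The `locale -a` check can be wrapped in a small test, sketched here under the assumption that the `locale` utility is available. The function name `have_locale` is made up for illustration. The comparison ignores case and hyphens because glibc typically lists the locale as `hr_HR.utf8` even when it is requested as `hr_HR.UTF-8`.

```shell
# Succeed if the locale named in $1 appears in the output of `locale -a`.
have_locale() {
    # Lower-case and drop hyphens so "hr_HR.UTF-8" matches "hr_HR.utf8".
    want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed 's/-//g' \
        | grep -qxF -- "$want"
}

if have_locale hr_HR.UTF-8; then
    echo "hr_HR.UTF-8 is installed"
else
    echo "hr_HR.UTF-8 is not installed; sort may quietly use the C locale"
fi
```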

regards, tom lane