utf-8 and cultural sensitive sorting

Started by Nonameover 20 years ago4 messagesgeneral
Jump to latest
#1Noname
sknipe@tucows.com

Our product will be storing its character data in utf-8
format (unicode encoding).

What is the best way to achive cultural sensitive sorting
using the utf-8 data?

Is it possible have the locale apply to a connection?

If so, is the cultural sorting support mature in
PostgreSQL?

What type of performance can be expected as compared with
the normal c locale sorting?

Thanks very much,

Steve.

#2Richard Huxton
dev@archonet.com
In reply to: Noname (#1)
Re: utf-8 and cultural sensitive sorting

sknipe@tucows.com wrote:

Our product will be storing its character data in utf-8 format (unicode
encoding).

What is the best way to achive cultural sensitive sorting using the
utf-8 data?

See below.

Is it possible have the locale apply to a connection?

A locale applies to a whole database cluster. That's one locale only.

If so, is the cultural sorting support mature in PostgreSQL?

What type of performance can be expected as compared with the normal c
locale sorting?

Maturity and performance depend upon the underlying libraries, which
depend upon your O.S. - I recommend testing.

There has been discussion of how to support mixed locales, possibly even
within the same column but I don't know whether anyone is working on
this at the present time. The mailing list archives will contain plenty
of discussion on this.

Does that help?
--
Richard Huxton
Archonet Ltd

#3Alex Stapleton
alexs@advfn.com
In reply to: Noname (#1)
Re: utf-8 and cultural sensitive sorting

It depends what language you want to sort. Lots of languages do not
have a sort alphabet. For example, Japanese. It can be quite
difficult to sort unusual languages like this. I am not aware of any
standard technique for sorting Japanese text other than keeping an
arbitrarily sorted dictionary (courtesy of whatever the most popular
Japanese dictionary at the time happens to be perhaps) and then doing
hash lookups in the for indexing values. As you can imagine, this is
not particularly fast. I have not actually tried this, but I expect
PosgreSQL will simply sort in a fairly binary fashion. As in, it gets
sorted in according to the binary value of the characters, or the
UTF-8 offsets, or something like that.

On 12 Jul 2005, at 15:48, <sknipe@tucows.com> <sknipe@tucows.com> wrote:

Show quoted text

Our product will be storing its character data in utf-8 format
(unicode encoding).

What is the best way to achive cultural sensitive sorting using the
utf-8 data?

Is it possible have the locale apply to a connection?

If so, is the cultural sorting support mature in PostgreSQL?

What type of performance can be expected as compared with the
normal c locale sorting?

Thanks very much,

Steve.

---------------------------(end of
broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that
your
message can get through to the mailing list cleanly

#4Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Alex Stapleton (#3)
Re: utf-8 and cultural sensitive sorting

It depends what language you want to sort. Lots of languages do not
have a sort alphabet. For example, Japanese. It can be quite
difficult to sort unusual languages like this. I am not aware of any
standard technique for sorting Japanese text other than keeping an
arbitrarily sorted dictionary (courtesy of whatever the most popular
Japanese dictionary at the time happens to be perhaps) and then doing
hash lookups in the for indexing values. As you can imagine, this is
not particularly fast. I have not actually tried this, but I expect
PosgreSQL will simply sort in a fairly binary fashion. As in, it gets
sorted in according to the binary value of the characters, or the
UTF-8 offsets, or something like that.

Above is almost correct but usually sorting by the JIS code order is
enough for most Japanese applications (I believe same thing can be
said to Chinese). I do not recommend using locale for sorting
Japanese. It quite frequently happens that the locale support for
multibyte encodings is totally broken. See recent posting titled
"[GENERAL] Japanese words not distinguished" for more details.

If you have to live with UTF-8 database, I recommend turning off the
locale support and use CONVERT to sort Japanese. For example,

SELECT * FROM t1 ORDER BY CONVERT(col1 USING utf_8_to_euc_jp);

Show quoted text

On 12 Jul 2005, at 15:48, <sknipe@tucows.com> <sknipe@tucows.com> wrote:

Our product will be storing its character data in utf-8 format
(unicode encoding).

What is the best way to achive cultural sensitive sorting using the
utf-8 data?

Is it possible have the locale apply to a connection?

If so, is the cultural sorting support mature in PostgreSQL?

What type of performance can be expected as compared with the
normal c locale sorting?

Thanks very much,

Steve.

---------------------------(end of
broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that
your
message can get through to the mailing list cleanly

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend