help understanding collation order

Started by rafabout 15 years ago7 messagesgeneral
Jump to latest
#1raf
raf@raf.org

hi,

postgresl-8.4.6 (database encoding is utf8)

the following sql behaves strangely on one of my servers:

create table ttt(id serial not null primary key, name text);
insert into ttt (name) values ('CLARKE, DAVID');
insert into ttt (name) values ('CLARK, PETER');
insert into ttt (name) values ('CLARKE');
insert into ttt (name) values ('CLARK');
select * from ttt order by name;
drop table ttt;

the behaviour i expect (and see on macosx-10.6.6) is:

id | name
----+---------------
4 | CLARK
2 | CLARK, PETER
3 | CLARKE
1 | CLARKE, DAVID

the behaviour i don't expect but see anyway (on debian-5.0) is:

id | name
----+---------------
4 | CLARK
3 | CLARKE
1 | CLARKE, DAVID
2 | CLARK, PETER

the "good" server has lc_messages='en_AU' and the
"bad" server has lc_messages="en_AU.utf8" which may
be relevant but i can't successfully alter these
values (i.e. postgres doesn't restart after i try)
so i'm hoping it's not relevant.

can anyone explain what i've done wrong here and
suggest what i can do to fix it?

cheers,
raf

p.s. the postgres installation was PostgresPlus/8.4SS that
was subsequently upgraded via postgres one click installers.

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: raf (#1)
Re: help understanding collation order

raf <raf@raf.org> writes:

the behaviour i expect (and see on macosx-10.6.6) is:

id | name
----+---------------
4 | CLARK
2 | CLARK, PETER
3 | CLARKE
1 | CLARKE, DAVID

the behaviour i don't expect but see anyway (on debian-5.0) is:

id | name
----+---------------
4 | CLARK
3 | CLARKE
1 | CLARKE, DAVID
2 | CLARK, PETER

the "good" server has lc_messages='en_AU' and the
"bad" server has lc_messages="en_AU.utf8" which may
be relevant

No, not particularly. Sort order is determined by lc_collate
not lc_messages. Unfortunately it's entirely possible that OSX
will give you a different sort order than Linux even for similarly
named lc_collate settings. About the only lc_collate setting that
really behaves the same everywhere, guaranteed, is "C" ... and that
might or might not do what you want. (C locale does satisfy the
above example but it's hard to be sure what you want in general;
and if you are using any non-ASCII characters, C locale will more
than likely not be very satisfactory.)

regards, tom lane

#3Andrew Sullivan
ajs@crankycanuck.ca
In reply to: Tom Lane (#2)
Re: help understanding collation order

On Mon, Jan 17, 2011 at 02:19:14PM -0500, Tom Lane wrote:

No, not particularly. Sort order is determined by lc_collate
not lc_messages. Unfortunately it's entirely possible that OSX
will give you a different sort order than Linux even for similarly
named lc_collate settings. About the only lc_collate setting that
really behaves the same everywhere, guaranteed, is "C"

This is partly because of the generous room allowed for linguistic
tailoring in the Unicode standard on collation. In case one really
wants to have a bad day, I can suggest reading
http://unicode.org/reports/tr10/ carefully. But they make an
important point there, which is that collation rules work by language,
not by script. As I understand things, in Postgres's case it's partly
a matter of how strongly your OS cleaves to the locale conventions
that determines how this will work. (Note that not every database
system relies on the underlying OS's facilities the way Postgres does;
some have an independent collation mechanism.)

Unicode does maintain a locale data repository:
http://cldr.unicode.org/. You might be able to figure out which of
your systems is not playing nice and complain to the OS vendor.

A

--
Andrew Sullivan
ajs@crankycanuck.ca

#4raf
raf@raf.org
In reply to: Tom Lane (#2)
Re: help understanding collation order

Tom Lane wrote:

raf <raf@raf.org> writes:

the behaviour i expect (and see on macosx-10.6.6) is:

id | name
----+---------------
4 | CLARK
2 | CLARK, PETER
3 | CLARKE
1 | CLARKE, DAVID

the behaviour i don't expect but see anyway (on debian-5.0) is:

id | name
----+---------------
4 | CLARK
3 | CLARKE
1 | CLARKE, DAVID
2 | CLARK, PETER

the "good" server has lc_messages='en_AU' and the
"bad" server has lc_messages="en_AU.utf8" which may
be relevant

No, not particularly. Sort order is determined by lc_collate
not lc_messages. Unfortunately it's entirely possible that OSX
will give you a different sort order than Linux even for similarly
named lc_collate settings. About the only lc_collate setting that
really behaves the same everywhere, guaranteed, is "C" ... and that
might or might not do what you want. (C locale does satisfy the
above example but it's hard to be sure what you want in general;
and if you are using any non-ASCII characters, C locale will more
than likely not be very satisfactory.)

regards, tom lane

thanks. "C" will have to do, i suppose.
that and/or re-sort in the client.

cheers,
raf

p.s. if anyone in debian locale land is listening,
'E' does not sort before ','. what were you thinking? :-)

#5Peter Eisentraut
peter_e@gmx.net
In reply to: raf (#4)
Re: help understanding collation order

On tis, 2011-01-18 at 10:33 +1100, raf wrote:

p.s. if anyone in debian locale land is listening,
'E' does not sort before ','. what were you thinking? :-)

What is actually happening is that the punctuation is sorted in a second
pass after the letters. Which is both correct according to the relevant
standards and also practical in many situations.

It's usually actually the Mac OS X locales that are broken.

#6raf
raf@raf.org
In reply to: Peter Eisentraut (#5)
Re: help understanding collation order

Peter Eisentraut wrote:

On tis, 2011-01-18 at 10:33 +1100, raf wrote:

p.s. if anyone in debian locale land is listening,
'E' does not sort before ','. what were you thinking? :-)

What is actually happening is that the punctuation is sorted in a second
pass after the letters. Which is both correct according to the relevant
standards and also practical in many situations.

i have no doubt that what you say is true just as i have no doubt that
it is also incorrect and impractical in every situation i'll ever encounter.
i can't imagine ever wanting CLARK to sort both before and after CLARKE.
i'll just re-sort all name-ordered reports in the client. pity.

It's usually actually the Mac OS X locales that are broken.

only when ignoring the principle of least astonishment. :-)

but it's not the same locale on the two hosts. the macosx
locale is en_AU. the debian locale is en_AU.utf8 so i don't
think they can be compared for brokenness purposes in this case.
they may both be correct.

cheers,
raf

#7raf
raf@raf.org
In reply to: raf (#6)
Re: help understanding collation order

raf wrote:

Peter Eisentraut wrote:

On tis, 2011-01-18 at 10:33 +1100, raf wrote:

p.s. if anyone in debian locale land is listening,
'E' does not sort before ','. what were you thinking? :-)

What is actually happening is that the punctuation is sorted in a second
pass after the letters. Which is both correct according to the relevant
standards and also practical in many situations.

i have no doubt that what you say is true just as i have no doubt that
it is also incorrect and impractical in every situation i'll ever encounter.
i can't imagine ever wanting CLARK to sort both before and after CLARKE.

i'll just re-sort all name-ordered reports in the client. pity.

better idea: i'll separate the family name from the personal
names using regexp_replace() and use them separately in the
order by clause so the sorting can remain in postgres where
it belongs without any commas getting in the way.

cheers,
raf