sort order

Started by Marc Millasover 4 years ago4 messagesgeneral
Jump to latest
#1Marc Millas
marc.millas@mokadb.com

Hi,

context:
one postgres 12 on centos 7
one postgres 12 on windows 10
both on machines with french as default
the centos 7 lc_collate and lc_ctype: fr_FR.UTF_8
the w10 lc_collate and lc_ctype: French_France.1252

create table test (ble text, id serial primary key);
insert into test(ble) values(' ');
insert into test(ble) values('Marc');
insert into test(ble) values(' Marc');
insert into test(ble) values('marc');
insert into test(ble) values(' marc');
insert into test(ble) values('bobo');
insert into test(ble) values(' bobo'):
insert into test(ble) values('élise');

differences include a french é character, and some white spaces at the
beginning.

then select * from test order by ble;

centos result:
ble | id
-------+----
| 3
bobo | 2
bobo | 1
élise | 6
marc | 5
marc | 4
Marc | 8
Marc | 7
(8 lignes)

w10 result:
ble | id
-------+----
| 3
bobo | 1
marc | 4
Marc | 7
élise | 8
bobo | 2
marc | 5
Marc | 6
(8 lignes)

so, obviously, both lc_collate knows about the é
but obviously, too, they do behave differently on the impact of the
beginning white space.

I didn't see anything about this behaviour on the doc, unless the reference
at the libc should be understood as please read and test libc doc on each
platform.
So my first question is: why ?
My second question is: how to make the centos postgres behave like the w10
one ??
ie. knowing about french characters AND taking beginning white spaces into
account ?

thanks,

regards,

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

#2Marc Millas
marc.millas@mokadb.com
In reply to: Marc Millas (#1)
Re: sort order

Re-reading my post, I see that even the élise is not sorted correctly on
w10...

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

On Tue, Jul 27, 2021 at 7:07 PM Marc Millas <marc.millas@mokadb.com> wrote:

Show quoted text

Hi,

context:
one postgres 12 on centos 7
one postgres 12 on windows 10
both on machines with french as default
the centos 7 lc_collate and lc_ctype: fr_FR.UTF_8
the w10 lc_collate and lc_ctype: French_France.1252

create table test (ble text, id serial primary key);
insert into test(ble) values(' ');
insert into test(ble) values('Marc');
insert into test(ble) values(' Marc');
insert into test(ble) values('marc');
insert into test(ble) values(' marc');
insert into test(ble) values('bobo');
insert into test(ble) values(' bobo'):
insert into test(ble) values('élise');

differences include a french é character, and some white spaces at the
beginning.

then select * from test order by ble;

centos result:
ble | id
-------+----
| 3
bobo | 2
bobo | 1
élise | 6
marc | 5
marc | 4
Marc | 8
Marc | 7
(8 lignes)

w10 result:
ble | id
-------+----
| 3
bobo | 1
marc | 4
Marc | 7
élise | 8
bobo | 2
marc | 5
Marc | 6
(8 lignes)

so, obviously, both lc_collate knows about the é
but obviously, too, they do behave differently on the impact of the
beginning white space.

I didn't see anything about this behaviour on the doc, unless the
reference at the libc should be understood as please read and test libc doc
on each platform.
So my first question is: why ?
My second question is: how to make the centos postgres behave like the w10
one ??
ie. knowing about french characters AND taking beginning white spaces into
account ?

thanks,

regards,

Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Millas (#2)
Re: sort order

Marc Millas <marc.millas@mokadb.com> writes:

Re-reading my post, I see that even the élise is not sorted correctly on
w10...

Sort orders aren't even particularly guaranteed on different releases
of the same platform, let alone totally different platforms. glibc
made major changes to their collation rules not long ago:

https://wiki.postgresql.org/wiki/Locale_data_changes

At one time we thought that migrating to ICU sorting might bring
a little more stability to this, but I think what we've found so
far is that it's just different :-(

regards, tom lane

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Marc Millas (#1)
Re: sort order

On 27.07.21 19:07, Marc Millas wrote:

so, obviously, both lc_collate knows about the é
but obviously, too, they do behave differently on the impact of the
beginning white space.

I didn't see anything about this behaviour on the doc, unless the
reference at the libc should be understood as please read and test libc
doc on each platform.
So my first question is: why ?
My second question is: how to make the centos postgres behave like the
w10 one ??
ie. knowing about french characters AND taking beginning white spaces
into account ?

There are multiple standard ways to deal with space and punctuation
characters when sorting. See
<https://unicode-org.github.io/icu/userguide/collation/customization/ignorepunct.html&gt;
for a description. Not all collation providers implement all of them,
but the behavior you happen to get is usually one of them. The centos 7
behavior corresponds to "shift-trimmed", the Windows one appears to
match "non-ignorable". If you want to get that latter one on Linux as
well, you can use the ICU locales, which also default to non-ignorable.
For example

select * from test order by ble collate "fr-x-icu";

matches your Windows output for me.