sort order
Hi,
context:
one postgres 12 on centos 7
one postgres 12 on windows 10
both on machines with french as default
the centos 7 lc_collate and lc_ctype: fr_FR.UTF_8
the w10 lc_collate and lc_ctype: French_France.1252
create table test (ble text, id serial primary key);
insert into test(ble) values(' ');
insert into test(ble) values('Marc');
insert into test(ble) values(' Marc');
insert into test(ble) values('marc');
insert into test(ble) values(' marc');
insert into test(ble) values('bobo');
insert into test(ble) values(' bobo'):
insert into test(ble) values('élise');
differences include a french é character, and some white spaces at the
beginning.
then select * from test order by ble;
centos result:
ble | id
-------+----
| 3
bobo | 2
bobo | 1
élise | 6
marc | 5
marc | 4
Marc | 8
Marc | 7
(8 lignes)
w10 result:
ble | id
-------+----
| 3
bobo | 1
marc | 4
Marc | 7
élise | 8
bobo | 2
marc | 5
Marc | 6
(8 lignes)
so, obviously, both lc_collate knows about the é
but obviously, too, they do behave differently on the impact of the
beginning white space.
I didn't see anything about this behaviour on the doc, unless the reference
at the libc should be understood as please read and test libc doc on each
platform.
So my first question is: why ?
My second question is: how to make the centos postgres behave like the w10
one ??
ie. knowing about french characters AND taking beginning white spaces into
account ?
thanks,
regards,
Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com
Re-reading my post, I see that even the élise is not sorted correctly on
w10...
Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com
On Tue, Jul 27, 2021 at 7:07 PM Marc Millas <marc.millas@mokadb.com> wrote:
Show quoted text
Hi,
context:
one postgres 12 on centos 7
one postgres 12 on windows 10
both on machines with french as default
the centos 7 lc_collate and lc_ctype: fr_FR.UTF_8
the w10 lc_collate and lc_ctype: French_France.1252create table test (ble text, id serial primary key);
insert into test(ble) values(' ');
insert into test(ble) values('Marc');
insert into test(ble) values(' Marc');
insert into test(ble) values('marc');
insert into test(ble) values(' marc');
insert into test(ble) values('bobo');
insert into test(ble) values(' bobo'):
insert into test(ble) values('élise');differences include a french é character, and some white spaces at the
beginning.then select * from test order by ble;
centos result:
ble | id
-------+----
| 3
bobo | 2
bobo | 1
élise | 6
marc | 5
marc | 4
Marc | 8
Marc | 7
(8 lignes)w10 result:
ble | id
-------+----
| 3
bobo | 1
marc | 4
Marc | 7
élise | 8
bobo | 2
marc | 5
Marc | 6
(8 lignes)so, obviously, both lc_collate knows about the é
but obviously, too, they do behave differently on the impact of the
beginning white space.I didn't see anything about this behaviour on the doc, unless the
reference at the libc should be understood as please read and test libc doc
on each platform.
So my first question is: why ?
My second question is: how to make the centos postgres behave like the w10
one ??
ie. knowing about french characters AND taking beginning white spaces into
account ?thanks,
regards,
Marc MILLAS
Senior Architect
+33607850334
www.mokadb.com
Marc Millas <marc.millas@mokadb.com> writes:
Re-reading my post, I see that even the élise is not sorted correctly on
w10...
Sort orders aren't even particularly guaranteed on different releases
of the same platform, let alone totally different platforms. glibc
made major changes to their collation rules not long ago:
https://wiki.postgresql.org/wiki/Locale_data_changes
At one time we thought that migrating to ICU sorting might bring
a little more stability to this, but I think what we've found so
far is that it's just different :-(
regards, tom lane
On 27.07.21 19:07, Marc Millas wrote:
so, obviously, both lc_collate knows about the é
but obviously, too, they do behave differently on the impact of the
beginning white space.I didn't see anything about this behaviour on the doc, unless the
reference at the libc should be understood as please read and test libc
doc on each platform.
So my first question is: why ?
My second question is: how to make the centos postgres behave like the
w10 one ??
ie. knowing about french characters AND taking beginning white spaces
into account ?
There are multiple standard ways to deal with space and punctuation
characters when sorting. See
<https://unicode-org.github.io/icu/userguide/collation/customization/ignorepunct.html>
for a description. Not all collation providers implement all of them,
but the behavior you happen to get is usually one of them. The centos 7
behavior corresponds to "shift-trimmed", the Windows one appears to
match "non-ignorable". If you want to get that latter one on Linux as
well, you can use the ICU locales, which also default to non-ignorable.
For example
select * from test order by ble collate "fr-x-icu";
matches your Windows output for me.