Wrong sorting on docker image

Started by Alexander Voytsekhovskyyover 4 years ago4 messagesgeneral
Jump to latest
#1Alexander Voytsekhovskyy
young.inbox@gmail.com

Greetings

Starting from version 12.0 official docker image switched from Debian-stretch to Debian-bullseye and from that point we have a huge pain with sorting issues on Russian collation.

Dockerfile:

FROM postgres:14
RUN apt-get clean && apt-get update && apt-get install -y locales
RUN localedef -i ru_RU -c -f UTF-8 -A /usr/share/locale/locale.alias ru_RU.UTF-8
ENV LANG ru_RU.utf8

postgres=# select version();
version
-----------------------------------------------------------------------------------------------------------------------------
PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

Issue:

postgres=# SELECT * FROM unnest(ARRAY ['ея', 'ёа']) name ORDER BY name;
name
------
ёа
ея
(2 строки)

еqя should go before ёqа

postgres=# SELECT 'ея' COLLATE "ru_RU" < 'ёа' COLLATE "ru_RU";
?column?
----------
f
(1 строка)

And should be TRUE here

Any idea how to fix that?

We are not able manage this for 3 years already (((

#2Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Alexander Voytsekhovskyy (#1)
Re: Wrong sorting on docker image

On 2021-10-16 13:50:31 +0300, Oleksandr Voytsekhovskyy wrote:

Starting from version 12.0 official docker image switched from Debian-stretch
to Debian-bullseye and from that point we have a huge pain with sorting issues
on Russian collation.

[...]

Issue:

postgres=# SELECT * FROM unnest(ARRAY ['ея', 'ёа']) name ORDER BY name;
name
------
ёа
ея
(2 строки)

еqя should go before ёqа

Same with the sort command in the shell.

postgres=# SELECT 'ея' COLLATE "ru_RU" < 'ёа' COLLATE "ru_RU";
?column?
----------
f
(1 строка)

And should be TRUE here

Any idea how to fix that?

Since the collation is defined by the OS (or rather its C library)
I think this should be reported to Debian or possibly the glibc
maintainers.

PostgreSQL can also use the ICU locales instead of those provided by the
OS. Have you tried that?

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Voytsekhovskyy (#1)
Re: Wrong sorting on docker image

Oleksandr Voytsekhovskyy <young.inbox@gmail.com> writes:

Starting from version 12.0 official docker image switched from Debian-stretch to Debian-bullseye and from that point we have a huge pain with sorting issues on Russian collation.

Yeah, Debian versions after stretch adopted the significant glibc locale
data changes (sorting rule changes) that are discussed at [1]https://wiki.postgresql.org/wiki/Locale_data_changes.

ея should go before ёа

I'm not qualified to have an opinion on that point, but one would hope
that the glibc people who changed the sorting rules are qualified.
If you disagree, you need to go discuss it with glibc. Postgres doesn't
define any text sorting rules, we just use what libc or ICU tells us.

Speaking of ICU, if you are using an ICU-enabled Postgres build,
maybe you could find an ICU collation that acts the way you want.
This wouldn't be a perfect solution, because we don't yet have
the ability to set an ICU collation as a database's default.
But you can attach ICU collations to individual text columns,
and maybe that would be a good enough workaround.

regards, tom lane

[1]: https://wiki.postgresql.org/wiki/Locale_data_changes

#4Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#3)
Re: Wrong sorting on docker image

On Sun, Oct 17, 2021 at 4:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Speaking of ICU, if you are using an ICU-enabled Postgres build,
maybe you could find an ICU collation that acts the way you want.
This wouldn't be a perfect solution, because we don't yet have
the ability to set an ICU collation as a database's default.
But you can attach ICU collations to individual text columns,
and maybe that would be a good enough workaround.

For what it's worth, ICU's "ru-RU-x-icu" and FreeBSD's libc agree with
glibc on these sort orders, so I suspect this might be coming from
CLDR/UCA/DUCET/ISO 14651 common/synchronised data. It does look quite
suspicious to me, but I don't know Russian and I'm only speculating
wildly here: it does look as if ё is perhaps getting a lower weight
than it should. That said, it seems strange that something so basic
should be wrong. Nosing around in the unicode.org issue tracker, it
seems as though some people might think there is something funny about
Ё (and I wonder if there are/were similar issues with й/Й):

https://unicode-org.atlassian.net/browse/CLDR-2745?jql=text%20~%20%22%D0%81%22
https://unicode-org.atlassian.net/browse/CLDR-1974?jql=text%20~%20%22%D0%81%22
(and more)

It's probably not a great idea, but for the record, you can build your
own collation for glibc and other POSIX-oid systems. For example, see
glibc commit 159738548130d5ac4fe6178977e940ed5f8cfdc4, where they
previously had customisations on top of the iso14651_t1 file to
reorder a special Ukrainian character in ru_RU, so in theory you could
reorder ё/Ё with a similar local hack and call it ru_RU_X... I also
wonder if there is some magic switch you can put after an @ symbol on
ICU collations that would change this, perhaps some way to disable the
"contractions" that are potentially implicated here. Not sure.