Losing my latin on Ordering...

Started by Dominique Devienneabout 3 years ago7 messagesgeneral

ddevienne@gmail.com

about 3 years ago

Hi. Porting a unit test to PostgreSQL, we got a failure related to ordering.

We've distilled it to the below. The DB is en_US.UTF-8, and the sorting we
get
does not make sense to me. The same prefix can be sorted differently based
on the suffix apprently, which doesn't make any sense to me.

Surely sorting should be "constant left-to-right", no? What are we missing?

I'm already surprised (star) comes before (space), when the latter "comes
before" the former in both ASCII and UTF-8, but that the two "Foo*" and
"Foo " prefixed pairs are not clustered after sorting is just mistifying to
me. So how come?

For now we can work-around this by explicitly adding the `collate "C"` on
the queries underlying that particular test, but that would be wrong in the
general case of international strings to sort, so I'd really like to
understand what's going on.

Thanks, --DD

PS: if I try "en_US.UTF-8" or "en_US"."UTF-8" for the collate, it fails.
How come what pg_database.datcollate displays is not a valid value for
collate?

PPS: We tried on v12 and v13 I believe. Somehow my v14.2 on Windows doesn't
have en_US as a collation...

ddevienne=> with t(v) as (values ('Foo All'), ('Foo*'), ('Foo Brief'),
('Foo*All')) select '<'||v||'>', v::bytea from t order by v collate "en_US";
?column? | v
-------------+----------------------
<Foo*> | \x466f6f2a
<Foo All> | \x466f6f20416c6c
<Foo*All> | \x466f6f2a416c6c
<Foo Brief> | \x466f6f204272696566
(4 rows)

ddevienne=> with t(v) as (values ('Foo All'), ('Foo*'), ('Foo Brief'),
('Foo*All')) select '<'||v||'>', v::bytea from t order by v collate "C";
?column? | v
-------------+----------------------
<Foo All> | \x466f6f20416c6c
<Foo Brief> | \x466f6f204272696566
<Foo*> | \x466f6f2a
<Foo*All> | \x466f6f2a416c6c
(4 rows)

Laurenz Albe

laurenz.albe@cybertec.at

about 3 years ago

In reply to: Dominique Devienne (#1)

Re: Losing my latin on Ordering...

On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:

Hi. Porting a unit test to PostgreSQL, we got a failure related to ordering.

We've distilled it to the below. The DB is en_US.UTF-8, and the sorting we get
does not make sense to me. The same prefix can be sorted differently based on
the suffix apprently, which doesn't make any sense to me.

Surely sorting should be "constant left-to-right", no? What are we missing?

No, it isn't. That's not how natural language collations work.

They typically use different levels of comparison: first, strings are sorted
according to base character, ignoring accents, case and punctuation.
Wherever that comparison is equal, the next level is used (typically accents),
then the next (case), and so on.

I'm already surprised (star) comes before (space), when the latter "comes
before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
prefixed pairs are not clustered after sorting is just mistifying to me. So how come?

Because they compare identical on the first three levels. Any difference in
letters, accents or case weighs stronger, even if it occurs to the right
of these substrings.

For now we can work-around this by explicitly adding the `collate "C"` on
the queries underlying that particular test, but that would be wrong in the
general case of international strings to sort, so I'd really like to understand
what's going on.

Yes, it soulds like the "C" collation may be best for you. That is, if you don't
mind that "Z" < "a".

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Dominique Devienne

ddevienne@gmail.com

about 3 years ago

In reply to: Laurenz Albe (#2)

Re: Losing my latin on Ordering...

On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.albe@cybertec.at>
wrote:

On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:

Hi. Porting a unit test to PostgreSQL, we got a failure related to

ordering.

We've distilled it to the below. The DB is en_US.UTF-8, and the sorting

we get

does not make sense to me. The same prefix can be sorted differently

based on

the suffix apprently, which doesn't make any sense to me.

Surely sorting should be "constant left-to-right", no? What are we

missing?

No, it isn't. That's not how natural language collations work.

Honestly, who expects the same prefix to sort differently based on what
comes after, in left-to-right languages?
How does one even find out what the (capricious?) rules for sorting in a
given collation are?

I'm aware of sorting taking numerical numbers in text influencing sort, so
"Foo10" comes after "Foo9",
but that's not what we are discussing here. "Foo*" and "Foo " have no
logical relatioship, like 9 and 10 do.

I'm already surprised (star) comes before (space), when the latter "comes
before" the former in both ASCII and UTF-8, but that the two "Foo*" and

"Foo "

prefixed pairs are not clustered after sorting is just mistifying to me.

So how come?

Because they compare identical on the first three levels. Any difference
in
letters, accents or case weighs stronger, even if it occurs to the right
of these substrings.

That's completely unintuitive...

For now we can work-around this by explicitly adding the `collate "C"` on
the queries underlying that particular test, but that would be wrong in

the

general case of international strings to sort, so I'd really like to

understand

what's going on.

Yes, it soulds like the "C" collation may be best for you. That is, if
you don't
mind that "Z" < "a".

I would mind if I asked for case-insensitive comparisons.

So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?

Alvaro Herrera

alvherre@2ndquadrant.com

about 3 years ago

In reply to: Dominique Devienne (#3)

Re: Losing my latin on Ordering...

On 2023-Feb-14, Dominique Devienne wrote:

Honestly, who expects the same prefix to sort differently based on what
comes after, in left-to-right languages?

Look, we don't define the collation rules. We just grab the collation
rules defined by experts in collations. In this case the experts have
advised the glibc developers to write collations this way; but even if
you went further and looked at the ICU libraries, you would find that
they have pretty much the same definition.

How does one even find out what the (capricious?) rules for sorting in a
given collation are?

I suggest to look at a telephone book carefully sometime (provided you
can find one ... apparently nobody wants them anymore.)

So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?

Sure, just make sure to use the definition of C that uses UTF-8 encoding
(I think it's typically called C.UTF-8).

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

Laurenz Albe

laurenz.albe@cybertec.at

about 3 years ago

In reply to: Dominique Devienne (#3)

Re: Losing my latin on Ordering...

On Tue, 2023-02-14 at 12:17 +0100, Dominique Devienne wrote:

On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:

On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:

Surely sorting should be "constant left-to-right", no? What are we missing?

No, it isn't. That's not how natural language collations work.

Honestly, who expects the same prefix to sort differently based on what comes
after, in left-to-right languages?
How does one even find out what the (capricious?) rules for sorting in a given
collation are?

Look at the documentation / implementation.

As far as ICU is concerned, here: https://unicode.org/reports/tr10/

I'm already surprised (star) comes before (space), when the latter "comes
before" the former in both ASCII and UTF-8, but that the two "Foo*" and "Foo "
prefixed pairs are not clustered after sorting is just mistifying to me. So how come?

Because they compare identical on the first three levels. Any difference in
letters, accents or case weighs stronger, even if it occurs to the right
of these substrings.

That's completely unintuitive...

Well, you can complain to GNU and the Unicode consortium, but that's pretty
much the way it is.

Yes, it soulds like the "C" collation may be best for you. That is, if you don't
mind that "Z" < "a".

I would mind if I asked for case-insensitive comparisons.

So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?

Yes, exactly.

Yours,
Laurenz Albe

Dominique Devienne

ddevienne@gmail.com

about 3 years ago

In reply to: Alvaro Herrera (#4)

Re: Losing my latin on Ordering...

On Tue, Feb 14, 2023 at 12:35 PM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2023-Feb-14, Dominique Devienne wrote:

Honestly, who expects the same prefix to sort differently based on what
comes after, in left-to-right languages?

Look, we don't define the collation rules.

Ok, ok, sorry. To you, Laurenz, and everyone.
I obviously disagree with these rules, but I'm a nobody, so who cares :)

So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?

Sure, just make sure to use the definition of C that uses UTF-8 encoding
(I think it's typically called C.UTF-8).

OK, so for new DBs, sounds like we need to

CREATE DATABASE ... WITH LOCALE 'C.UTF-8' ENCODING UTF8

Correct?

But what about existing DBs? Can the collation be changed a posteriori?
ALTER DATABASE does not seem to support the same options.

We don't want to have to sprinkle COLLATE "C" all over the place in the
code.
And there are quite a few DBs out there already. What to do about them?

Laurenz Albe

laurenz.albe@cybertec.at

about 3 years ago

In reply to: Dominique Devienne (#6)

Re: Losing my latin on Ordering...

On Tue, 2023-02-14 at 13:06 +0100, Dominique Devienne wrote:

Sure, just make sure to use the definition of C that uses UTF-8 encoding
(I think it's typically called C.UTF-8).

OK, so for new DBs, sounds like we need to

CREATE DATABASE ... WITH LOCALE 'C.UTF-8' ENCODING UTF8

Correct?

Collations are identifiers, so it has to be double quotes.
The name depends on the operating system; if that is Unix-like,
you can run "locale -a" to get all available locales.

On my system it would be

CREATE DATABASE x TEMPLATE template0 LOCALE "C.utf8" ENCODING UTF8;

But what about existing DBs? Can the collation be changed a posteriori?
ALTER DATABASE does not seem to support the same options.

We don't want to have to sprinkle COLLATE "C" all over the place in the code.
And there are quite a few DBs out there already. What to do about them?

The only option is dump/restore.

A changed collation means changed indexes, so there is no better option.

Yours,
Laurenz Albe