Inconsistent string comparison using modified ICU collations

Started by Oleg Tselebrovskiy12 months ago4 messages

o.tselebrovskiy@postgrespro.ru

12 months ago

Greetings, everyone!

I've discovered a bug with string comparison using modified ICU
collations
Using a direct comparison and sorting values gives different results

The easiest way to reproduce is the following:

postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider =
icu, locale = 'en-US-u-kr-latn-digit');
CREATE COLLATION
postgres=# select ('a' < '0' collate "en-US-u-kr-latn-digit-x-icu");
?column?
----------
f
(1 row)
postgres=# select * from (values ('0'),('a')) t(x) order by x collate
"en-US-u-kr-latn-digit-x-icu";
x
---
a
0
(2 rows)

Why does this happen:
In the first example of simple comparison, function varstr_cmp is called
and it
uses ucol_strcoll[UTF8] function to compare two strings, and it seems to
ignore
reordering of character groups in collation;
In the second example of sorting values, function varstr_abbrev_convert
is called
and somewhere deep it uses ucol_getSortKey/ucol_nextSortKeyPart to
transform
source string to SortKey and this transformation takes reordering of
character groups into account

Other way to reproduce this behaviour is to create table, insert the
data into it,
create btree index over this table and sometimes you wouldn't get the
data that is
definitely in the table (if you force postgres to use seqscan the query
works):

postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider =
icu, locale = 'en-US-u-kr-latn-digit');
create table test (col text COLLATE "en-US-u-kr-latn-digit-x-icu");
insert into test values ('a'), ('0');
create index test_idx ON test USING btree (col);
set enable_seqscan = off;
select * from test where col = 'a';
select * from test where col = '0';
CREATE COLLATION
CREATE TABLE
INSERT 0 2
CREATE INDEX
SET
col
-----
(0 rows)

col
-----
(0 rows)

This happens because selecting one row triggers a binary search over the
index,
rows in the index are stored according to modified collation (letters
before digits),
since the creating of index triggers sorting of rows and, therefore,
usage of
ucol_getSortKey/ucol_nextSortKeyPart. But the WHERE filter uses
ucol_strcoll[UTF-8],
that doesn't take modified collation into account

This can be reproduced from REL_13_STABLE up to the current master
(41084409f635453efce03f1114880189b4f6ce4c)

I've opened an issue in ICU Jira[1]https://unicode-org.atlassian.net/browse/ICU-23016 where I have reproduced this
behaviour using
minimal C code

To compose the collation name I have read and used an article by Peter
Eisentraut
on ICU collation settings[2]http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings

Unfortunately, I don't have any proposed solution for this issue, but I
thought
it was important to highlight it

Oleg Tselebrovskiy, Postgres Pro

[1]: https://unicode-org.atlassian.net/browse/ICU-23016
[2]: http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings
http://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings

Jeff Davis

pgsql@j-davis.com

12 months ago

In reply to: Oleg Tselebrovskiy (#1)

Re: Inconsistent string comparison using modified ICU collations

On Wed, 2025-01-22 at 12:03 +0300, Oleg Tselebrovskiy wrote:

In the second example of sorting values, function
varstr_abbrev_convert

Thank you for the report!

I had previously proposed[1]/messages/by-id/d8454f6156956f5991cf9afd2f28978ed8187ca7.camel@j-davis.com a GUC to control abbreviated keys, and it
was rejected. Now that we've seen both a performance problem and a bug
in the underlying dependency, perhaps we should reconsider.

Regards,
Jeff Davis

[1]: /messages/by-id/d8454f6156956f5991cf9afd2f28978ed8187ca7.camel@j-davis.com
/messages/by-id/d8454f6156956f5991cf9afd2f28978ed8187ca7.camel@j-davis.com

Daniel Verite

daniel@manitou-mail.org

12 months ago

In reply to: Oleg Tselebrovskiy (#1)

Re: Inconsistent string comparison using modified ICU collations

Oleg Tselebrovskiy wrote:

I've discovered a bug with string comparison using modified ICU
collations
Using a direct comparison and sorting values gives different results

The easiest way to reproduce is the following:

postgres=# create collation "en-US-u-kr-latn-digit-x-icu" (provider
=
icu, locale = 'en-US-u-kr-latn-digit');

This looks similar to bug #15285:
/messages/by-id/153201618542.1404.3611626898935613264@wrigleys.postgresql.org

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Jeff Davis

pgsql@j-davis.com

12 months ago

In reply to: Jeff Davis (#2)

Re: Inconsistent string comparison using modified ICU collations

On Thu, 2025-01-23 at 13:29 -0800, Jeff Davis wrote:

I had previously proposed[1] a GUC to control abbreviated keys, and
it
was rejected. Now that we've seen both a performance problem and a
bug
in the underlying dependency, perhaps we should reconsider.

Users may:

* trust libc strxfrm, or not
* trust ICU getSortKey & nextSortKeyPart, or not
* want to enable/disable abbreviated keys for testing (perf or
correctness)

And should we make the default the same as the current behavior --
trust ICU but not libc -- or should we trust neither by default?

We might need a few new GUCs here, unless someone has a better idea. I
suppose we could tie it to individual collations, but that seems worse.

Regards,
Jeff Davis