Supporting non-deterministic collations with tailoring rules.

Started by Todd Lang, 4 months ago, 4 messages
#1 Todd Lang
Todd.Lang@D2L.com

Reposting this here from the Discord server as requested:

When creating a collation, in https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/pg_locale_icu.c#L461 it is opening the collator with the tailoring rules supplied. However, it has hardcoded the strength level UCOL_DEFAULT_STRENGTH. This has the effect of ignoring the "deterministic=false" you may have specified in your CREATE COLLATION call. If, instead of UCOL_DEFAULT_STRENGTH, the code understood the deterministic parameter and passed either UCOL_PRIMARY for "deterministic=true" or UCOL_SECONDARY for "deterministic=false", this would preserve the attempt to obtain case-insensitivity in the locale while simultaneously allowing tailoring as expected.

I made this modification to pg_locale_icu.c and tested it locally (simply hardcoding UCOL_SECONDARY, without checking the deterministic parameter), and it behaves as expected, though I freely admit my knowledge of how ICU intersects with Postgres is rather limited.

#2 Daniel Verite
daniel@manitou-mail.org
In reply to: Todd Lang (#1)
1 attachment(s)
Re: Supporting non-deterministic collations with tailoring rules.

Todd Lang wrote:

When creating a collation, in
https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/pg_locale_icu.c#L461
it is opening the collator with the tailoring rules supplied. However, it
has hardcoded the strength level UCOL_DEFAULT_STRENGTH. This has the effect
of ignoring the "deterministic=false" you may have specified in your CREATE
COLLATION call.

This is related to BUG #18771, previously reported at [1],
where the reporter notes that passing UCOL_DEFAULT works
for him whereas UCOL_DEFAULT_STRENGTH does not.
It looks like a documentation bug in ICU [2].
It says:

strength: The default collation strength; one of UCOL_PRIMARY,
UCOL_SECONDARY, UCOL_TERTIARY, UCOL_IDENTICAL,UCOL_DEFAULT_STRENGTH
- can be also set in the rules.

But UCOL_DEFAULT_STRENGTH is an alias for UCOL_TERTIARY, so passing
it forces tertiary strength.
UCOL_DEFAULT is what should normally be passed to not override the
collation strength.

Now, "it works" here means that the strength expressed in the rule
(with rules = '[strength 1]' in the case of the OP) takes effect.
This syntax is described at [3] (see the "Rule Syntax" column).
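
To make the difference between the two constants concrete, here is a small self-contained C sketch. It does not call ICU: the enum values are copied by hand from unicode/ucol.h (an assumption to verify against your installed headers), and effective_strength() is a hypothetical helper that only models the behaviour described in this thread, namely that UCOL_DEFAULT leaves the strength from the rules alone while any other value overrides it.

```c
/* Values copied by hand from ICU's unicode/ucol.h (UColAttributeValue).
 * This file deliberately does not include ICU; check the numbers
 * against your installed headers before relying on them. */
enum
{
    UCOL_DEFAULT = -1,
    UCOL_PRIMARY = 0,
    UCOL_SECONDARY = 1,
    UCOL_TERTIARY = 2,
    UCOL_DEFAULT_STRENGTH = UCOL_TERTIARY   /* an alias, not "no override" */
};

/*
 * Hypothetical helper, NOT an ICU function: models which strength ends
 * up in effect when ucol_openRules() is given strength_arg and the rule
 * string itself asks for rule_strength.
 */
static int
effective_strength(int rule_strength, int strength_arg)
{
    /* UCOL_DEFAULT means "do not override what the rules say" */
    if (strength_arg == UCOL_DEFAULT)
        return rule_strength;

    /* any explicit level replaces the strength set in the rules */
    return strength_arg;
}
```

With rule_strength set to UCOL_PRIMARY (what '[strength 1]' asks for), effective_strength(UCOL_PRIMARY, UCOL_DEFAULT_STRENGTH) gives UCOL_TERTIARY, whereas effective_strength(UCOL_PRIMARY, UCOL_DEFAULT) keeps UCOL_PRIMARY.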

There is a second problem: when the strength is specified in
the locale and not in the rules (as you did), it would also
be expected to take effect. That does not appear to be the case;
it is as if the rules were resetting the collation settings.
As mentioned in the thread at [1], Peter Eisentraut has submitted
this as a bug [4], but there hasn't been any follow-up to it
in 2.5 years.

If, instead of UCOL_DEFAULT_STRENGTH, the code understood the
deterministic parameter and passed either UCOL_PRIMARY for
"deterministic=true" or UCOL_SECONDARY for "deterministic=false",
this would preserve the attempt to obtain case-insensitivity in the
locale while simultaneously allowing tailoring as expected.

We can't hardcode that deterministic=false implies that the strength
is 2. deterministic=false only says that the collation can have equal
strings that are not binary-equal.
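
As a rough illustration of why determinism and strength are separate knobs, here is a toy C sketch. It assumes the usual description of deterministic collations in Postgres (strings the collator considers equal are tie-broken byte-wise) and uses an ASCII-only, case-ignoring comparison as a stand-in for an ICU collator. toy_collate() and compare_strings() are hypothetical names, not Postgres or ICU functions.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for a collator that ignores case (ASCII only). */
static int
toy_collate(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        a++;
        b++;
    }
    if (*a != '\0')
        return 1;               /* a is longer, so it sorts after b */
    if (*b != '\0')
        return -1;              /* b is longer, so a sorts first */
    return 0;
}

/*
 * Hypothetical wrapper: the deterministic flag only decides whether a
 * tie from the collator is broken byte-wise. It says nothing about
 * which differences (case, accents, ...) the collator itself ignores.
 */
static int
compare_strings(const char *a, const char *b, bool deterministic)
{
    int result = toy_collate(a, b);

    if (result == 0 && deterministic)
        result = strcmp(a, b);  /* byte-wise tie-break */
    return result;
}
```

Here compare_strings("abc", "ABC", false) returns 0 even though the two strings are not binary-equal, while the same call with deterministic set to true returns a nonzero value. Swapping toy_collate() for a comparison that ignores more or fewer differences changes the effective strength without touching the flag.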

To me, the most plausible fix on the Postgres side would be to pass
UCOL_DEFAULT instead of UCOL_DEFAULT_STRENGTH as in the attached,
which lets the user specify the strength in the rule, as the OP did in [1].

[1]: /messages/by-id/18771-98bb23e455b0f367@postgresql.org
[2]: https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucol_8h.html#a0cb1ddd81f322ed24e389f208eb35c8a
[3]: https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
[4]: https://unicode-org.atlassian.net/browse/ICU-22456

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Attachments:

rules-ucol-default-strength.diff (text/x-patch)
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index 96741e08269..e84ea5057e8 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -459,7 +459,7 @@ make_icu_collator(const char *iculocstr, const char *icurules)
 
 		status = U_ZERO_ERROR;
 		collator_all_rules = ucol_openRules(all_rules, u_strlen(all_rules),
-											UCOL_DEFAULT, UCOL_DEFAULT_STRENGTH,
+											UCOL_DEFAULT, UCOL_DEFAULT,
 											NULL, &status);
 		if (U_FAILURE(status))
 		{

#3 Todd Lang
Todd.Lang@D2L.com
In reply to: Daniel Verite (#2)
RE: Supporting non-deterministic collations with tailoring rules.

Ah, somehow I missed your email on this. This is, in fact, exactly what should happen. The ICU folks are updating their documentation to reflect this with https://github.com/unicode-org/icu/pull/3684/files .

Is this small change a reasonable thing to include given the update in guidance from the ICU team?

-----Original Message-----
From: Daniel Verite <daniel@manitou-mail.org>
Sent: Wednesday, September 24, 2025 6:17 AM
To: Todd Lang <Todd.Lang@D2L.com>
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: Supporting non-deterministic collations with tailoring rules.

#4 Daniel Verite
daniel@manitou-mail.org
In reply to: Todd Lang (#3)
RE: Supporting non-deterministic collations with tailoring rules.

Todd Lang wrote:

The ICU folks are updating their documentation to reflect
this with https://github.com/unicode-org/icu/pull/3684/files .

Is this small change a reasonable thing to include given the update in
guidance from the ICU team?

I think so. Added to the commitfest:

https://commitfest.postgresql.org/patch/6084/

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/