BUG #18771: ICU custom collations with rules ignore collator strength option.

#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18771
Logged by: Ruben Ruiz
Email address: ruben.ruizcuadrado@gmail.com
PostgreSQL version: 17.2
Operating system: Debian Linux 12.2
Description:

When using the 'rules' option of CREATE COLLATION to create a custom icu
collation it seems that, if you include inside the rules a change to the
comparison strength, it is ignored. You can reproduce this by creating two
collations that should behave the same, regarding accents and case, but one
has the strength option as part of the locale (ks-level) and the other has
it inside the rules:

-- Create two custom collations that should be case and accent insensitive
postgres=# CREATE COLLATION custom_ci_ai (provider=icu,
locale='und-u-ks-level1', deterministic=false);
CREATE COLLATION
postgres=# CREATE COLLATION custom_ci_ai_with_rules (provider=icu,
locale='und', deterministic=false, rules = '[strength 1]');
CREATE COLLATION

-- Test: both comparisons should be true
postgres=# SELECT 'a'='á' COLLATE custom_ci_ai AS no_rules, 'a'='á' COLLATE custom_ci_ai_with_rules AS with_rules;
 no_rules | with_rules
----------+------------
 t        | f
(1 row)

I think the problem might reside in the call to ucol_openRules inside the
make_icu_collator function at pg_locale_icu.c
(https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/pg_locale_icu.c#L367).
Apparently, if you pass UCOL_DEFAULT_STRENGTH as the 'strength' parameter, the
resulting collator uses the default strength (which in my case was
equivalent to level3), even if the rules specify a different value. If you
pass UCOL_DEFAULT instead, it uses the strength option from the rules and,
if none is specified, falls back to the default strength.

I tested changing the parameter value to UCOL_DEFAULT, and it seems to work
as expected.

#2 Peter Eisentraut
peter_e@gmx.net
In reply to: PG Bug reporting form (#1)
Re: BUG #18771: ICU custom collations with rules ignore collator strength option.

On 11.01.25 18:27, PG Bug reporting form wrote:

When using the 'rules' option of CREATE COLLATION to create a custom icu
collation it seems that, if you include inside the rules a change to the
comparison strength, it is ignored.

I think this is the same as this ICU bug:

https://unicode-org.atlassian.net/browse/ICU-22456

#3 Ruben Ruiz
ruben.ruizcuadrado@gmail.com
In reply to: Peter Eisentraut (#2)
Re: BUG #18771: ICU custom collations with rules ignore collator strength option.

I think in this case it's not really related, as I'm not trying to copy
options from the base locale.

It all seems to come from some missing information on the official icu4c
docs. When describing the parameters of ucol_openRules() it says:

"strength: The default collation strength; one of UCOL_PRIMARY,
UCOL_SECONDARY, UCOL_TERTIARY, UCOL_IDENTICAL,UCOL_DEFAULT_STRENGTH - can
be also set in the rules"

One could easily assume that, because it "can be also set in the rules", you
could pass UCOL_DEFAULT_STRENGTH and the rules would take precedence. Nowhere
does it mention that UCOL_DEFAULT is a valid value for that parameter,
although it is mentioned for normalizationMode. But if you look at the icu4c
sources (
https://github.com/unicode-org/icu/blob/f8aa68b0c1c9584633e7a61157185f1a2c275f58/icu4c/source/i18n/collationbuilder.cpp#L182),
you can find this:

RuleBasedCollator::internalBuildTailoring(const UnicodeString &rules,
                                          int32_t strength,
                                          UColAttributeValue decompositionMode,
                                          UParseError *outParseError,
                                          UnicodeString *outReason,
                                          UErrorCode &errorCode) {

    ...
    // Set attributes after building the collator,
    // to keep the default settings consistent with the rule string.
    if(strength != UCOL_DEFAULT) {
        setAttribute(UCOL_STRENGTH,
                     static_cast<UColAttributeValue>(strength), errorCode);
    }
    ...
}

This not only implies that UCOL_DEFAULT is a valid argument, but also that
if you pass anything other than UCOL_DEFAULT, any strength set in the rules
will be overridden. So it seems that the 'make_icu_collator' function in
postgres should pass UCOL_DEFAULT, to allow the rules to set the desired
strength level, instead of the current UCOL_DEFAULT_STRENGTH argument.
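
To make the precedence explicit, here is a toy model of that branch in plain
C. It does not call ICU. The MODEL_ constants and the function
model_effective_strength are invented for this illustration, and the model
assumes that a collator built from rules without a strength setting ends up
at tertiary strength.

```c
/* Hypothetical stand-ins for the ICU strength values; not the ICU enum. */
enum model_strength
{
    MODEL_DEFAULT = -1,     /* "no explicit value" sentinel */
    MODEL_PRIMARY = 0,
    MODEL_SECONDARY = 1,
    MODEL_TERTIARY = 2
};

/*
 * Toy model of how the final strength is chosen.
 *
 * rules_strength: strength requested inside the rule string, or
 *                 MODEL_DEFAULT if the rules do not set one.
 * strength_arg:   value passed as the 'strength' parameter.
 */
static int
model_effective_strength(int rules_strength, int strength_arg)
{
    /* Step 1: the collator is built from the rules (assumed tertiary
     * when the rules say nothing about strength). */
    int effective = (rules_strength != MODEL_DEFAULT)
        ? rules_strength
        : MODEL_TERTIARY;

    /* Step 2: mirrors the quoted branch. Any explicit argument
     * replaces whatever the rules selected. */
    if (strength_arg != MODEL_DEFAULT)
        effective = strength_arg;

    return effective;
}
```

In this model, model_effective_strength(MODEL_PRIMARY, MODEL_TERTIARY) yields
MODEL_TERTIARY, which corresponds to the reported behaviour, while
model_effective_strength(MODEL_PRIMARY, MODEL_DEFAULT) yields MODEL_PRIMARY,
which corresponds to the proposed fix.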

On Mon, 13 Jan 2025 at 17:42, Peter Eisentraut <peter@eisentraut.org> wrote:

On 11.01.25 18:27, PG Bug reporting form wrote:

When using the 'rules' option of CREATE COLLATION to create a custom icu
collation it seems that, if you include inside the rules a change to the
comparison strength, it is ignored.

I think this is the same as this ICU bug:

https://unicode-org.atlassian.net/browse/ICU-22456