ICU locale validation / canonicalization

Started by Jeff Davisabout 3 years ago46 messageshackers

pgsql@j-davis.com

about 3 years ago

Right now, ICU locales are not validated:

initdb ... --locale-provider=icu --icu-locale=anything
CREATE COLLATION foo (PROVIDER=icu, LOCALE='anything');
CREATE DATABASE anythingdb ICU_LOCALE 'anything';

all succeed.

We do check that the value is accepted by ICU, but ICU seems to accept
anything and use some fallback logic. Bogus strings will typically end
up as the "root" locale (spelled "root" or "").

At first, I thought this was a bug. The ICU documentation[1]https://unicode-org.github.io/icu/userguide/locale/#fallback suggests
that the fallback logic can result in using the ICU default locale in
some cases. The default locale is problematic because it's affected by
the environment (LANG, LC_ALL, and strangely LC_MESSAGES; but strangely
not LC_COLLATE).

Fortunately, I didn't find any cases where it actually does fall back
to the default locale, so I think we're safe, but validation seems wise
regrardless. In different contexts we may want to fail (e.g. initdb
with a bogus locale), or warn, issue a notice that we changed the
string, or just silently change what the user entered to be in a
consistent form. BCP47 [2]https://en.wikipedia.org/wiki/IETF_language_tag seems to be the standard here, and we're
already using it when importing the icu collations.

ICU locale validation is not exactly straightforward, though, and I
suppose that's why it isn't already done. There's a document[3]https://unicode-org.github.io/icu/userguide/locale/#canonicalization that
explains canonicalization in terms of "level 1" and "level 2", and says
that ucol_canonicalize() provides level 2 canonicalization, but I am
not seeing all of the documented behavior in my tests. For instance,
the document says that "de__PHONEBOOK" should canonicalize to
"de@collation=phonebook", but instead I see that it remains
"de__PHONEBOOK". It also says that "C" should canonicalize to
"en_US_POSIX", but in my test, it just goes to "c".

The right entry point appears to get uloc_getLanguageTag(), which
internally calls uloc_canonicalize, but also converts to BCP47 format,
and gives the option for strictness. Non-strict mode seems problematic
because for "de__PHONEBOOK", it returns a langtag of plain "de", which
is a different actual locale than "de__PHONEBOOK". If uloc_canonicalize
worked as documented, it would have changed it to
"de@collation=phonebook" and the correct language tag "de-u-co-phonebk"
would be returned, which would find the right collator. I suppose that
means we would need to use strict mode.

And then we need to check whether it actually exists; i.e. reject well-
formed but bogus locales, like "wx-YZ". To do that, probably the most
straightforward way would be to initialize a UCollator and then query
it using ucol_getLocaleByType() with ULOC_VALID_LOCALE. If that results
in the root locale, we could say that it doesn't exist because it
failed to find a more suitable match (unless the user explicitly
requested the root locale). If it resolves to something else, we could
either just assume it's fine, or we could try to validate that it
matches what we expect in more detail. To be safe, we could double-
check that the resulting BCP 47 locale string loads the same actual
collator as what would have been loaded with the original string (also
check attributes?).

The overall benefit here is that we keep our catalogs consistently
using an independent standard format for ICU locale strings, rather
than whatever the user specifies. That makes it less likely that ICU
needs to use any fallback logic when trying to open a collator, which
could only be bad news.

Thoughts?

[1]: https://unicode-org.github.io/icu/userguide/locale/#fallback
[2]: https://en.wikipedia.org/wiki/IETF_language_tag
[3]: https://unicode-org.github.io/icu/userguide/locale/#canonicalization
https://unicode-org.github.io/icu/userguide/locale/#canonicalization

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#1)

Re: ICU locale validation / canonicalization

On 08.02.23 08:59, Jeff Davis wrote:

The overall benefit here is that we keep our catalogs consistently
using an independent standard format for ICU locale strings, rather
than whatever the user specifies. That makes it less likely that ICU
needs to use any fallback logic when trying to open a collator, which
could only be bad news.

One use case is that if a user specifies a locale, say, of 'de-AT', this
might canonicalize to 'de' today, but we should still store what the
user specified because 1) that documents what the user wanted, and 2) it
might not canonicalize to the same thing tomorrow.

Robert Haas

robertmhaas@gmail.com

about 3 years ago

In reply to: Jeff Davis (#1)

Re: ICU locale validation / canonicalization

On Wed, Feb 8, 2023 at 2:59 AM Jeff Davis <pgsql@j-davis.com> wrote:

We do check that the value is accepted by ICU, but ICU seems to accept
anything and use some fallback logic. Bogus strings will typically end
up as the "root" locale (spelled "root" or "").

I've noticed this, and I think it's really frustrating. There's barely
any documentation of what strings you're allowed to specify, and the
documentation that does exist is extremely difficult to understand.
Normally, you could work around that problem to some degree by making
a guess at what you're supposed to be doing and then seeing whether
the program accepts it, but here that doesn't work either. It just
accepts anything you give it and then you have to try to figure out
whether the behavior is what you wanted. But there's also no real
documentation of what the behavior of any collation is, so you're
apparently just supposed to magically know what collations exist and
how they behave and then you can test whether the string you put in
gave you the behavior you wanted.

Adding validation and canonicalization wouldn't cure the documentation
problems, but it would be a big help. You still wouldn't know what
string you were supposed to be passing to ICU, but if you did pass it
a string, you'd find out what it thought that string meant. I think
that would be a huge step forward.

Unfortunately, I have no idea whether your specific ideas about how to
make that happen are any good or not. But I hope they are, because the
current situation is pessimal.

--
Robert Haas
EDB: http://www.enterprisedb.com

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#2)

Re: ICU locale validation / canonicalization

On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:

One use case is that if a user specifies a locale, say, of 'de-AT',
this
might canonicalize to 'de' today,

Canonicalization should not lose useful information, it should just
rearrange it, so I don't see a risk here based on what I read and the
behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
the language tag "de-AT".

but we should still store what the
user specified because 1) that documents what the user wanted, and 2)
it
might not canonicalize to the same thing tomorrow.

We don't want to store things with ambiguous interpretations that could
change tomorrow; that's a recipe for trouble. That's why most people
store timestamps as the offset from some epoch in UTC rather than as
"2/9/23" (Feb 9 or Sept 2? 1923 or 2023?). There are exceptions where
you would want to store something like that, but I don't see why they'd
apply in this case, where reinterpretation probably means a corrupted
index.

If the user wants to know how their ad-hoc string was interpreted, they
can look at the resulting BCP 47 language tag, and see if it's what
they meant. We can try to make this user-friendly by offering a NOTICE,
WARNING, or helper functions that allow them to explore. We can also
double check that the canonicalized form resolves to the same actual
collator to be safe, and maybe even fall back to whatever the user
specified if not. I'm open to discuss how strict we want to be and what
kind of escape hatches we need to offer.

There is still a risk that the BCP 47 language tag resolves to a
different specific ICU collator or different collator version tomorrow.
That's why we need to be careful about versioning (library versions or
collator versions or both), and we've had long discussions about that.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Robert Haas (#3)

Re: ICU locale validation / canonicalization

On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote:

Unfortunately, I have no idea whether your specific ideas about how
to
make that happen are any good or not. But I hope they are, because
the
current situation is pessimal.

It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

There are a couple cases where canonicalization will succeed but
conversion to a BCP 47 language tag will fail. One is for unsupported
attributes, like "en_US@foo=bar". Another is a bug I found and reported
here:

https://unicode-org.atlassian.net/browse/ICU-22268

In both cases, we know that conversion has failed, and we have a choice
about how to proceed. We can fail, warn and continue with the user-
entered representation, or turn off the strictness checking and come up
with some BCP 47 tag and see if it resolves to the same collator.

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
level1" (the equivalent language tag). And the format is specified[1]https://unicode-org.github.io/icu/userguide/locale/#canonicalization,
even though it's not an independent standard. But I think the benefits
of better validation, an independent standard, and the fact that we're
already favoring BCP47 outweigh my subjective opinion.

I also attached a simple test program that I've been using to
experiment (not intended for code review).

It's hard for me to say that I'm sure I'm right. I really just got
involved in this a few months back, and had a few off-list
conversations with Peter Eisentraut to try to learn more (I believe he
is aligned with my proposal but I will let him speak for himself).

I should also say that I'm not exactly an expert in languages or
scripts. I assume that ICU and IETF are doing sensible things to
accommodate the diversity of human language as well as they can (or at
least much better than the Postgres project could do on its own).

I'm happy to hear more input or other proposals.

[1]: https://unicode-org.github.io/icu/userguide/locale/#canonicalization
https://unicode-org.github.io/icu/userguide/locale/#canonicalization

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Andreas Karlsson

andreas.karlsson@percona.com

about 3 years ago

In reply to: Jeff Davis (#5)

Re: ICU locale validation / canonicalization

On 2/9/23 23:09, Jeff Davis wrote:

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
level1" (the equivalent language tag). And the format is specified[1],
even though it's not an independent standard. But I think the benefits
of better validation, an independent standard, and the fact that we're
already favoring BCP47 outweigh my subjective opinion.

I have the same feeling one is readable and the other unreadable but the
unreadable one is standardized. Hard call.

And in general I agree, if we are going to make ICU default it needs to
be more user friendly than it is now. Currently there is no nice way to
understand if you entered the right locale or made a typo in the BCP47
syntax.

Andreas

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Andreas Karlsson (#6)

Re: ICU locale validation / canonicalization

On Fri, 2023-02-10 at 01:04 +0100, Andreas Karlsson wrote:

I have the same feeling one is readable and the other unreadable but
the
unreadable one is standardized. Hard call.

And in general I agree, if we are going to make ICU default it needs
to
be more user friendly than it is now. Currently there is no nice way
to
understand if you entered the right locale or made a typo in the
BCP47
syntax.

We will still allow the ICU format locale IDs for input; we would just
convert them to BCP47 before storing them in the catalog. And there's
an inverse function, so it's easy enough to offer a view that shows the
ICU format locale IDs in addition to the BCP 47 tags.

I don't think it's hugely important that we use BCP47; ICU format
locale IDs would also make sense. But I do think we should be
consistent to simplify things where we can -- collator versioning is
hard enough without wondering how a user-entered string will be
interpreted. And if we're going to be consistent, BCP 47 seems like the
most obvious choice.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Andreas Karlsson

andreas.karlsson@percona.com

about 3 years ago

In reply to: Jeff Davis (#7)

Re: ICU locale validation / canonicalization

On 2/10/23 02:22, Jeff Davis wrote:

We will still allow the ICU format locale IDs for input; we would just
convert them to BCP47 before storing them in the catalog. And there's
an inverse function, so it's easy enough to offer a view that shows the
ICU format locale IDs in addition to the BCP 47 tags.

Aha, then I misread your mail. Sorry! BCP 47 sounds perfect for storage.

Andreas

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#4)

Re: ICU locale validation / canonicalization

On 09.02.23 22:15, Jeff Davis wrote:

On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:

One use case is that if a user specifies a locale, say, of 'de-AT',
this
might canonicalize to 'de' today,

Canonicalization should not lose useful information, it should just
rearrange it, so I don't see a risk here based on what I read and the
behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
the language tag "de-AT".

It turns out that 'de_AT' is actually a distinct collation from 'de' in
CLDR, so that was not the best example. What behavior do you see for
'de_CH'?

#10

Robert Haas

robertmhaas@gmail.com

about 3 years ago

In reply to: Jeff Davis (#5)

Re: ICU locale validation / canonicalization

On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
level1" (the equivalent language tag).

Sadly, neither of those means a whole lot to me? :-(

How did you find out that those are equivalent?

And the format is specified[1],
even though it's not an independent standard. But I think the benefits
of better validation, an independent standard, and the fact that we're
already favoring BCP47 outweigh my subjective opinion.

See, I'm confused, because that link says "If a keyword list is
present it must be preceded by an at-sign" which makes it sound like
it is talking about stuff like en_US@colstrength=primary rather than
stuff like en-US-u-ks-level1. The examples are all that way too, like
it gives examples like en_IE@currency=IEP and
fr@collation=phonebook;calendar=islamic-civil.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#9)

Re: ICU locale validation / canonicalization

On Fri, 2023-02-10 at 07:42 +0100, Peter Eisentraut wrote:

It turns out that 'de_AT' is actually a distinct collation from 'de'
in
CLDR, so that was not the best example. What behavior do you see for
'de_CH'?

The canonicalized form is de_CH and the bcp47 tag is de-CH.

uloc_canonicalize() and uloc_getLanguageTag() are declared in uloc.h,
and they aren't (as far as I can tell) tied to which collations are
actually defined.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#12

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Robert Haas (#10)

Re: ICU locale validation / canonicalization

On Fri, 2023-02-10 at 09:43 -0500, Robert Haas wrote:

On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-
ks-
level1" (the equivalent language tag).

Sadly, neither of those means a whole lot to me? :-(

How did you find out that those are equivalent?

In our tests you can see colstrength=primary is used to mean "case
insensitive". That's where I picked up the "colstrength" keyword, which
is also present in the ICU sources, but now that you ask I'm embarassed
that I don't see the keyword itself documented very well.

This document
https://unicode-org.github.io/icu/userguide/locale/#keywords
lists keywords, but colstrength is not there. It's easy enough to find
in the ICU source; I'm probably just missing the document.

Here's the API reference, which tells you that you can set the strength
of a collator (using the API, not the keyword):
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#acc801048729e684bcabed328be85f77a

The more precise definitions of the strengths are here:
https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels

Regarding the equivalence of the two forms, uloc_toLanguageTag() and
uloc_toLanguageTag() are inverses. As far as I can tell (a lower degree
of assurance than you are looking for), if one succeeds, then the other
will also succeed and produce the original result.

There are another couple documents here (TR35):
http://www.unicode.org/reports/tr35/
https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
that seems to cover the "ks-level1" and how it maps to the collation
strength.

My examination of these standards is very superficial -- I'm basically
just checking that they seem to be there. If I search for a string like
"en-US-u-ks-level1", I only find Postgres-related results, so you could
also question whether these standards are actually used.

Using BCP 47 tags for icu locale strings, and moving to ICU (as
discussed in the other thread) is basically a leap of faith in ICU. The
docs aren't perfect, the source is hard to read, and we've found bugs.
But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

And the format is specified[1],
even though it's not an independent standard. But I think the
benefits
of better validation, an independent standard, and the fact that
we're
already favoring BCP47 outweigh my subjective opinion.

See, I'm confused, because that link says "If a keyword list is
present it must be preceded by an at-sign" which makes it sound like
it is talking about stuff like en_US@colstrength=primary rather than
stuff like en-US-u-ks-level1. The examples are all that way too, like
it gives examples like en_IE@currency=IEP and
fr@collation=phonebook;calendar=islamic-civil.

My paragraph was unclear, let me restate the point:

To represent ICU locale strings in the catalog consistently, we have
two choices, which as far as I can tell are equivalent:

1. ICU format Locale IDs. These are more readable, and still specified
(albeit non-standard).

2. BCP47 language tags. These are standardized, there's better
validation with "strict" mode, and we are already using them.

Honestly I don't think it's hugely important which one we pick. But
being consistent is important, so we need to pick one, and BCP 47 seems
like the better option to me.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#13

Robert Haas

robertmhaas@gmail.com

about 3 years ago

In reply to: Jeff Davis (#12)

Re: ICU locale validation / canonicalization

On Fri, Feb 10, 2023 at 12:54 PM Jeff Davis <pgsql@j-davis.com> wrote:

In our tests you can see colstrength=primary is used to mean "case
insensitive". That's where I picked up the "colstrength" keyword, which
is also present in the ICU sources, but now that you ask I'm embarassed
that I don't see the keyword itself documented very well.

This document
https://unicode-org.github.io/icu/userguide/locale/#keywords
lists keywords, but colstrength is not there. It's easy enough to find
in the ICU source; I'm probably just missing the document.

The fact that you're figuring out how it all works from reading the
source code does not give me a warm feeling.

But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

It may be. But sometimes I feel that's not setting our sights very high. :-(

--
Robert Haas
EDB: http://www.enterprisedb.com

#14

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Robert Haas (#13)

Re: ICU locale validation / canonicalization

On Fri, 2023-02-10 at 22:50 -0500, Robert Haas wrote:

The fact that you're figuring out how it all works from reading the
source code does not give me a warm feeling.

Right. On the other hand, the behavior is quite well documented, it was
just the keyword that was undocumented (or I didn't find it).

But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

It may be. But sometimes I feel that's not setting our sights very
high. :-(

How much higher could we set our sights? What would the ideal collation
provider look like?

Those are good questions, but please let's take those questions to the
thread about ICU as a default.

The topic of this thread is:

Given that we are already offering ICU support, should we canonicalize
the locale string stored in the catalog? If so, should we use the ICU
format locale IDs, or BCP 47 language tags?

Do you have an opinion on that topic? If not, do you need additional
information?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#15

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#5)

Re: ICU locale validation / canonicalization

On Thu, 2023-02-09 at 14:09 -0800, Jeff Davis wrote:

It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

Patch attached.

We should have been canonicalizing all along -- either with
uloc_toLanguageTag(), as this patch does, or at least with
uloc_canonicalize() -- before passing to ucol_open().

ucol_open() is documented[1]https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6 to work on either language tags or ICU
format locale IDs. Anything else is invalid and ends up going through
some fallback logic, probably after being mis-parsed. For instance, in
ICU 72, "fr_CA.UTF-8" is not a valid ICU format locale ID or a valid
language tag, and is resolved by ucol_open() to the actual locale
"root"; but if you canonicalize it first (to the ICU format locale ID
"fr_CA" or the language tag "fr-CA"), it correctly resolves to the
actual locale "fr_CA".

The correct thing to do is canonicalize first and then pass to
ucol_open().

But because we didn't canonicalize in the past, there could be raw
locale strings stored in the catalog that resolve to the wrong actual
collator, and there could be indexes depending on the wrong collator,
so we have to be careful during pg_upgrade.

Say someone created two ICU collations, one with locale "en_US.UTF-8"
and one with locale "fr_CA.UTF-8" in PG15. When they upgrade to PG16,
this patch will check the language tag "en-US" and see that it resolves
to the same locale as "en_US.UTF-8", and change to the language tag
during upgrade (so "en-US" will be in the new catalog). But when it
checks the language tag "fr-CA", it will notice that it resolves to a
different locale than "fr_CA.UTF-8", and keep the latter string even
though it's wrong, because some indexes might be dependent on that
wrong collator.

[1]: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#16

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#12)

Re: ICU locale validation / canonicalization

On 10.02.23 18:53, Jeff Davis wrote:

To represent ICU locale strings in the catalog consistently, we have
two choices, which as far as I can tell are equivalent:

1. ICU format Locale IDs. These are more readable, and still specified
(albeit non-standard).

2. BCP47 language tags. These are standardized, there's better
validation with "strict" mode, and we are already using them.

Honestly I don't think it's hugely important which one we pick. But
being consistent is important, so we need to pick one, and BCP 47 seems
like the better option to me.

I found some discussion about this from when ICU support was first
added. See this message as a starting point:
/messages/by-id/5291804b-169e-3ba9-fdaf-fae8e7d2d959@2ndquadrant.com

There isn't much detail there, but the discussion and the current code
seem pretty convinced that

a) BCP47 tags are preferred, and
b) They don't work with ICU versions before 54.

I can't locate the source for claim b) anymore. However, it seems
pretty clear that there is some cutoff, even if it isn't exactly 54.

I would support transitioning this forward somehow, but we would need to
know exactly what the impact would be.

#17

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#16)

Re: ICU locale validation / canonicalization

New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

On Mon, 2023-02-20 at 15:46 +0100, Peter Eisentraut wrote:

a) BCP47 tags are preferred, and

Agreed.

b) They don't work with ICU versions before 54.

I tried in versions 50 through 53, and the language tags are supported,
but I think I know why we don't use them:

Prior to version 54, ICU would not set the collator attributes based on
the locale name. That is the same for either language tags or ICU
format locale IDs. However, for ICU format locale IDs, we added special
code to parse the locale string and set the attributes ourselves. We
didn't bother to add the same parsing logic for language tags, so if a
language tag is found in the catalog, the parts of it that specify
collation strength (for example) would be ignored. I don't know if
that's an actual problem when importing the system collations, because
I don't think we use any collator attributes, but it makes sense that
we'd not favor language tags in ICU prior to v54.

I would support transitioning this forward somehow, but we would need
to
know exactly what the impact would be.

I've done quite a bit of investigation, which I've described upthread.

We need to transition somehow, because the prior behavior is incorrect
for locales like "fr_CA.UTF-8". Our tests suggest that's an acceptable
thing to do, but if we pass that straight to ucol_open(), then it gets
misinterpreted as plain "fr" because it doesn't understand the "." as a
valid separator. We must turn it into a language tag (or at least
canonicalize it) before passing the string to ucol_open().

This misbehavior only affects a small number of locales, which resolve
to a different actual collator than they should. The most problematic
case is during pg_upgrade, where a slight behavior change would result
in corrupt indexes. So during binary upgrade, my patch falls back to
the original raw string (not the language tag) when it resolves to a
different actual collator. If we want to be more paranoid, we could
also provide a compatibility GUC to preserve the old misbehavior for
newly-created collations, too, but I don't think that's necessary.

There is also some interaction with pg_upgrade's ability to check
whether the old and new cluster are compatible. If the catalog
representation of the locale changes, then it could falsely believe the
icu locales aren't compatible, because it's doing a simple string
comparison. But as we are discussing in the other thread[1]/messages/by-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de, the whole
idea of checking for compatibility of the initialized cluster is
strange: pg_upgrade should be in charge of making a compatible cluster
to upgrade into (assuming the binaries are at least compatible). I
don't see this as a major problem; we'll sort out the other thread
first to allow ICU as the default, and then adapt this patch if
necessary.

[1]: /messages/by-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de
/messages/by-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#18

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#17)

Re: ICU locale validation / canonicalization

On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:

New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

New patch attached.

Now it always preserves the exact locale string during pg_upgrade, and
does not attempt to canonicalize it. Before it was trying to be clever
by determining if the language tag was finding the same collator as the
original string -- I didn't find a problem with that, but it just
seemed a bit too clever. So, only newly-created locales and databases
have the ICU locale string canonicalized to a language tag.

Also, I added a SQL function pg_icu_language_tag() that can convert
locale strings to language tags, and check whether they exist or not.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#19

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#18)

Re: ICU locale validation / canonicalization

On 28.02.23 06:57, Jeff Davis wrote:

On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:

New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

New patch attached.

Now it always preserves the exact locale string during pg_upgrade, and
does not attempt to canonicalize it. Before it was trying to be clever
by determining if the language tag was finding the same collator as the
original string -- I didn't find a problem with that, but it just
seemed a bit too clever. So, only newly-created locales and databases
have the ICU locale string canonicalized to a language tag.

Also, I added a SQL function pg_icu_language_tag() that can convert
locale strings to language tags, and check whether they exist or not.

This patch appears to do about three things at once, and it's not clear
exactly where the boundaries are between them and which ones we might
actually want. And I think the terminology also gets mixed up a bit,
which makes following this harder.

1. Canonicalizing the locale string. This is presumably what
uloc_canonicalize() does, which the patch doesn't actually use. What
are examples of what this does? Does the patch actually do this?

2. Converting the locale string to BCP 47 format. This converts
'de@collation=phonebook' to 'de-u-co-phonebk'. This is what
uloc_getLanguageTag() does.

3. Validating the locale string, to reject faulty input.

What are the relationships between these?

I don't understand how the validation actually happens in your patch.
Does uloc_getLanguageTag() do the validation also?

Can you do canonicalization without converting to language tag?

Can you do validation of un-canonicalized locale names?

What is the guidance for the use of the icu_locale_validation GUC?

The description throws in yet another term: "validates that ICU locale
strings are well-formed". What is "well-formed"? How does that relate
to the other concepts?

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale =
'de@collation=phonebook');
NOTICE: 00000: using language tag "de-u-co-phonebk" for locale
"de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to migrate
people to the "new" spellings, but the old ones aren't wrong. So this
should be a separate consideration, with an option, and it would require
various updates in the documentation. It also doesn't appear to address
how to handle ICU before version 54.

But, see earlier questions, are these three things all connected somehow?

#20

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#19)

Re: ICU locale validation / canonicalization

On Thu, 2023-03-09 at 09:46 +0100, Peter Eisentraut wrote:

This patch appears to do about three things at once, and it's not
clear
exactly where the boundaries are between them and which ones we might
actually want. And I think the terminology also gets mixed up a bit,
which makes following this harder.

1. Canonicalizing the locale string. This is presumably what
uloc_canonicalize() does, which the patch doesn't actually use. What
are examples of what this does? Does the patch actually do this?

Both uloc_canonicalize() and uloc_getLanguageTag() do Level 2
Canonicalization, which is described here:

https://unicode-org.github.io/icu/userguide/locale/#canonicalization

2. Converting the locale string to BCP 47 format. This converts
'de@collation=phonebook' to 'de-u-co-phonebk'. This is what
uloc_getLanguageTag() does.

Yes, though uloc_getLanguageTag() also canonicalizes. I consider
converting to the language tag a part of "canonicalization", because
it's the canonical form we agreed on in this thread.

3. Validating the locale string, to reject faulty input.

Canonicalization doesn't make sure the locale actually exists in ICU,
so it's easy to make a typo like "jp_JP" instead of "ja_JP". After
canonicalizing to a language tag, the former is "jp-JP" (resolving to
the collator with valid locale "root") and the latter is "ja-JP"
(resolving to the collator with valid locale "ja"). The former is
clearly a mistake, and I call catching that mistake "validation".

If the user specifies something other than the root locale (i.e. not
"root", "und", or ""), and the locale resolves to a collator with a
valid locale of "root", then this patch considers that to be a mistake
and issues a WARNING (upgraded to ERROR if the GUC
icu_locale_validation is true).

What are the relationships between these?

1 & 2 are closely related. If we canonicalize, we need to pick one
canonical form: either BCP 47 or ICU format locale IDs.

3 is related, but can be seen as an independent change.

I don't understand how the validation actually happens in your patch.
Does uloc_getLanguageTag() do the validation also?

Using the above definition of "validation" it happens inside
icu_collator_exists().

Can you do canonicalization without converting to language tag?

If we used uloc_canonicalize(), it would give us ICU format locale IDs,
and that would be a valid thing to do; and we could switch the
canonical form from ICU format locale IDs to BCP 47 in a separate
patch. I don't have a strong opinion, but if we're going to
canonicalize, I think it makes sense to go straight to language tags.

Can you do validation of un-canonicalized locale names?

Yes, though I feel like an un-canonicalized name is less stable in
meaning, and so validation on that name may also be less stable.

For instance, if we don't canonicalize "fr_CA.UTF-8", it resolves to
plain "fr"; but if we do canonicalize it first, it resolves to "fr-CA".
Will the uncanonicalized name always resolve to "fr"? I'm not sure,
because the documentation says that ucol_open() expects either an ICU
format locale ID or, preferably, a language tag.

So they are technically independently useful changes, but I would
recommend that canonicalization goes in first.

What is the guidance for the use of the icu_locale_validation GUC?

If an error when creating a new collation or database due to a bad
locale name would be highly disruptive, leave it false. If such an
error would be helpful to make sure you get the locale you expect, then
turn it on. In practice, existing important production systems would
leave it off; new systems could turn it on to help avoid
misconfigurations/mistakes.

The description throws in yet another term: "validates that ICU
locale
strings are well-formed". What is "well-formed"? How does that
relate
to the other concepts?

Good point, I don't think I need to redefine "validation". Maybe I
should just describe it as elevating canonicalization or validation
problems from WARNING to ERROR.

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale =
'de@collation=phonebook');
NOTICE: 00000: using language tag "de-u-co-phonebk" for locale
"de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to
migrate
people to the "new" spellings, but the old ones aren't wrong.

I see what you mean; I'm not sure the best thing to do here. We are
adjusting the string passed by the user, and it feels like some users
might want to know that. It's a NOTICE, not a WARNING, so it's not
meant to imply that it's wrong.

But at the same time I can see it being annoying or confusing. If it's
confusing, perhaps a wording change and documentation would improve it?
If it's annoying, we might need to have an option and/or a different
log level?