ICU locale validation / canonicalization

Started by Jeff Davisalmost 3 years ago46 messages

pgsql@j-davis.com

almost 3 years ago

Right now, ICU locales are not validated:

initdb ... --locale-provider=icu --icu-locale=anything
CREATE COLLATION foo (PROVIDER=icu, LOCALE='anything');
CREATE DATABASE anythingdb ICU_LOCALE 'anything';

all succeed.

We do check that the value is accepted by ICU, but ICU seems to accept
anything and use some fallback logic. Bogus strings will typically end
up as the "root" locale (spelled "root" or "").

At first, I thought this was a bug. The ICU documentation[1]https://unicode-org.github.io/icu/userguide/locale/#fallback suggests
that the fallback logic can result in using the ICU default locale in
some cases. The default locale is problematic because it's affected by
the environment (LANG, LC_ALL, and strangely LC_MESSAGES; but strangely
not LC_COLLATE).

Fortunately, I didn't find any cases where it actually does fall back
to the default locale, so I think we're safe, but validation seems wise
regrardless. In different contexts we may want to fail (e.g. initdb
with a bogus locale), or warn, issue a notice that we changed the
string, or just silently change what the user entered to be in a
consistent form. BCP47 [2]https://en.wikipedia.org/wiki/IETF_language_tag seems to be the standard here, and we're
already using it when importing the icu collations.

ICU locale validation is not exactly straightforward, though, and I
suppose that's why it isn't already done. There's a document[3]https://unicode-org.github.io/icu/userguide/locale/#canonicalization that
explains canonicalization in terms of "level 1" and "level 2", and says
that ucol_canonicalize() provides level 2 canonicalization, but I am
not seeing all of the documented behavior in my tests. For instance,
the document says that "de__PHONEBOOK" should canonicalize to
"de@collation=phonebook", but instead I see that it remains
"de__PHONEBOOK". It also says that "C" should canonicalize to
"en_US_POSIX", but in my test, it just goes to "c".

The right entry point appears to get uloc_getLanguageTag(), which
internally calls uloc_canonicalize, but also converts to BCP47 format,
and gives the option for strictness. Non-strict mode seems problematic
because for "de__PHONEBOOK", it returns a langtag of plain "de", which
is a different actual locale than "de__PHONEBOOK". If uloc_canonicalize
worked as documented, it would have changed it to
"de@collation=phonebook" and the correct language tag "de-u-co-phonebk"
would be returned, which would find the right collator. I suppose that
means we would need to use strict mode.

And then we need to check whether it actually exists; i.e. reject well-
formed but bogus locales, like "wx-YZ". To do that, probably the most
straightforward way would be to initialize a UCollator and then query
it using ucol_getLocaleByType() with ULOC_VALID_LOCALE. If that results
in the root locale, we could say that it doesn't exist because it
failed to find a more suitable match (unless the user explicitly
requested the root locale). If it resolves to something else, we could
either just assume it's fine, or we could try to validate that it
matches what we expect in more detail. To be safe, we could double-
check that the resulting BCP 47 locale string loads the same actual
collator as what would have been loaded with the original string (also
check attributes?).

The overall benefit here is that we keep our catalogs consistently
using an independent standard format for ICU locale strings, rather
than whatever the user specifies. That makes it less likely that ICU
needs to use any fallback logic when trying to open a collator, which
could only be bad news.

Thoughts?

[1]: https://unicode-org.github.io/icu/userguide/locale/#fallback
[2]: https://en.wikipedia.org/wiki/IETF_language_tag
[3]: https://unicode-org.github.io/icu/userguide/locale/#canonicalization
https://unicode-org.github.io/icu/userguide/locale/#canonicalization

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#1)

Re: ICU locale validation / canonicalization

On 08.02.23 08:59, Jeff Davis wrote:

The overall benefit here is that we keep our catalogs consistently
using an independent standard format for ICU locale strings, rather
than whatever the user specifies. That makes it less likely that ICU
needs to use any fallback logic when trying to open a collator, which
could only be bad news.

One use case is that if a user specifies a locale, say, of 'de-AT', this
might canonicalize to 'de' today, but we should still store what the
user specified because 1) that documents what the user wanted, and 2) it
might not canonicalize to the same thing tomorrow.

Robert Haas

robertmhaas@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#1)

Re: ICU locale validation / canonicalization

On Wed, Feb 8, 2023 at 2:59 AM Jeff Davis <pgsql@j-davis.com> wrote:

We do check that the value is accepted by ICU, but ICU seems to accept
anything and use some fallback logic. Bogus strings will typically end
up as the "root" locale (spelled "root" or "").

I've noticed this, and I think it's really frustrating. There's barely
any documentation of what strings you're allowed to specify, and the
documentation that does exist is extremely difficult to understand.
Normally, you could work around that problem to some degree by making
a guess at what you're supposed to be doing and then seeing whether
the program accepts it, but here that doesn't work either. It just
accepts anything you give it and then you have to try to figure out
whether the behavior is what you wanted. But there's also no real
documentation of what the behavior of any collation is, so you're
apparently just supposed to magically know what collations exist and
how they behave and then you can test whether the string you put in
gave you the behavior you wanted.

Adding validation and canonicalization wouldn't cure the documentation
problems, but it would be a big help. You still wouldn't know what
string you were supposed to be passing to ICU, but if you did pass it
a string, you'd find out what it thought that string meant. I think
that would be a huge step forward.

Unfortunately, I have no idea whether your specific ideas about how to
make that happen are any good or not. But I hope they are, because the
current situation is pessimal.

--
Robert Haas
EDB: http://www.enterprisedb.com

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#2)

Re: ICU locale validation / canonicalization

On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:

One use case is that if a user specifies a locale, say, of 'de-AT',
this
might canonicalize to 'de' today,

Canonicalization should not lose useful information, it should just
rearrange it, so I don't see a risk here based on what I read and the
behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
the language tag "de-AT".

but we should still store what the
user specified because 1) that documents what the user wanted, and 2)
it
might not canonicalize to the same thing tomorrow.

We don't want to store things with ambiguous interpretations that could
change tomorrow; that's a recipe for trouble. That's why most people
store timestamps as the offset from some epoch in UTC rather than as
"2/9/23" (Feb 9 or Sept 2? 1923 or 2023?). There are exceptions where
you would want to store something like that, but I don't see why they'd
apply in this case, where reinterpretation probably means a corrupted
index.

If the user wants to know how their ad-hoc string was interpreted, they
can look at the resulting BCP 47 language tag, and see if it's what
they meant. We can try to make this user-friendly by offering a NOTICE,
WARNING, or helper functions that allow them to explore. We can also
double check that the canonicalized form resolves to the same actual
collator to be safe, and maybe even fall back to whatever the user
specified if not. I'm open to discuss how strict we want to be and what
kind of escape hatches we need to offer.

There is still a risk that the BCP 47 language tag resolves to a
different specific ICU collator or different collator version tomorrow.
That's why we need to be careful about versioning (library versions or
collator versions or both), and we've had long discussions about that.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Robert Haas (#3)

1 attachment(s)

Re: ICU locale validation / canonicalization

On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote:

Unfortunately, I have no idea whether your specific ideas about how
to
make that happen are any good or not. But I hope they are, because
the
current situation is pessimal.

It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

There are a couple cases where canonicalization will succeed but
conversion to a BCP 47 language tag will fail. One is for unsupported
attributes, like "en_US@foo=bar". Another is a bug I found and reported
here:

https://unicode-org.atlassian.net/browse/ICU-22268

In both cases, we know that conversion has failed, and we have a choice
about how to proceed. We can fail, warn and continue with the user-
entered representation, or turn off the strictness checking and come up
with some BCP 47 tag and see if it resolves to the same collator.

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
level1" (the equivalent language tag). And the format is specified[1]https://unicode-org.github.io/icu/userguide/locale/#canonicalization,
even though it's not an independent standard. But I think the benefits
of better validation, an independent standard, and the fact that we're
already favoring BCP47 outweigh my subjective opinion.

I also attached a simple test program that I've been using to
experiment (not intended for code review).

It's hard for me to say that I'm sure I'm right. I really just got
involved in this a few months back, and had a few off-list
conversations with Peter Eisentraut to try to learn more (I believe he
is aligned with my proposal but I will let him speak for himself).

I should also say that I'm not exactly an expert in languages or
scripts. I assume that ICU and IETF are doing sensible things to
accommodate the diversity of human language as well as they can (or at
least much better than the Postgres project could do on its own).

I'm happy to hear more input or other proposals.

[1]: https://unicode-org.github.io/icu/userguide/locale/#canonicalization
https://unicode-org.github.io/icu/userguide/locale/#canonicalization

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Andreas Karlsson

andreas@proxel.se

almost 3 years ago

In reply to: Jeff Davis (#5)

Re: ICU locale validation / canonicalization

On 2/9/23 23:09, Jeff Davis wrote:

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
level1" (the equivalent language tag). And the format is specified[1],
even though it's not an independent standard. But I think the benefits
of better validation, an independent standard, and the fact that we're
already favoring BCP47 outweigh my subjective opinion.

I have the same feeling one is readable and the other unreadable but the
unreadable one is standardized. Hard call.

And in general I agree, if we are going to make ICU default it needs to
be more user friendly than it is now. Currently there is no nice way to
understand if you entered the right locale or made a typo in the BCP47
syntax.

Andreas

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Andreas Karlsson (#6)

Re: ICU locale validation / canonicalization

On Fri, 2023-02-10 at 01:04 +0100, Andreas Karlsson wrote:

I have the same feeling one is readable and the other unreadable but
the
unreadable one is standardized. Hard call.

And in general I agree, if we are going to make ICU default it needs
to
be more user friendly than it is now. Currently there is no nice way
to
understand if you entered the right locale or made a typo in the
BCP47
syntax.

We will still allow the ICU format locale IDs for input; we would just
convert them to BCP47 before storing them in the catalog. And there's
an inverse function, so it's easy enough to offer a view that shows the
ICU format locale IDs in addition to the BCP 47 tags.

I don't think it's hugely important that we use BCP47; ICU format
locale IDs would also make sense. But I do think we should be
consistent to simplify things where we can -- collator versioning is
hard enough without wondering how a user-entered string will be
interpreted. And if we're going to be consistent, BCP 47 seems like the
most obvious choice.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Andreas Karlsson

andreas@proxel.se

almost 3 years ago

In reply to: Jeff Davis (#7)

Re: ICU locale validation / canonicalization

On 2/10/23 02:22, Jeff Davis wrote:

We will still allow the ICU format locale IDs for input; we would just
convert them to BCP47 before storing them in the catalog. And there's
an inverse function, so it's easy enough to offer a view that shows the
ICU format locale IDs in addition to the BCP 47 tags.

Aha, then I misread your mail. Sorry! BCP 47 sounds perfect for storage.

Andreas

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#4)

Re: ICU locale validation / canonicalization

On 09.02.23 22:15, Jeff Davis wrote:

On Thu, 2023-02-09 at 15:44 +0100, Peter Eisentraut wrote:

One use case is that if a user specifies a locale, say, of 'de-AT',
this
might canonicalize to 'de' today,

Canonicalization should not lose useful information, it should just
rearrange it, so I don't see a risk here based on what I read and the
behavior I saw. In ICU, "de-AT" canonicalizes to "de_AT" and becomes
the language tag "de-AT".

It turns out that 'de_AT' is actually a distinct collation from 'de' in
CLDR, so that was not the best example. What behavior do you see for
'de_CH'?

#10

Robert Haas

robertmhaas@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#5)

Re: ICU locale validation / canonicalization

On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-ks-
level1" (the equivalent language tag).

Sadly, neither of those means a whole lot to me? :-(

How did you find out that those are equivalent?

And the format is specified[1],
even though it's not an independent standard. But I think the benefits
of better validation, an independent standard, and the fact that we're
already favoring BCP47 outweigh my subjective opinion.

See, I'm confused, because that link says "If a keyword list is
present it must be preceded by an at-sign" which makes it sound like
it is talking about stuff like en_US@colstrength=primary rather than
stuff like en-US-u-ks-level1. The examples are all that way too, like
it gives examples like en_IE@currency=IEP and
fr@collation=phonebook;calendar=islamic-civil.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#9)

Re: ICU locale validation / canonicalization

On Fri, 2023-02-10 at 07:42 +0100, Peter Eisentraut wrote:

It turns out that 'de_AT' is actually a distinct collation from 'de'
in
CLDR, so that was not the best example. What behavior do you see for
'de_CH'?

The canonicalized form is de_CH and the bcp47 tag is de-CH.

uloc_canonicalize() and uloc_getLanguageTag() are declared in uloc.h,
and they aren't (as far as I can tell) tied to which collations are
actually defined.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#12

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Robert Haas (#10)

Re: ICU locale validation / canonicalization

On Fri, 2023-02-10 at 09:43 -0500, Robert Haas wrote:

On Thu, Feb 9, 2023 at 5:09 PM Jeff Davis <pgsql@j-davis.com> wrote:

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than "en-US-u-
ks-
level1" (the equivalent language tag).

Sadly, neither of those means a whole lot to me? :-(

How did you find out that those are equivalent?

In our tests you can see colstrength=primary is used to mean "case
insensitive". That's where I picked up the "colstrength" keyword, which
is also present in the ICU sources, but now that you ask I'm embarassed
that I don't see the keyword itself documented very well.

This document
https://unicode-org.github.io/icu/userguide/locale/#keywords
lists keywords, but colstrength is not there. It's easy enough to find
in the ICU source; I'm probably just missing the document.

Here's the API reference, which tells you that you can set the strength
of a collator (using the API, not the keyword):
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#acc801048729e684bcabed328be85f77a

The more precise definitions of the strengths are here:
https://unicode-org.github.io/icu/userguide/collation/concepts.html#comparison-levels

Regarding the equivalence of the two forms, uloc_toLanguageTag() and
uloc_toLanguageTag() are inverses. As far as I can tell (a lower degree
of assurance than you are looking for), if one succeeds, then the other
will also succeed and produce the original result.

There are another couple documents here (TR35):
http://www.unicode.org/reports/tr35/
https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
that seems to cover the "ks-level1" and how it maps to the collation
strength.

My examination of these standards is very superficial -- I'm basically
just checking that they seem to be there. If I search for a string like
"en-US-u-ks-level1", I only find Postgres-related results, so you could
also question whether these standards are actually used.

Using BCP 47 tags for icu locale strings, and moving to ICU (as
discussed in the other thread) is basically a leap of faith in ICU. The
docs aren't perfect, the source is hard to read, and we've found bugs.
But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

And the format is specified[1],
even though it's not an independent standard. But I think the
benefits
of better validation, an independent standard, and the fact that
we're
already favoring BCP47 outweigh my subjective opinion.

See, I'm confused, because that link says "If a keyword list is
present it must be preceded by an at-sign" which makes it sound like
it is talking about stuff like en_US@colstrength=primary rather than
stuff like en-US-u-ks-level1. The examples are all that way too, like
it gives examples like en_IE@currency=IEP and
fr@collation=phonebook;calendar=islamic-civil.

My paragraph was unclear, let me restate the point:

To represent ICU locale strings in the catalog consistently, we have
two choices, which as far as I can tell are equivalent:

1. ICU format Locale IDs. These are more readable, and still specified
(albeit non-standard).

2. BCP47 language tags. These are standardized, there's better
validation with "strict" mode, and we are already using them.

Honestly I don't think it's hugely important which one we pick. But
being consistent is important, so we need to pick one, and BCP 47 seems
like the better option to me.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#13

Robert Haas

robertmhaas@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#12)

Re: ICU locale validation / canonicalization

On Fri, Feb 10, 2023 at 12:54 PM Jeff Davis <pgsql@j-davis.com> wrote:

In our tests you can see colstrength=primary is used to mean "case
insensitive". That's where I picked up the "colstrength" keyword, which
is also present in the ICU sources, but now that you ask I'm embarassed
that I don't see the keyword itself documented very well.

This document
https://unicode-org.github.io/icu/userguide/locale/#keywords
lists keywords, but colstrength is not there. It's easy enough to find
in the ICU source; I'm probably just missing the document.

The fact that you're figuring out how it all works from reading the
source code does not give me a warm feeling.

But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

It may be. But sometimes I feel that's not setting our sights very high. :-(

--
Robert Haas
EDB: http://www.enterprisedb.com

#14

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Robert Haas (#13)

Re: ICU locale validation / canonicalization

On Fri, 2023-02-10 at 22:50 -0500, Robert Haas wrote:

The fact that you're figuring out how it all works from reading the
source code does not give me a warm feeling.

Right. On the other hand, the behavior is quite well documented, it was
just the keyword that was undocumented (or I didn't find it).

But it seems like a better place for us than libc for the reasons I
mentioned in the other thread.

It may be. But sometimes I feel that's not setting our sights very
high. :-(

How much higher could we set our sights? What would the ideal collation
provider look like?

Those are good questions, but please let's take those questions to the
thread about ICU as a default.

The topic of this thread is:

Given that we are already offering ICU support, should we canonicalize
the locale string stored in the catalog? If so, should we use the ICU
format locale IDs, or BCP 47 language tags?

Do you have an opinion on that topic? If not, do you need additional
information?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#15

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#5)

1 attachment(s)

Re: ICU locale validation / canonicalization

On Thu, 2023-02-09 at 14:09 -0800, Jeff Davis wrote:

It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

Patch attached.

We should have been canonicalizing all along -- either with
uloc_toLanguageTag(), as this patch does, or at least with
uloc_canonicalize() -- before passing to ucol_open().

ucol_open() is documented[1]https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6 to work on either language tags or ICU
format locale IDs. Anything else is invalid and ends up going through
some fallback logic, probably after being mis-parsed. For instance, in
ICU 72, "fr_CA.UTF-8" is not a valid ICU format locale ID or a valid
language tag, and is resolved by ucol_open() to the actual locale
"root"; but if you canonicalize it first (to the ICU format locale ID
"fr_CA" or the language tag "fr-CA"), it correctly resolves to the
actual locale "fr_CA".

The correct thing to do is canonicalize first and then pass to
ucol_open().

But because we didn't canonicalize in the past, there could be raw
locale strings stored in the catalog that resolve to the wrong actual
collator, and there could be indexes depending on the wrong collator,
so we have to be careful during pg_upgrade.

Say someone created two ICU collations, one with locale "en_US.UTF-8"
and one with locale "fr_CA.UTF-8" in PG15. When they upgrade to PG16,
this patch will check the language tag "en-US" and see that it resolves
to the same locale as "en_US.UTF-8", and change to the language tag
during upgrade (so "en-US" will be in the new catalog). But when it
checks the language tag "fr-CA", it will notice that it resolves to a
different locale than "fr_CA.UTF-8", and keep the latter string even
though it's wrong, because some indexes might be dependent on that
wrong collator.

[1]: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v1-0001-For-ICU-collations-canonicalize-locale-names-to-l.patchtext/x-patch; charset=UTF-8; name=v1-0001-For-ICU-collations-canonicalize-locale-names-to-l.patchDownload

From bb873845bf1c9bc019606a691d66c2d148a745e2 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Feb 2023 23:05:08 -0800
Subject: [PATCH v1] For ICU collations, canonicalize locale names to language
 tags.

Before storing the locale name in the catalog, convert to a BCP47
language tag. The language tag should hold all of the necessary
information and also be an unambiguous representation of the locale.

During pg_upgrade, the previous locale string may need to be preserved
if the language tag resolves to a different actual locale.
---
 src/backend/commands/collationcmds.c |  39 ++++----
 src/backend/commands/dbcommands.c    |  58 +++++++++++
 src/backend/utils/adt/pg_locale.c    | 139 +++++++++++++++++++++++++--
 src/include/commands/dbcommands.h    |   1 +
 src/include/utils/pg_locale.h        |   3 +
 5 files changed, 211 insertions(+), 29 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index eb62d285ea..3a98bbbd35 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -240,10 +240,23 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
+			char *iculocale;
+
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			check_icu_locale(colliculocale);
+			iculocale = get_icu_locale(colliculocale);
+			if (iculocale)
+				colliculocale = iculocale;
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
@@ -556,26 +569,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -938,7 +931,11 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name);
+			if (langtag == NULL)
+				ereport(ERROR,
+						(errmsg("could not convert locale name \"%s\" to language tag",
+								name)));
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
 
 			/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index ef05633bb0..ccb5304250 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1029,6 +1029,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
+		char *iculocale;
+
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1045,6 +1048,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					 errmsg("ICU locale must be specified")));
 
 		check_icu_locale(dbiculocale);
+		iculocale = get_icu_locale(dbiculocale);
+		if (iculocale)
+			dbiculocale = iculocale;
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
@@ -1463,6 +1474,53 @@ check_encoding_locale_matches(int encoding, const char *collate, const char *cty
 						   pg_encoding_to_char(collate_encoding))));
 }
 
+/*
+ * Given the input ICU locale string, return a new string in a form suitable
+ * for storing in the catalog.
+ *
+ * Ordinarily this just converts to a language tag, but we need to make an
+ * allowance for invalid locale strings that come from earlier versions of
+ * Postgres while upgrading.
+ *
+ * Converting to a language tag performs "level 2 canonicalization". In
+ * addition to producing a consistent result format, level 2 canonicalization
+ * is able to more accurately interpret different input locale string formats,
+ * such as POSIX and .NET IDs. But prior to Postgres version 16, input locale
+ * strings were not canonicalized; the raw string provided by the user was
+ * stored in the catalog and passed directly to ucol_open().
+ *
+ * The raw string may resolve to the wrong actual collator when passed to
+ * directly ucol_open(), but indexes in older versions may depend on that
+ * actual collator. Therefore, during binary upgrade, we preserve the invalid
+ * raw string if it resolves to a different actual collator than the language
+ * tag. If it resolves to the same actual collator, then we proceed using the
+ * language tag.
+ */
+char *
+get_icu_locale(const char *requested_locale)
+{
+#ifdef USE_ICU
+	char *lang_tag = icu_language_tag(requested_locale);
+
+	if (lang_tag != NULL && IsBinaryUpgrade &&
+		!check_equivalent_icu_locales(requested_locale, lang_tag))
+	{
+		ereport(WARNING,
+				(errmsg("language tag \"%s\" resolves to different actual collator "
+						"than raw locale string \"%s\"",
+						lang_tag, requested_locale)));
+		pfree(lang_tag);
+		return pstrdup(requested_locale);
+	}
+
+	return lang_tag;
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
+}
+
 /* Error cleanup callback for createdb */
 static void
 createdb_failure_callback(int code, Datum arg)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 059e4fd79f..52d9c8b5a6 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1945,15 +1945,12 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 	}
 }
 
-#endif							/* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 	UErrorCode	status;
 
@@ -1967,13 +1964,139 @@ check_icu_locale(const char *icu_locale)
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
 		icu_set_collation_attributes(collator, icu_locale);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
+/*
+ * Test if the given locales resolve to the same actual collator with the same
+ * attributes and version.
+ */
+bool
+check_equivalent_icu_locales(const char *locale1, const char *locale2)
+{
+	const UColAttribute collAtt[]	   = {
+		UCOL_FRENCH_COLLATION,
+		UCOL_ALTERNATE_HANDLING,
+		UCOL_CASE_FIRST,
+		UCOL_CASE_LEVEL,
+		UCOL_NORMALIZATION_MODE,
+		UCOL_DECOMPOSITION_MODE,
+		UCOL_STRENGTH,
+		UCOL_HIRAGANA_QUATERNARY_MODE,
+		UCOL_NUMERIC_COLLATION};
+	int              n_collAtt		   = sizeof(collAtt)/sizeof(*collAtt);
+	const char		*actual1, *actual2;
+	UVersionInfo     versionInfo1;
+	UVersionInfo	 versionInfo2;
+	char             version1[U_MAX_VERSION_STRING_LENGTH];
+	char             version2[U_MAX_VERSION_STRING_LENGTH];
+	UCollator		*collator1		   = NULL;
+	UCollator		*collator2		   = NULL;
+	UErrorCode		 status;
+	bool			 result			   = false;
+
+	/*
+	 * Be careful not to return without closing the collators.
+	 */
+
+	status = U_ZERO_ERROR;
+	collator1 = ucol_open(locale1, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	status = U_ZERO_ERROR;
+	collator2 = ucol_open(locale2, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	/* actual locale */
+	status = U_ZERO_ERROR;
+	actual1 = ucol_getLocaleByType(collator1, ULOC_ACTUAL_LOCALE, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	status = U_ZERO_ERROR;
+	actual2 = ucol_getLocaleByType(collator2, ULOC_ACTUAL_LOCALE, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	if (strcmp(actual1, actual2) != 0)
+		goto cleanup;
+
+	/* version */
+	ucol_getVersion(collator1, versionInfo1);
+	u_versionToString(versionInfo1, version1);
+	ucol_getVersion(collator2, versionInfo2);
+	u_versionToString(versionInfo2, version2);
+	if (strcmp(version1, version2) != 0)
+		goto cleanup;
+
+	/* attributes */
+	for (int i = 0; i < n_collAtt; i++)
+	{
+		UColAttributeValue val1, val2;
+
+		status = U_ZERO_ERROR;
+		val1 = ucol_getAttribute(collator1, collAtt[i], &status);
+		if (U_FAILURE(status))
+			goto cleanup;
+
+		status = U_ZERO_ERROR;
+		val2 = ucol_getAttribute(collator2, collAtt[i], &status);
+		if (U_FAILURE(status))
+			goto cleanup;
+
+		if (val1 != val2)
+			goto cleanup;
+	}
+
+	/* passed all the best-effort checks for equivalence */
+	result = true;
+
+cleanup:
+	if (collator2)
+		ucol_close(collator2);
+	if (collator1)
+		ucol_close(collator1);
+
+	return result;
+}
+
+/*
+ * Return the BCP47 language tag representation of the requested locale; or
+ * NULL if a problem is encountered.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent result format,
+ * level 2 canonicalization is able to more accurately interpret different
+ * input locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *requested_locale)
+{
+	UErrorCode	 status;
+	char		*result;
+	int32_t		 len;
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	len = uloc_toLanguageTag(requested_locale, NULL, 0, strict, &status);
+
+	result = palloc(len + 1);
+
+	status = U_ZERO_ERROR;
+	uloc_toLanguageTag(requested_locale, result, len + 1, strict, &status);
+	if (U_FAILURE(status))
+	{
+		pfree(result);
+		return NULL;
+	}
+
+	return result;
+}
+
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 5fbc3ca752..0f0e827ff2 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -33,5 +33,6 @@ extern char *get_database_name(Oid dbid);
 extern bool have_createdb_privilege(void);
 
 extern void check_encoding_locale_matches(int encoding, const char *collate, const char *ctype);
+extern char *get_icu_locale(const char *requested_locale);
 
 #endif							/* DBCOMMANDS_H */
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index cede43440b..5f9626985f 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -104,8 +104,11 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern bool check_equivalent_icu_locales(const char *locale1,
+										 const char *locale2);
 #endif
 extern void check_icu_locale(const char *icu_locale);
+extern char *icu_language_tag(const char *requested_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
 extern size_t wchar2char(char *to, const wchar_t *from, size_t tolen,
-- 
2.34.1

#16

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#12)

Re: ICU locale validation / canonicalization

On 10.02.23 18:53, Jeff Davis wrote:

To represent ICU locale strings in the catalog consistently, we have
two choices, which as far as I can tell are equivalent:

1. ICU format Locale IDs. These are more readable, and still specified
(albeit non-standard).

2. BCP47 language tags. These are standardized, there's better
validation with "strict" mode, and we are already using them.

Honestly I don't think it's hugely important which one we pick. But
being consistent is important, so we need to pick one, and BCP 47 seems
like the better option to me.

I found some discussion about this from when ICU support was first
added. See this message as a starting point:
/messages/by-id/5291804b-169e-3ba9-fdaf-fae8e7d2d959@2ndquadrant.com

There isn't much detail there, but the discussion and the current code
seem pretty convinced that

a) BCP47 tags are preferred, and
b) They don't work with ICU versions before 54.

I can't locate the source for claim b) anymore. However, it seems
pretty clear that there is some cutoff, even if it isn't exactly 54.

I would support transitioning this forward somehow, but we would need to
know exactly what the impact would be.

#17

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#16)

1 attachment(s)

Re: ICU locale validation / canonicalization

New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

On Mon, 2023-02-20 at 15:46 +0100, Peter Eisentraut wrote:

a) BCP47 tags are preferred, and

Agreed.

b) They don't work with ICU versions before 54.

I tried in versions 50 through 53, and the language tags are supported,
but I think I know why we don't use them:

Prior to version 54, ICU would not set the collator attributes based on
the locale name. That is the same for either language tags or ICU
format locale IDs. However, for ICU format locale IDs, we added special
code to parse the locale string and set the attributes ourselves. We
didn't bother to add the same parsing logic for language tags, so if a
language tag is found in the catalog, the parts of it that specify
collation strength (for example) would be ignored. I don't know if
that's an actual problem when importing the system collations, because
I don't think we use any collator attributes, but it makes sense that
we'd not favor language tags in ICU prior to v54.

I would support transitioning this forward somehow, but we would need
to
know exactly what the impact would be.

I've done quite a bit of investigation, which I've described upthread.

We need to transition somehow, because the prior behavior is incorrect
for locales like "fr_CA.UTF-8". Our tests suggest that's an acceptable
thing to do, but if we pass that straight to ucol_open(), then it gets
misinterpreted as plain "fr" because it doesn't understand the "." as a
valid separator. We must turn it into a language tag (or at least
canonicalize it) before passing the string to ucol_open().

This misbehavior only affects a small number of locales, which resolve
to a different actual collator than they should. The most problematic
case is during pg_upgrade, where a slight behavior change would result
in corrupt indexes. So during binary upgrade, my patch falls back to
the original raw string (not the language tag) when it resolves to a
different actual collator. If we want to be more paranoid, we could
also provide a compatibility GUC to preserve the old misbehavior for
newly-created collations, too, but I don't think that's necessary.

There is also some interaction with pg_upgrade's ability to check
whether the old and new cluster are compatible. If the catalog
representation of the locale changes, then it could falsely believe the
icu locales aren't compatible, because it's doing a simple string
comparison. But as we are discussing in the other thread[1]/messages/by-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de, the whole
idea of checking for compatibility of the initialized cluster is
strange: pg_upgrade should be in charge of making a compatible cluster
to upgrade into (assuming the binaries are at least compatible). I
don't see this as a major problem; we'll sort out the other thread
first to allow ICU as the default, and then adapt this patch if
necessary.

[1]: /messages/by-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de
/messages/by-id/20230214175957.idkb7shsqzp5nbll@awork3.anarazel.de

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v2-0001-ICU-locale-string-canonicalization-and-validation.patchtext/x-patch; charset=UTF-8; name=v2-0001-ICU-locale-string-canonicalization-and-validation.patchDownload

From b2e42cef9d8080ad27ef76444b74a72e5cda922c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Feb 2023 23:05:08 -0800
Subject: [PATCH v2] ICU locale string canonicalization and validation.

Before storing the locale name in the catalog, convert to a BCP47
language tag. The language tag should hold all of the necessary
information and also be an unambiguous representation of the locale.

Also, add a new GUC icu_locale_validation. When set to true, it raises
an ERROR if the locale string is malformed or if it is not a valid
locale in ICU.

During pg_upgrade, the previous locale string may need to be preserved
if the language tag resolves to a different actual locale.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com
---
 doc/src/sgml/config.sgml                      |  16 ++
 src/backend/commands/collationcmds.c          |  63 +++--
 src/backend/commands/dbcommands.c             |  81 +++++++
 src/backend/utils/adt/pg_locale.c             | 226 +++++++++++++++++-
 src/backend/utils/misc/guc_tables.c           |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/commands/dbcommands.h             |   1 +
 src/include/utils/pg_locale.h                 |   4 +
 .../regress/expected/collate.icu.utf8.out     |  16 ++
 src/test/regress/sql/collate.icu.utf8.sql     |   5 +
 10 files changed, 395 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ecd9aa73ef..d137159532 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9775,6 +9775,22 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-icu-locale-validation" xreflabel="icu_locale_validation">
+      <term><varname>icu_locale_validation</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>icu_locale_validation</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        If set to <literal>true</literal>, validates that ICU locale strings
+        are well-formed, and that they represent valid locale in ICU. Does not
+        cause any locale string to be rejected during <xref
+        linkend="pgupgrade"/>. The default is <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index eb62d285ea..d1fc46777c 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -47,6 +47,8 @@ typedef struct
 	int			enc;			/* encoding */
 } CollAliasData;
 
+extern bool icu_locale_validation;
+
 
 /*
  * CREATE COLLATION
@@ -240,10 +242,45 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
+			char	*langtag;
+			int		 elevel = WARNING;
+
+			/* can't reject previously-accepted locales during upgrade */
+			if (!IsBinaryUpgrade && icu_locale_validation)
+				elevel = ERROR;
+
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			check_icu_locale(colliculocale);
+			langtag = get_icu_locale(colliculocale);
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, colliculocale)));
+
+				if (!icu_collator_exists(langtag))
+					ereport(elevel,
+							(errmsg("ICU collator for language tag \"%s\" not found",
+									langtag)));
+
+				colliculocale = langtag;
+			}
+			else
+			{
+				ereport(elevel,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								colliculocale)));
+			}
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
@@ -556,26 +593,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -938,7 +955,11 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name);
+			if (langtag == NULL)
+				ereport(ERROR,
+						(errmsg("could not convert locale name \"%s\" to language tag",
+								name)));
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
 
 			/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index ef05633bb0..805e754dcb 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -109,6 +109,7 @@ typedef struct CreateDBRelInfo
 	bool		permanent;		/* relation is permanent or unlogged */
 } CreateDBRelInfo;
 
+extern bool icu_locale_validation;
 
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
@@ -1029,6 +1030,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
+		char	*langtag;
+		int		 elevel = WARNING;
+
+		/* can't reject previously-accepted locales during upgrade */
+		if (!IsBinaryUpgrade && icu_locale_validation)
+			elevel = ERROR;
+
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1045,6 +1054,31 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					 errmsg("ICU locale must be specified")));
 
 		check_icu_locale(dbiculocale);
+		langtag = get_icu_locale(dbiculocale);
+		if (langtag)
+		{
+			ereport(NOTICE,
+					(errmsg("using language tag \"%s\" for locale \"%s\"",
+							langtag, dbiculocale)));
+
+			if (!icu_collator_exists(langtag))
+				ereport(elevel,
+						(errmsg("ICU collator for language tag \"%s\" not found",
+								langtag)));
+
+			dbiculocale = langtag;
+		}
+		else
+		{
+			ereport(elevel,
+					(errmsg("could not convert locale \"%s\" to language tag",
+							dbiculocale)));
+		}
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
@@ -1463,6 +1497,53 @@ check_encoding_locale_matches(int encoding, const char *collate, const char *cty
 						   pg_encoding_to_char(collate_encoding))));
 }
 
+/*
+ * Given the input ICU locale string, return a new string in a form suitable
+ * for storing in the catalog.
+ *
+ * Ordinarily this just converts to a language tag, but we need to make an
+ * allowance for invalid locale strings that come from earlier versions of
+ * Postgres while upgrading.
+ *
+ * Converting to a language tag performs "level 2 canonicalization". In
+ * addition to producing a consistent result format, level 2 canonicalization
+ * is able to more accurately interpret different input locale string formats,
+ * such as POSIX and .NET IDs. But prior to Postgres version 16, input locale
+ * strings were not canonicalized; the raw string provided by the user was
+ * stored in the catalog and passed directly to ucol_open().
+ *
+ * The raw string may resolve to the wrong actual collator when passed to
+ * directly ucol_open(), but indexes in older versions may depend on that
+ * actual collator. Therefore, during binary upgrade, we preserve the invalid
+ * raw string if it resolves to a different actual collator than the language
+ * tag. If it resolves to the same actual collator, then we proceed using the
+ * language tag.
+ */
+char *
+get_icu_locale(const char *requested_locale)
+{
+#ifdef USE_ICU
+	char *lang_tag = icu_language_tag(requested_locale);
+
+	if (lang_tag != NULL && IsBinaryUpgrade &&
+		!check_equivalent_icu_locales(requested_locale, lang_tag))
+	{
+		ereport(WARNING,
+				(errmsg("language tag \"%s\" resolves to different actual collator "
+						"than raw locale string \"%s\"",
+						lang_tag, requested_locale)));
+		pfree(lang_tag);
+		return NULL;
+	}
+
+	return lang_tag;
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
+}
+
 /* Error cleanup callback for createdb */
 static void
 createdb_failure_callback(int code, Datum arg)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 059e4fd79f..1764e51645 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1945,15 +1945,12 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 	}
 }
 
-#endif							/* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 	UErrorCode	status;
 
@@ -1967,13 +1964,226 @@ check_icu_locale(const char *icu_locale)
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
 		icu_set_collation_attributes(collator, icu_locale);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
+/*
+ * Test if the given locales resolve to the same actual collator with the same
+ * attributes and version.
+ */
+bool
+check_equivalent_icu_locales(const char *locale1, const char *locale2)
+{
+	const UColAttribute collAtt[]	   = {
+		UCOL_FRENCH_COLLATION,
+		UCOL_ALTERNATE_HANDLING,
+		UCOL_CASE_FIRST,
+		UCOL_CASE_LEVEL,
+		UCOL_NORMALIZATION_MODE,
+		UCOL_DECOMPOSITION_MODE,
+		UCOL_STRENGTH,
+		UCOL_HIRAGANA_QUATERNARY_MODE,
+		UCOL_NUMERIC_COLLATION};
+	int              n_collAtt		   = sizeof(collAtt)/sizeof(*collAtt);
+	const char		*actual1, *actual2;
+	UVersionInfo     versionInfo1;
+	UVersionInfo	 versionInfo2;
+	char             version1[U_MAX_VERSION_STRING_LENGTH];
+	char             version2[U_MAX_VERSION_STRING_LENGTH];
+	UCollator		*collator1		   = NULL;
+	UCollator		*collator2		   = NULL;
+	UErrorCode		 status;
+	bool			 result			   = false;
+
+	/*
+	 * Be careful not to return without closing the collators.
+	 */
+
+	status = U_ZERO_ERROR;
+	collator1 = ucol_open(locale1, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	status = U_ZERO_ERROR;
+	collator2 = ucol_open(locale2, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	/* actual locale */
+	status = U_ZERO_ERROR;
+	actual1 = ucol_getLocaleByType(collator1, ULOC_ACTUAL_LOCALE, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	status = U_ZERO_ERROR;
+	actual2 = ucol_getLocaleByType(collator2, ULOC_ACTUAL_LOCALE, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	if (strcmp(actual1, actual2) != 0)
+		goto cleanup;
+
+	/* version */
+	ucol_getVersion(collator1, versionInfo1);
+	u_versionToString(versionInfo1, version1);
+	ucol_getVersion(collator2, versionInfo2);
+	u_versionToString(versionInfo2, version2);
+	if (strcmp(version1, version2) != 0)
+		goto cleanup;
+
+	/* attributes */
+	for (int i = 0; i < n_collAtt; i++)
+	{
+		UColAttributeValue val1, val2;
+
+		status = U_ZERO_ERROR;
+		val1 = ucol_getAttribute(collator1, collAtt[i], &status);
+		if (U_FAILURE(status))
+			goto cleanup;
+
+		status = U_ZERO_ERROR;
+		val2 = ucol_getAttribute(collator2, collAtt[i], &status);
+		if (U_FAILURE(status))
+			goto cleanup;
+
+		if (val1 != val2)
+			goto cleanup;
+	}
+
+	/* passed all the best-effort checks for equivalence */
+	result = true;
+
+cleanup:
+	if (collator2)
+		ucol_close(collator2);
+	if (collator1)
+		ucol_close(collator1);
+
+	return result;
+}
+
+static char *
+get_lang_part(const char *locale)
+{
+	UErrorCode	 status;
+	char		*lang_part;
+	int32_t		 len;
+
+	status = U_ZERO_ERROR;
+	len  = uloc_getLanguage(locale, NULL, 0, &status);
+	lang_part = palloc(len + 1);
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(locale, lang_part, len + 1, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("could not get language name from locale string \"%s\": %s",
+						locale, u_errorName(status))));
+	return lang_part;
+}
+
+/*
+ * Check if the locale string represents the root locale. It represents the
+ * root locale if the language part is "und", "root", or the empty string.
+ */
+static bool
+icu_is_root_locale(const char *locale)
+{
+	char		*lang_part = get_lang_part(locale);
+	bool		 result	   = false;
+
+	if (strcasecmp(lang_part, "root") == 0 ||
+		strcasecmp(lang_part, "und") == 0 ||
+		strcasecmp(lang_part, "") == 0)
+		result = true;
+
+	pfree(lang_part);
+	return result;
+}
+
+/*
+ * Check if the locale string represents the C/POSIX locale, which is not
+ * handled by ICU level 2 canonicalization.
+ */
+static bool
+icu_is_c_posix_locale(const char *locale)
+{
+	char		*lang_part = get_lang_part(locale);
+	bool		 result	   = false;
+
+	if (strcasecmp(lang_part, "c") == 0 ||
+		strcasecmp(lang_part, "posix") == 0)
+		result = true;
+
+	pfree(lang_part);
+	return result;
+}
+
+bool
+icu_collator_exists(const char *requested_locale)
+{
+	UCollator	*collator;
+	const char	*valid_locale = NULL;
+	UErrorCode	 status;
+	bool		 result		  = false;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(requested_locale, &status);
+	if (U_FAILURE(status))
+		return false;
+
+	status = U_ZERO_ERROR;
+	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	if (icu_is_root_locale(requested_locale) ||
+		!icu_is_root_locale(valid_locale))
+		result = true;
+
+cleanup:
+	ucol_close(collator);
+	return result;
+}
+
+/*
+ * Return the BCP47 language tag representation of the requested locale; or
+ * NULL if a problem is encountered.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent result format,
+ * level 2 canonicalization is able to more accurately interpret different
+ * input locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *requested_locale)
+{
+	UErrorCode	 status;
+	char		*result;
+	int32_t		 len;
+	const bool	 strict = true;
+
+	/* c/posix locales aren't handled by uloc_getLanguageTag() */
+	if (icu_is_c_posix_locale(requested_locale))
+		return pstrdup("en-US-u-va-posix");
+
+	status = U_ZERO_ERROR;
+	len = uloc_toLanguageTag(requested_locale, NULL, 0, strict, &status);
+
+	result = palloc(len + 1);
+
+	status = U_ZERO_ERROR;
+	uloc_toLanguageTag(requested_locale, result, len + 1, strict, &status);
+	if (U_FAILURE(status))
+	{
+		pfree(result);
+		return NULL;
+	}
+
+	return result;
+}
+
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..4aa53259dc 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -481,6 +481,7 @@ char	   *event_source;
 
 bool		row_security;
 bool		check_function_bodies = true;
+bool		icu_locale_validation = false;
 
 /*
  * This GUC exists solely for backward compatibility, check its definition for
@@ -1586,6 +1587,15 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"icu_locale_validation", PGC_SUSET, CLIENT_CONN_LOCALE,
+			gettext_noop("Validate ICU locale strings."),
+			NULL
+		},
+		&icu_locale_validation,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"array_nulls", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("Enable input of NULL elements in arrays."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..cff927e8be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,8 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_locale_validation = off		# validate ICU locale strings
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 5fbc3ca752..0f0e827ff2 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -33,5 +33,6 @@ extern char *get_database_name(Oid dbid);
 extern bool have_createdb_privilege(void);
 
 extern void check_encoding_locale_matches(int encoding, const char *collate, const char *ctype);
+extern char *get_icu_locale(const char *requested_locale);
 
 #endif							/* DBCOMMANDS_H */
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index cede43440b..e1bb017a54 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -104,8 +104,12 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern bool check_equivalent_icu_locales(const char *locale1,
+										 const char *locale2);
 #endif
 extern void check_icu_locale(const char *icu_locale);
+extern bool icu_collator_exists(const char *requested_locale);
+extern char *icu_language_tag(const char *requested_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
 extern size_t wchar2char(char *to, const wchar_t *from, size_t tolen,
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 4354dc07b8..4d18ca8a85 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1019,6 +1019,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1033,9 +1034,12 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
+WARNING:  ICU collator for language tag "nonsense" not found
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1144,6 +1148,7 @@ drop type textrange_en_us;
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+NOTICE:  using language tag "und-u-kc-ks-level1" for locale "@colStrength=primary;colCaseLevel=yes"
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
@@ -1151,6 +1156,7 @@ SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignor
 (1 row)
 
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+NOTICE:  using language tag "und-u-kb" for locale "@colBackwards=yes"
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1158,7 +1164,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1166,6 +1174,7 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
@@ -1173,6 +1182,7 @@ SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE te
 (1 row)
 
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+NOTICE:  using language tag "und-u-kn" for locale "@colNumeric=yes"
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1184,6 +1194,7 @@ ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUME
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1192,7 +1203,9 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1242,7 +1255,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1710,6 +1725,7 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+NOTICE:  using language tag "und-u-kc-ks-level1" for locale "@colStrength=primary;colCaseLevel=yes"
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index b0ddc7db44..3a7e7202af 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -357,6 +357,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -370,6 +372,9 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
 
-- 
2.34.1

#18

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#17)

1 attachment(s)

Re: ICU locale validation / canonicalization

On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:

New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

New patch attached.

Now it always preserves the exact locale string during pg_upgrade, and
does not attempt to canonicalize it. Before it was trying to be clever
by determining if the language tag was finding the same collator as the
original string -- I didn't find a problem with that, but it just
seemed a bit too clever. So, only newly-created locales and databases
have the ICU locale string canonicalized to a language tag.

Also, I added a SQL function pg_icu_language_tag() that can convert
locale strings to language tags, and check whether they exist or not.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v3-0001-ICU-locale-string-canonicalization-and-validation.patchtext/x-patch; charset=UTF-8; name=v3-0001-ICU-locale-string-canonicalization-and-validation.patchDownload

From e7b67f0410a18c32cf271532f7a4719cf8c1c560 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Feb 2023 23:05:08 -0800
Subject: [PATCH v3] ICU locale string canonicalization and validation.

Before storing the locale name in the catalog, convert to a BCP47
language tag. The language tag is an unambiguous representation of the
locale that holds all of the necessary information.

Also, add a new GUC icu_locale_validation. When set to true, it raises
an ERROR if the locale string is malformed or if it is not a valid
locale in ICU.

During pg_upgrade, the previous locale string is preserved verbatim.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com
---
 doc/src/sgml/config.sgml                      |  16 +++
 doc/src/sgml/func.sgml                        |  17 +++
 src/backend/commands/collationcmds.c          | 101 +++++++++++---
 src/backend/commands/dbcommands.c             |  39 ++++++
 src/backend/utils/adt/pg_locale.c             | 128 ++++++++++++++++--
 src/backend/utils/misc/guc_tables.c           |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/bin/pg_dump/t/002_pg_dump.pl              |   4 +-
 src/include/catalog/pg_proc.dat               |   5 +
 src/include/commands/dbcommands.h             |   1 +
 src/include/utils/pg_locale.h                 |   4 +
 .../regress/expected/collate.icu.utf8.out     |  87 +++++++++++-
 src/test/regress/sql/collate.icu.utf8.sql     |  24 +++-
 13 files changed, 405 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6..f7fdb54a1b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9776,6 +9776,22 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-icu-locale-validation" xreflabel="icu_locale_validation">
+      <term><varname>icu_locale_validation</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>icu_locale_validation</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        If set to <literal>true</literal>, validates that ICU locale strings
+        are well-formed, and that they represent valid locale in ICU. Does not
+        cause any locale string to be rejected during <xref
+        linkend="pgupgrade"/>. The default is <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0cbdf63632..e2604c41ad 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27406,6 +27406,23 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         Use of this function is restricted to superusers.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_icu_language_tag</primary>
+        </indexterm>
+        <function>pg_icu_language_tag</function> ( <parameter>locale</parameter> <type>text</type>, <parameter>validate</parameter> <type>boolean</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Canonicalizes the given <parameter>locale</parameter> string into a
+        BCP 47 language tag (see <xref
+        linkend="collation-managing-create-icu"/>). If
+        <parameter>validate</parameter> is <literal>true</literal>, check that
+        the resulting language tag represents a valid locale in ICU.
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index eb62d285ea..8edc22f579 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -47,6 +47,8 @@ typedef struct
 	int			enc;			/* encoding */
 } CollAliasData;
 
+extern bool icu_locale_validation;
+
 
 /*
  * CREATE COLLATION
@@ -240,10 +242,50 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
+			char	*langtag;
+
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			check_icu_locale(colliculocale);
+
+			/*
+			 * During binary upgrade, preserve locale string verbatim.
+			 * Otherwise, canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+				int elevel = icu_locale_validation ? ERROR : WARNING;
+
+				langtag = icu_language_tag(colliculocale);
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					if (!icu_collator_exists(langtag))
+						ereport(elevel,
+								(errmsg("ICU collator for language tag \"%s\" not found",
+										langtag)));
+
+					colliculocale = langtag;
+				}
+				else
+				{
+					ereport(elevel,
+							(errmsg("could not convert locale \"%s\" to language tag",
+									colliculocale)));
+				}
+			}
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
@@ -556,26 +598,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -938,7 +960,11 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name);
+			if (langtag == NULL)
+				ereport(ERROR,
+						(errmsg("could not convert locale name \"%s\" to language tag",
+								name)));
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
 
 			/*
@@ -996,3 +1022,36 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(ncreated);
 }
+
+/*
+ * pg_icu_language_tag
+ *
+ * Return the BCP47 language tag representation of the given locale string.
+ */
+Datum
+pg_icu_language_tag(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+	text	*locale_text = PG_GETARG_TEXT_PP(0);
+	bool	 validate	 = PG_GETARG_BOOL(1);
+	char	*locale_cstr = text_to_cstring(locale_text);
+	char	*langtag	 = icu_language_tag(locale_cstr);
+
+	if (langtag == NULL)
+		ereport(ERROR,
+				(errmsg("could not convert locale \"%s\" to language tag",
+						locale_cstr)));
+
+	if (validate && !icu_collator_exists(langtag))
+		ereport(ERROR,
+				(errmsg("ICU collator for language tag \"%s\" not found",
+						langtag)));
+
+	PG_RETURN_TEXT_P(cstring_to_text(langtag));
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ICU is not supported in this build")));
+	PG_RETURN_NULL();
+#endif
+}
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index a0259cc593..a19881ad9c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -109,6 +109,7 @@ typedef struct CreateDBRelInfo
 	bool		permanent;		/* relation is permanent or unlogged */
 } CreateDBRelInfo;
 
+extern bool icu_locale_validation;
 
 /* non-export function prototypes */
 static void createdb_failure_callback(int code, Datum arg);
@@ -1029,6 +1030,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
+		char	*langtag;
+
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1045,6 +1049,41 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					 errmsg("ICU locale must be specified")));
 
 		check_icu_locale(dbiculocale);
+
+		/*
+		 * During binary upgrade, preserve locale string verbatim. Otherwise,
+		 * canonicalize to a language tag.
+		 */
+		if (!IsBinaryUpgrade)
+		{
+			int elevel = icu_locale_validation ? ERROR : WARNING;
+
+			langtag = icu_language_tag(dbiculocale);
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				if (!icu_collator_exists(langtag))
+					ereport(elevel,
+							(errmsg("ICU collator for language tag \"%s\" not found",
+									langtag)));
+
+				dbiculocale = langtag;
+			}
+			else
+			{
+				ereport(elevel,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								dbiculocale)));
+			}
+		}
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 4aa5eaa984..6378a6d5ca 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2690,15 +2690,12 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 	}
 }
 
-#endif							/* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 	UErrorCode	status;
 
@@ -2712,13 +2709,128 @@ check_icu_locale(const char *icu_locale)
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
 		icu_set_collation_attributes(collator, icu_locale);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
+/*
+ * Check if the locale string represents the root locale. It represents the
+ * root locale if the language part is "und", "root", or the empty string.
+ */
+static bool
+icu_is_root_locale(const char *locale)
+{
+	UErrorCode	 status;
+	char		*lang_part;
+	int32_t		 len;
+	bool		 result	   = false;
+
+	status = U_ZERO_ERROR;
+	len  = uloc_getLanguage(locale, NULL, 0, &status);
+	lang_part = palloc(len + 1);
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(locale, lang_part, len + 1, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("could not get language name from locale string \"%s\": %s",
+						locale, u_errorName(status))));
+
+	if (pg_strcasecmp(lang_part, "root") == 0 ||
+		pg_strcasecmp(lang_part, "und") == 0 ||
+		pg_strcasecmp(lang_part, "") == 0)
+		result = true;
+
+	pfree(lang_part);
+	return result;
+}
+
+/*
+ * Special case to check for locales like "POSIX" or "C.UTF-8". These are not
+ * handled by ICU level 2 canonicalization.
+ */
+static bool
+icu_is_c_posix(const char *locale)
+{
+	if (pg_strcasecmp(locale, "c") == 0 ||
+		pg_strncasecmp(locale, "c.", sizeof("c.") - 1) == 0 ||
+		pg_strcasecmp(locale, "posix") == 0 ||
+		pg_strncasecmp(locale, "posix.", sizeof("posix.") - 1) == 0)
+		return true;
+
+	return false;
+}
+
+/*
+ * Check if the given language tag resolves to a valid locale in ICU.
+ *
+ * If the resulting collator falls back to the root locale, and the root
+ * locale was not explicitly requested, return false.
+ */
+bool
+icu_collator_exists(const char *langtag)
+{
+	UCollator	*collator;
+	const char	*valid_locale = NULL;
+	UErrorCode	 status;
+	bool		 result		  = false;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(langtag, &status);
+	if (U_FAILURE(status))
+		return false;
+
+	status = U_ZERO_ERROR;
+	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status);
+	if (U_FAILURE(status))
+		goto cleanup;
+
+	if (icu_is_root_locale(langtag) ||
+		!icu_is_root_locale(valid_locale))
+		result = true;
+
+cleanup:
+	ucol_close(collator);
+	return result;
+}
+
+/*
+ * Return the BCP47 language tag representation of the requested locale; or
+ * NULL if a problem is encountered.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent result format,
+ * level 2 canonicalization is able to more accurately interpret different
+ * input locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *requested_locale)
+{
+	UErrorCode	 status;
+	char		*result;
+	int32_t		 len;
+	const bool	 strict = true;
+
+	/* c/posix locales aren't handled by uloc_getLanguageTag() */
+	if (icu_is_c_posix(requested_locale))
+		return pstrdup("en-US-u-va-posix");
+
+	status = U_ZERO_ERROR;
+	len = uloc_toLanguageTag(requested_locale, NULL, 0, strict, &status);
+
+	result = palloc(len + 1);
+
+	status = U_ZERO_ERROR;
+	uloc_toLanguageTag(requested_locale, result, len + 1, strict, &status);
+	if (U_FAILURE(status))
+	{
+		pfree(result);
+		return NULL;
+	}
+
+	return result;
+}
+
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..4aa53259dc 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -481,6 +481,7 @@ char	   *event_source;
 
 bool		row_security;
 bool		check_function_bodies = true;
+bool		icu_locale_validation = false;
 
 /*
  * This GUC exists solely for backward compatibility, check its definition for
@@ -1586,6 +1587,15 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"icu_locale_validation", PGC_SUSET, CLIENT_CONN_LOCALE,
+			gettext_noop("Validate ICU locale strings."),
+			NULL
+		},
+		&icu_locale_validation,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"array_nulls", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("Enable input of NULL elements in arrays."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..cff927e8be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,8 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_locale_validation = off		# validate ICU locale strings
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 72b19ee6cd..cb34f4af7f 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1721,9 +1721,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index e2a7642a2b..7d5641572e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11806,6 +11806,11 @@
   proname => 'pg_database_collation_actual_version', procost => '100',
   provolatile => 'v', prorettype => 'text', proargtypes => 'oid',
   prosrc => 'pg_database_collation_actual_version' },
+{ oid => '6273',
+  descr => 'get BCP47 language tag representation of locale',
+  proname => 'pg_icu_language_tag', procost => '100',
+  provolatile => 's', prorettype => 'text', proargtypes => 'text bool',
+  prosrc => 'pg_icu_language_tag' },
 
 # system management/monitoring related functions
 { oid => '3353', descr => 'list files in the log directory',
diff --git a/src/include/commands/dbcommands.h b/src/include/commands/dbcommands.h
index 5fbc3ca752..0f0e827ff2 100644
--- a/src/include/commands/dbcommands.h
+++ b/src/include/commands/dbcommands.h
@@ -33,5 +33,6 @@ extern char *get_database_name(Oid dbid);
 extern bool have_createdb_privilege(void);
 
 extern void check_encoding_locale_matches(int encoding, const char *collate, const char *ctype);
+extern char *get_icu_locale(const char *requested_locale);
 
 #endif							/* DBCOMMANDS_H */
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index b8f22875a8..58cd4297bb 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -118,8 +118,12 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern bool check_equivalent_icu_locales(const char *locale1,
+										 const char *locale2);
 #endif
 extern void check_icu_locale(const char *icu_locale);
+extern bool icu_collator_exists(const char *requested_locale);
+extern char *icu_language_tag(const char *requested_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
 extern size_t wchar2char(char *to, const wchar_t *from, size_t tolen,
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 4354dc07b8..927718a937 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -11,6 +11,63 @@ SELECT getdatabaseencoding() <> 'UTF8' OR
 SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
+-- test language tag canonicalization
+SELECT pg_icu_language_tag('en_US', true);
+ pg_icu_language_tag 
+---------------------
+ en-US
+(1 row)
+
+SELECT pg_icu_language_tag('nonsense', false);
+ pg_icu_language_tag 
+---------------------
+ nonsense
+(1 row)
+
+SELECT pg_icu_language_tag('nonsense', true); -- error
+ERROR:  ICU collator for language tag "nonsense" not found
+SELECT pg_icu_language_tag('C.UTF-8', true);
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('POSIX', true);
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('en_US_POSIX', true);
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('@colStrength=secondary', true);
+ pg_icu_language_tag 
+---------------------
+ und-u-ks-level2
+(1 row)
+
+SELECT pg_icu_language_tag('', true);
+ pg_icu_language_tag 
+---------------------
+ und
+(1 row)
+
+SELECT pg_icu_language_tag('fr_CA.UTF-8', true);
+ pg_icu_language_tag 
+---------------------
+ fr-CA
+(1 row)
+
+SELECT pg_icu_language_tag('en_US@colStrength=primary', true);
+ pg_icu_language_tag 
+---------------------
+ en-US-u-ks-level1
+(1 row)
+
 CREATE TABLE collate_test1 (
     a int,
     b text COLLATE "en-x-icu" NOT NULL
@@ -1019,6 +1076,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1033,9 +1091,24 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+SET icu_locale_validation = TRUE;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
+NOTICE:  using language tag "nonsense" for locale "nonsense"
+ERROR:  ICU collator for language tag "nonsense" not found
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense@colStrength=primary');
+NOTICE:  using language tag "nonsense-u-ks-level1" for locale "nonsense@colStrength=primary"
+WARNING:  ICU collator for language tag "nonsense-u-ks-level1" not found
+SELECT colliculocale FROM pg_collation WHERE collname='testx';
+    colliculocale     
+----------------------
+ nonsense-u-ks-level1
+(1 row)
+
+DROP COLLATION testx;
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1144,6 +1217,7 @@ drop type textrange_en_us;
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+NOTICE:  using language tag "und-u-kc-ks-level1" for locale "@colStrength=primary;colCaseLevel=yes"
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
@@ -1151,6 +1225,7 @@ SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignor
 (1 row)
 
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+NOTICE:  using language tag "und-u-kb" for locale "@colBackwards=yes"
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1158,7 +1233,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1166,6 +1243,7 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
@@ -1173,6 +1251,7 @@ SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE te
 (1 row)
 
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+NOTICE:  using language tag "und-u-kn" for locale "@colNumeric=yes"
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1184,6 +1263,7 @@ ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUME
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1192,7 +1272,9 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1242,7 +1324,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1710,6 +1794,7 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+NOTICE:  using language tag "und-u-kc-ks-level1" for locale "@colStrength=primary;colCaseLevel=yes"
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index b0ddc7db44..26096f0627 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -15,6 +15,17 @@ SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
 
+-- test language tag canonicalization
+SELECT pg_icu_language_tag('en_US', true);
+SELECT pg_icu_language_tag('nonsense', false);
+SELECT pg_icu_language_tag('nonsense', true); -- error
+SELECT pg_icu_language_tag('C.UTF-8', true);
+SELECT pg_icu_language_tag('POSIX', true);
+SELECT pg_icu_language_tag('en_US_POSIX', true);
+SELECT pg_icu_language_tag('@colStrength=secondary', true);
+SELECT pg_icu_language_tag('', true);
+SELECT pg_icu_language_tag('fr_CA.UTF-8', true);
+SELECT pg_icu_language_tag('en_US@colStrength=primary', true);
 
 CREATE TABLE collate_test1 (
     a int,
@@ -357,6 +368,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -370,8 +383,17 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+
+SET icu_locale_validation = TRUE;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense@colStrength=primary');
+SELECT colliculocale FROM pg_collation WHERE collname='testx';
+DROP COLLATION testx;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
-- 
2.34.1

#19

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#18)

Re: ICU locale validation / canonicalization

On 28.02.23 06:57, Jeff Davis wrote:

On Mon, 2023-02-20 at 15:23 -0800, Jeff Davis wrote:

New patch attached. The new patch also includes a GUC that (when
enabled) validates that the collator is actually found.

New patch attached.

Now it always preserves the exact locale string during pg_upgrade, and
does not attempt to canonicalize it. Before it was trying to be clever
by determining if the language tag was finding the same collator as the
original string -- I didn't find a problem with that, but it just
seemed a bit too clever. So, only newly-created locales and databases
have the ICU locale string canonicalized to a language tag.

Also, I added a SQL function pg_icu_language_tag() that can convert
locale strings to language tags, and check whether they exist or not.

This patch appears to do about three things at once, and it's not clear
exactly where the boundaries are between them and which ones we might
actually want. And I think the terminology also gets mixed up a bit,
which makes following this harder.

1. Canonicalizing the locale string. This is presumably what
uloc_canonicalize() does, which the patch doesn't actually use. What
are examples of what this does? Does the patch actually do this?

2. Converting the locale string to BCP 47 format. This converts
'de@collation=phonebook' to 'de-u-co-phonebk'. This is what
uloc_getLanguageTag() does.

3. Validating the locale string, to reject faulty input.

What are the relationships between these?

I don't understand how the validation actually happens in your patch.
Does uloc_getLanguageTag() do the validation also?

Can you do canonicalization without converting to language tag?

Can you do validation of un-canonicalized locale names?

What is the guidance for the use of the icu_locale_validation GUC?

The description throws in yet another term: "validates that ICU locale
strings are well-formed". What is "well-formed"? How does that relate
to the other concepts?

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale =
'de@collation=phonebook');
NOTICE: 00000: using language tag "de-u-co-phonebk" for locale
"de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to migrate
people to the "new" spellings, but the old ones aren't wrong. So this
should be a separate consideration, with an option, and it would require
various updates in the documentation. It also doesn't appear to address
how to handle ICU before version 54.

But, see earlier questions, are these three things all connected somehow?

#20

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#19)

Re: ICU locale validation / canonicalization

On Thu, 2023-03-09 at 09:46 +0100, Peter Eisentraut wrote:

This patch appears to do about three things at once, and it's not
clear
exactly where the boundaries are between them and which ones we might
actually want. And I think the terminology also gets mixed up a bit,
which makes following this harder.

1. Canonicalizing the locale string. This is presumably what
uloc_canonicalize() does, which the patch doesn't actually use. What
are examples of what this does? Does the patch actually do this?

Both uloc_canonicalize() and uloc_getLanguageTag() do Level 2
Canonicalization, which is described here:

https://unicode-org.github.io/icu/userguide/locale/#canonicalization

2. Converting the locale string to BCP 47 format. This converts
'de@collation=phonebook' to 'de-u-co-phonebk'. This is what
uloc_getLanguageTag() does.

Yes, though uloc_getLanguageTag() also canonicalizes. I consider
converting to the language tag a part of "canonicalization", because
it's the canonical form we agreed on in this thread.

3. Validating the locale string, to reject faulty input.

Canonicalization doesn't make sure the locale actually exists in ICU,
so it's easy to make a typo like "jp_JP" instead of "ja_JP". After
canonicalizing to a language tag, the former is "jp-JP" (resolving to
the collator with valid locale "root") and the latter is "ja-JP"
(resolving to the collator with valid locale "ja"). The former is
clearly a mistake, and I call catching that mistake "validation".

If the user specifies something other than the root locale (i.e. not
"root", "und", or ""), and the locale resolves to a collator with a
valid locale of "root", then this patch considers that to be a mistake
and issues a WARNING (upgraded to ERROR if the GUC
icu_locale_validation is true).

What are the relationships between these?

1 & 2 are closely related. If we canonicalize, we need to pick one
canonical form: either BCP 47 or ICU format locale IDs.

3 is related, but can be seen as an independent change.

I don't understand how the validation actually happens in your patch.
Does uloc_getLanguageTag() do the validation also?

Using the above definition of "validation" it happens inside
icu_collator_exists().

Can you do canonicalization without converting to language tag?

If we used uloc_canonicalize(), it would give us ICU format locale IDs,
and that would be a valid thing to do; and we could switch the
canonical form from ICU format locale IDs to BCP 47 in a separate
patch. I don't have a strong opinion, but if we're going to
canonicalize, I think it makes sense to go straight to language tags.

Can you do validation of un-canonicalized locale names?

Yes, though I feel like an un-canonicalized name is less stable in
meaning, and so validation on that name may also be less stable.

For instance, if we don't canonicalize "fr_CA.UTF-8", it resolves to
plain "fr"; but if we do canonicalize it first, it resolves to "fr-CA".
Will the uncanonicalized name always resolve to "fr"? I'm not sure,
because the documentation says that ucol_open() expects either an ICU
format locale ID or, preferably, a language tag.

So they are technically independently useful changes, but I would
recommend that canonicalization goes in first.

What is the guidance for the use of the icu_locale_validation GUC?

If an error when creating a new collation or database due to a bad
locale name would be highly disruptive, leave it false. If such an
error would be helpful to make sure you get the locale you expect, then
turn it on. In practice, existing important production systems would
leave it off; new systems could turn it on to help avoid
misconfigurations/mistakes.

The description throws in yet another term: "validates that ICU
locale
strings are well-formed". What is "well-formed"? How does that
relate
to the other concepts?

Good point, I don't think I need to redefine "validation". Maybe I
should just describe it as elevating canonicalization or validation
problems from WARNING to ERROR.

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale =
'de@collation=phonebook');
NOTICE: 00000: using language tag "de-u-co-phonebk" for locale
"de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to
migrate
people to the "new" spellings, but the old ones aren't wrong.

I see what you mean; I'm not sure the best thing to do here. We are
adjusting the string passed by the user, and it feels like some users
might want to know that. It's a NOTICE, not a WARNING, so it's not
meant to imply that it's wrong.

But at the same time I can see it being annoying or confusing. If it's
confusing, perhaps a wording change and documentation would improve it?
If it's annoying, we might need to have an option and/or a different
log level?

It also doesn't appear to address
how to handle ICU before version 54.

Do you have a specific concern here?

Regards,
Jeff Davis

#21

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#20)

Re: ICU locale validation / canonicalization

On 09.03.23 21:17, Jeff Davis wrote:

Personally, I'm not on board with this behavior:

=> CREATE COLLATION test (provider = icu, locale =
'de@collation=phonebook');
NOTICE: 00000: using language tag "de-u-co-phonebk" for locale
"de@collation=phonebook"

I mean, maybe that is a thing we want to do somehow sometime, to
migrate
people to the "new" spellings, but the old ones aren't wrong.

I see what you mean; I'm not sure the best thing to do here. We are
adjusting the string passed by the user, and it feels like some users
might want to know that. It's a NOTICE, not a WARNING, so it's not
meant to imply that it's wrong.

For clarification, I wasn't complaining about the notice, but about the
automatic conversion from old-style ICU locale ID to language tag.

It also doesn't appear to address
how to handle ICU before version 54.

Do you have a specific concern here?

What we had discussed a while ago in one of these threads is that ICU
before version 54 do not support keyword lists, and we have custom code
to do that parsing ourselves, but we don't have code to do the same for
language tags. Therefore, if I understand this right, if we
automatically convert ICU locale IDs to language tags, as shown above,
then we break support for such locales in those older ICU versions.

#22

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#21)

Re: ICU locale validation / canonicalization

On Mon, 2023-03-13 at 08:25 +0100, Peter Eisentraut wrote:

For clarification, I wasn't complaining about the notice, but about
the
automatic conversion from old-style ICU locale ID to language tag.

Canonicalization means that we pick one format, and automatically
convert to it, right?

What we had discussed a while ago in one of these threads is that ICU
before version 54 do not support keyword lists, and we have custom
code
to do that parsing ourselves, but we don't have code to do the same
for
language tags. Therefore, if I understand this right, if we
automatically convert ICU locale IDs to language tags, as shown
above,
then we break support for such locales in those older ICU versions.

Right. In versions 53 and earlier, and during pg_upgrade, we would just
preserve the locale string as entered.

Alternatively, we could canonicalize to the ICU format locale IDs. Or
add something to parse out the attributes from a language tag.

Regards,
Jeff Davis

#23

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#22)

Re: ICU locale validation / canonicalization

On 13.03.23 16:31, Jeff Davis wrote:

What we had discussed a while ago in one of these threads is that ICU
before version 54 do not support keyword lists, and we have custom
code
to do that parsing ourselves, but we don't have code to do the same
for
language tags. Therefore, if I understand this right, if we
automatically convert ICU locale IDs to language tags, as shown
above,
then we break support for such locales in those older ICU versions.

Right. In versions 53 and earlier, and during pg_upgrade, we would just
preserve the locale string as entered.

Another issue that came to mind: Right now, you can, say, develop SQL
schemas on a newer ICU version, say, your laptop, and then deploy them
on a server running an older ICU version. If we have a cutoff beyond
which we convert ICU locale IDs to language tags, then this won't work
anymore for certain combinations. And RHEL/CentOS 7 is still pretty
popular.

#24

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#23)

1 attachment(s)

Re: ICU locale validation / canonicalization

On Tue, 2023-03-14 at 08:08 +0100, Peter Eisentraut wrote:

Another issue that came to mind: Right now, you can, say, develop
SQL
schemas on a newer ICU version, say, your laptop, and then deploy
them
on a server running an older ICU version. If we have a cutoff beyond
which we convert ICU locale IDs to language tags, then this won't
work
anymore for certain combinations. And RHEL/CentOS 7 is still pretty
popular.

If we just uloc_canonicalize() in icu_set_collation_attributes() then
versions 50-53 can support language tags. Patch attached.

One loose end is that we really should support language tags like "und"
in those older versions (54 and earlier). Your commit d72900bded
avoided the problem, but perhaps we should fix it by looking for "und"
and replacing it with "root" while opening, or something.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v1-0001-Support-language-tags-in-older-ICU-versions-53-an.patchtext/x-patch; charset=UTF-8; name=v1-0001-Support-language-tags-in-older-ICU-versions-53-an.patchDownload

From ba1f0794d3be3fe7a9a365f4b6312bef9c215728 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 09:58:29 -0700
Subject: [PATCH v1] Support language tags in older ICU versions (53 and
 earlier).

By calling uloc_canonicalize() before parsing the attributes, the
existing locale attribute parsing logic works on language tags as
well.

Fix a small memory leak, too.
---
 src/backend/commands/collationcmds.c          |  8 +++---
 src/backend/utils/adt/pg_locale.c             | 26 ++++++++++++++++---
 .../regress/expected/collate.icu.utf8.out     |  8 ++++++
 src/test/regress/sql/collate.icu.utf8.sql     |  4 +++
 4 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 8949684afe..b8f2e7059f 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -950,7 +950,6 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			const char *name;
 			char	   *langtag;
 			char	   *icucomment;
-			const char *iculocstr;
 			Oid			collid;
 
 			if (i == -1)
@@ -959,20 +958,19 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 				name = uloc_getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
-			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
 			 * pg_collation
 			 */
-			if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+			if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
 				continue;
 
 			collid = CollationCreate(psprintf("%s-x-icu", langtag),
 									 nspid, GetUserId(),
 									 COLLPROVIDER_ICU, true, -1,
-									 NULL, NULL, iculocstr, NULL,
-									 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+									 NULL, NULL, langtag, NULL,
+									 get_collation_actual_version(COLLPROVIDER_ICU, langtag),
 									 true, true);
 			if (OidIsValid(collid))
 			{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 1d3d4d86d3..b9c7fbd511 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2643,9 +2643,28 @@ pg_attribute_unused()
 static void
 icu_set_collation_attributes(UCollator *collator, const char *loc)
 {
-	char	   *str = asc_tolower(loc, strlen(loc));
+	UErrorCode	status;
+	int32_t		len;
+	char	   *icu_locale_id;
+	char	   *lower_str;
+	char	   *str;
 
-	str = strchr(str, '@');
+	/* first, make sure the string is an ICU format locale ID */
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, NULL, 0, &status);
+	icu_locale_id = palloc(len + 1);
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, icu_locale_id, len + 1, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("canonicalization failed for locale string \"%s\": %s",
+						loc, u_errorName(status))));
+
+	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
+
+	pfree(icu_locale_id);
+
+	str = strchr(lower_str, '@');
 	if (!str)
 		return;
 	str++;
@@ -2660,7 +2679,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			char	   *value;
 			UColAttribute uattr;
 			UColAttributeValue uvalue;
-			UErrorCode	status;
 
 			status = U_ZERO_ERROR;
 
@@ -2727,6 +2745,8 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 								loc, u_errorName(status))));
 		}
 	}
+
+	pfree(lower_str);
 }
 
 #endif							/* USE_ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 9a3e12e42d..057c1d1bf6 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1304,6 +1304,14 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
  t        | t
 (1 row)
 
+-- test language tags
+CREATE COLLATION langtag (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE langtag = 'AbCd' COLLATE langtag;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 0790068f31..b9f5798715 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -518,6 +518,10 @@ CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=second
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
 SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_insensitive;
 
+-- test language tags
+CREATE COLLATION langtag (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE langtag = 'AbCd' COLLATE langtag;
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
-- 
2.34.1

#25

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#24)

3 attachment(s)

Re: ICU locale validation / canonicalization

On Tue, 2023-03-14 at 10:10 -0700, Jeff Davis wrote:

One loose end is that we really should support language tags like
"und"
in those older versions (54 and earlier). Your commit d72900bded
avoided the problem, but perhaps we should fix it by looking for
"und"
and replacing it with "root" while opening, or something.

Attached are a few patches to implement this idea.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v4-0001-Support-language-tags-in-older-ICU-versions-53-an.patchtext/x-patch; charset=UTF-8; name=v4-0001-Support-language-tags-in-older-ICU-versions-53-an.patchDownload

From fa789a446b425cd491a0452acb2b2f135ce06f5a Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 09:58:29 -0700
Subject: [PATCH v4 1/3] Support language tags in older ICU versions (53 and
 earlier).

By calling uloc_canonicalize() before parsing the attributes, the
existing locale attribute parsing logic works on language tags as
well.

Fix a small memory leak, too.
---
 src/backend/commands/collationcmds.c          |  8 +++---
 src/backend/utils/adt/pg_locale.c             | 26 ++++++++++++++++---
 .../regress/expected/collate.icu.utf8.out     |  8 ++++++
 src/test/regress/sql/collate.icu.utf8.sql     |  4 +++
 4 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 8949684afe..b8f2e7059f 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -950,7 +950,6 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			const char *name;
 			char	   *langtag;
 			char	   *icucomment;
-			const char *iculocstr;
 			Oid			collid;
 
 			if (i == -1)
@@ -959,20 +958,19 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 				name = uloc_getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
-			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
 			 * pg_collation
 			 */
-			if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+			if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
 				continue;
 
 			collid = CollationCreate(psprintf("%s-x-icu", langtag),
 									 nspid, GetUserId(),
 									 COLLPROVIDER_ICU, true, -1,
-									 NULL, NULL, iculocstr, NULL,
-									 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+									 NULL, NULL, langtag, NULL,
+									 get_collation_actual_version(COLLPROVIDER_ICU, langtag),
 									 true, true);
 			if (OidIsValid(collid))
 			{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 1d3d4d86d3..b9c7fbd511 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2643,9 +2643,28 @@ pg_attribute_unused()
 static void
 icu_set_collation_attributes(UCollator *collator, const char *loc)
 {
-	char	   *str = asc_tolower(loc, strlen(loc));
+	UErrorCode	status;
+	int32_t		len;
+	char	   *icu_locale_id;
+	char	   *lower_str;
+	char	   *str;
 
-	str = strchr(str, '@');
+	/* first, make sure the string is an ICU format locale ID */
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, NULL, 0, &status);
+	icu_locale_id = palloc(len + 1);
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, icu_locale_id, len + 1, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("canonicalization failed for locale string \"%s\": %s",
+						loc, u_errorName(status))));
+
+	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
+
+	pfree(icu_locale_id);
+
+	str = strchr(lower_str, '@');
 	if (!str)
 		return;
 	str++;
@@ -2660,7 +2679,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			char	   *value;
 			UColAttribute uattr;
 			UColAttributeValue uvalue;
-			UErrorCode	status;
 
 			status = U_ZERO_ERROR;
 
@@ -2727,6 +2745,8 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 								loc, u_errorName(status))));
 		}
 	}
+
+	pfree(lower_str);
 }
 
 #endif							/* USE_ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 9a3e12e42d..6225b575ce 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1304,6 +1304,14 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
  t        | t
 (1 row)
 
+-- test language tags
+CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 0790068f31..64cbfd0a5b 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -518,6 +518,10 @@ CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=second
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
 SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_insensitive;
 
+-- test language tags
+CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
-- 
2.34.1

v4-0002-Wrap-ICU-ucol_open.patchtext/x-patch; charset=UTF-8; name=v4-0002-Wrap-ICU-ucol_open.patchDownload

From c9f089eeedb6c145ad36b8618da7a71d0e69fd8c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 21:21:17 -0700
Subject: [PATCH v4 2/3] Wrap ICU ucol_open().

Hide details of supporting older ICU versions in a wrapper
function. The current code only needs to handle
icu_set_collation_attributes(), but a subsequent commit will add
additional version-specific code.
---
 src/backend/utils/adt/pg_locale.c | 52 +++++++++++++++----------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b9c7fbd511..04d1fa72a8 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1420,24 +1420,35 @@ lc_ctype_is_c(Oid collation)
 
 struct pg_locale_struct default_locale;
 
-void
-make_icu_collator(const char *iculocstr,
-				  const char *icurules,
-				  struct pg_locale_struct *resultp)
+static UCollator *
+pg_ucol_open(const char *locale_str)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 	UErrorCode	status;
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = ucol_open(locale_str, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						locale_str, u_errorName(status))));
+
+#if U_ICU_VERSION_MAJOR_NUM < 54
+	icu_set_collation_attributes(collator, locale_str);
+#endif
+
+	return collator;
+}
+
+void
+make_icu_collator(const char *iculocstr,
+				  const char *icurules,
+				  struct pg_locale_struct *resultp)
+{
+#ifdef USE_ICU
+	UCollator  *collator;
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
+	collator = pg_ucol_open(iculocstr);
 
 	/*
 	 * If rules are specified, we extract the rules of the standard collation,
@@ -1448,6 +1459,7 @@ make_icu_collator(const char *iculocstr,
 		const UChar *default_rules;
 		UChar	   *agg_rules;
 		UChar	   *my_rules;
+		UErrorCode	status;
 		int32_t		length;
 
 		default_rules = ucol_getRules(collator, &length);
@@ -1719,16 +1731,11 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	if (collprovider == COLLPROVIDER_ICU)
 	{
 		UCollator  *collator;
-		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
-		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
-		if (U_FAILURE(status))
-			ereport(ERROR,
-					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
+		collator = pg_ucol_open(collcollate);
+
 		ucol_getVersion(collator, versioninfo);
 		ucol_close(collator);
 
@@ -2639,6 +2646,7 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  * ucol_open(), so this is only necessary for emulating this behavior on older
  * versions.
  */
+#if U_ICU_VERSION_MAJOR_NUM < 54
 pg_attribute_unused()
 static void
 icu_set_collation_attributes(UCollator *collator, const char *loc)
@@ -2748,6 +2756,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 
 	pfree(lower_str);
 }
+#endif
 
 #endif							/* USE_ICU */
 
@@ -2759,17 +2768,8 @@ check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
 	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
+	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
 #else
 	ereport(ERROR,
-- 
2.34.1

v4-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patchtext/x-patch; charset=UTF-8; name=v4-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patchDownload

From 0ae14118d593d73ea441648fc3c8df067253f627 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 22:28:21 -0700
Subject: [PATCH v4 3/3] Handle the "und" locale in ICU versions 54 and older.

The "und" locale is an alternative spelling of the root locale, but it
was not recognized until ICU 55. To maintain common behavior across
all supported ICU versions, check for "und" and replace with "root"
before opening.

Previously, the lack of support for "und" was dangerous, because
versions 54 and older fall back to the environment when a locale is
not found. If the user specified "und" for the language (which is
expected and documented), it could not only resolve to the wrong
collator, but it could unexpectedly change (which could lead to
corrupt indexes).

This effectively reverts commit d72900bded, which worked around the
problem for the built-in "unicode" collation, and is no longer
necessary.
---
 src/backend/utils/adt/pg_locale.c             | 30 +++++++++++++++++++
 src/bin/initdb/initdb.c                       |  2 +-
 src/include/catalog/catversion.h              |  2 +-
 .../regress/expected/collate.icu.utf8.out     |  7 +++++
 src/test/regress/sql/collate.icu.utf8.sql     |  2 ++
 5 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 04d1fa72a8..4dee0f40cb 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1420,12 +1420,35 @@ lc_ctype_is_c(Oid collation)
 
 struct pg_locale_struct default_locale;
 
+#ifdef USE_ICU
+
 static UCollator *
 pg_ucol_open(const char *locale_str)
 {
 	UCollator  *collator;
 	UErrorCode	status;
 
+	/*
+	 * In ICU versions 55 and earlier, "und" is not a recognized spelling of
+	 * the root locale. If the first component of the locale is "und", replace
+	 * with "root" before opening.
+	 */
+#if U_ICU_VERSION_MAJOR_NUM < 55
+	char *fixed_str = NULL;
+
+	if (strncasecmp(locale_str, "und", strlen("und")) == 0 &&
+		!isalnum(locale_str[strlen("und")]))
+	{
+		const char *remainder = locale_str + strlen("und");
+
+		fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
+		strcpy(fixed_str, "root");
+		strcat(fixed_str, remainder);
+
+		locale_str = fixed_str;
+	}
+#endif
+
 	status = U_ZERO_ERROR;
 	collator = ucol_open(locale_str, &status);
 	if (U_FAILURE(status))
@@ -1437,9 +1460,16 @@ pg_ucol_open(const char *locale_str)
 	icu_set_collation_attributes(collator, locale_str);
 #endif
 
+#if U_ICU_VERSION_MAJOR_NUM < 55
+	if (fixed_str != NULL)
+		pfree(fixed_str);
+#endif
+
 	return collator;
 }
 
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  const char *icurules,
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 68d430ed63..d48b7b6060 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1498,7 +1498,7 @@ setup_collation(FILE *cmdfd)
 	 * that they win if libc defines a locale with the same name.
 	 */
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
-				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, '');\n\n",
+				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, 'und');\n\n",
 				  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU);
 
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, collcollate, collctype)"
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 309aed3703..b2eed22d46 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202303141
+#define CATALOG_VERSION_NO	202303151
 
 #endif
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 6225b575ce..f135200c99 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1312,6 +1312,13 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  t
 (1 row)
 
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 64cbfd0a5b..8105ebc8ae 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -521,6 +521,8 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
 
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
-- 
2.34.1

#26

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#25)

5 attachment(s)

Re: ICU locale validation / canonicalization

On Tue, 2023-03-14 at 23:47 -0700, Jeff Davis wrote:

On Tue, 2023-03-14 at 10:10 -0700, Jeff Davis wrote:

One loose end is that we really should support language tags like
"und"
in those older versions (54 and earlier). Your commit d72900bded
avoided the problem, but perhaps we should fix it by looking for
"und"
and replacing it with "root" while opening, or something.

Attached are a few patches to implement this idea.

Here is an updated patch series that includes these earlier fixes for
older ICU versions, with the canonicalization patch last (0005).

I left out the validation patch for now, and I'm evaluating a different
approach that will attempt to match to the locales retrieved with
uloc_countAvailable()/uloc_getAvailable().

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v5-0001-Support-language-tags-in-older-ICU-versions-53-an.patchtext/x-patch; charset=UTF-8; name=v5-0001-Support-language-tags-in-older-ICU-versions-53-an.patchDownload

From 70d98770ce6c1795ab172adf10bda87dafa310e3 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 09:58:29 -0700
Subject: [PATCH v5 1/5] Support language tags in older ICU versions (53 and
 earlier).

By calling uloc_canonicalize() before parsing the attributes, the
existing locale attribute parsing logic works on language tags as
well.

Fix a small memory leak, too.

Discussion: http://postgr.es/m/60da0cecfb512a78b8666b31631a636215d8ce73.camel@j-davis.com
---
 src/backend/commands/collationcmds.c          |  8 +++---
 src/backend/utils/adt/pg_locale.c             | 26 ++++++++++++++++---
 .../regress/expected/collate.icu.utf8.out     |  8 ++++++
 src/test/regress/sql/collate.icu.utf8.sql     |  4 +++
 4 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 8949684afe..b8f2e7059f 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -950,7 +950,6 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			const char *name;
 			char	   *langtag;
 			char	   *icucomment;
-			const char *iculocstr;
 			Oid			collid;
 
 			if (i == -1)
@@ -959,20 +958,19 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 				name = uloc_getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
-			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
 			 * pg_collation
 			 */
-			if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+			if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
 				continue;
 
 			collid = CollationCreate(psprintf("%s-x-icu", langtag),
 									 nspid, GetUserId(),
 									 COLLPROVIDER_ICU, true, -1,
-									 NULL, NULL, iculocstr, NULL,
-									 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+									 NULL, NULL, langtag, NULL,
+									 get_collation_actual_version(COLLPROVIDER_ICU, langtag),
 									 true, true);
 			if (OidIsValid(collid))
 			{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 1d3d4d86d3..b9c7fbd511 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2643,9 +2643,28 @@ pg_attribute_unused()
 static void
 icu_set_collation_attributes(UCollator *collator, const char *loc)
 {
-	char	   *str = asc_tolower(loc, strlen(loc));
+	UErrorCode	status;
+	int32_t		len;
+	char	   *icu_locale_id;
+	char	   *lower_str;
+	char	   *str;
 
-	str = strchr(str, '@');
+	/* first, make sure the string is an ICU format locale ID */
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, NULL, 0, &status);
+	icu_locale_id = palloc(len + 1);
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, icu_locale_id, len + 1, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("canonicalization failed for locale string \"%s\": %s",
+						loc, u_errorName(status))));
+
+	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
+
+	pfree(icu_locale_id);
+
+	str = strchr(lower_str, '@');
 	if (!str)
 		return;
 	str++;
@@ -2660,7 +2679,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			char	   *value;
 			UColAttribute uattr;
 			UColAttributeValue uvalue;
-			UErrorCode	status;
 
 			status = U_ZERO_ERROR;
 
@@ -2727,6 +2745,8 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 								loc, u_errorName(status))));
 		}
 	}
+
+	pfree(lower_str);
 }
 
 #endif							/* USE_ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 9a3e12e42d..6225b575ce 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1304,6 +1304,14 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
  t        | t
 (1 row)
 
+-- test language tags
+CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 0790068f31..64cbfd0a5b 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -518,6 +518,10 @@ CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=second
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
 SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_insensitive;
 
+-- test language tags
+CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
-- 
2.34.1

v5-0002-Wrap-ICU-ucol_open.patchtext/x-patch; charset=UTF-8; name=v5-0002-Wrap-ICU-ucol_open.patchDownload

From baf0aa5ff7db793424a73958e38d2ceb7d9877e4 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 21:21:17 -0700
Subject: [PATCH v5 2/5] Wrap ICU ucol_open().

Hide details of supporting older ICU versions in a wrapper
function. The current code only needs to handle
icu_set_collation_attributes(), but a subsequent commit will add
additional version-specific code.
---
 src/backend/utils/adt/pg_locale.c | 54 ++++++++++++++++---------------
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b9c7fbd511..28d1715ece 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -143,8 +143,10 @@ static size_t uchar_length(UConverter *converter,
 static int32_t uchar_convert(UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, int32_t srclen);
+#if U_ICU_VERSION_MAJOR_NUM < 54
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
+#endif
 
 /*
  * pg_perm_setlocale
@@ -1420,24 +1422,35 @@ lc_ctype_is_c(Oid collation)
 
 struct pg_locale_struct default_locale;
 
-void
-make_icu_collator(const char *iculocstr,
-				  const char *icurules,
-				  struct pg_locale_struct *resultp)
+static UCollator *
+pg_ucol_open(const char *locale_str)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 	UErrorCode	status;
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = ucol_open(locale_str, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						locale_str, u_errorName(status))));
+
+#if U_ICU_VERSION_MAJOR_NUM < 54
+	icu_set_collation_attributes(collator, locale_str);
+#endif
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
+	return collator;
+}
+
+void
+make_icu_collator(const char *iculocstr,
+				  const char *icurules,
+				  struct pg_locale_struct *resultp)
+{
+#ifdef USE_ICU
+	UCollator  *collator;
+
+	collator = pg_ucol_open(iculocstr);
 
 	/*
 	 * If rules are specified, we extract the rules of the standard collation,
@@ -1448,6 +1461,7 @@ make_icu_collator(const char *iculocstr,
 		const UChar *default_rules;
 		UChar	   *agg_rules;
 		UChar	   *my_rules;
+		UErrorCode	status;
 		int32_t		length;
 
 		default_rules = ucol_getRules(collator, &length);
@@ -1719,16 +1733,11 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	if (collprovider == COLLPROVIDER_ICU)
 	{
 		UCollator  *collator;
-		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
-		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
-		if (U_FAILURE(status))
-			ereport(ERROR,
-					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
+		collator = pg_ucol_open(collcollate);
+
 		ucol_getVersion(collator, versioninfo);
 		ucol_close(collator);
 
@@ -2639,6 +2648,7 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  * ucol_open(), so this is only necessary for emulating this behavior on older
  * versions.
  */
+#if U_ICU_VERSION_MAJOR_NUM < 54
 pg_attribute_unused()
 static void
 icu_set_collation_attributes(UCollator *collator, const char *loc)
@@ -2748,6 +2758,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 
 	pfree(lower_str);
 }
+#endif
 
 #endif							/* USE_ICU */
 
@@ -2759,17 +2770,8 @@ check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
 	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
+	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
 #else
 	ereport(ERROR,
-- 
2.34.1

v5-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patchtext/x-patch; charset=UTF-8; name=v5-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patchDownload

From 2b6e8998d016f5aa06d9d043fb538892162e966d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 22:28:21 -0700
Subject: [PATCH v5 3/5] Handle the "und" locale in ICU versions 54 and older.

The "und" locale is an alternative spelling of the root locale, but it
was not recognized until ICU 55. To maintain common behavior across
all supported ICU versions, check for "und" and replace with "root"
before opening.

Previously, the lack of support for "und" was dangerous, because
versions 54 and older fall back to the environment when a locale is
not found. If the user specified "und" for the language (which is
expected and documented), it could not only resolve to the wrong
collator, but it could unexpectedly change (which could lead to
corrupt indexes).

This effectively reverts commit d72900bded, which worked around the
problem for the built-in "unicode" collation, and is no longer
necessary.

Discussion: https://postgr.es/m/60da0cecfb512a78b8666b31631a636215d8ce73.camel@j-davis.com
Discussion: https://postgr.es/m/0c6fa66f2753217d2a40480a96bd2ccf023536a1.camel@j-davis.com
---
 src/backend/utils/adt/pg_locale.c             | 30 +++++++++++++++++++
 src/bin/initdb/initdb.c                       |  2 +-
 src/include/catalog/catversion.h              |  2 +-
 .../regress/expected/collate.icu.utf8.out     |  7 +++++
 src/test/regress/sql/collate.icu.utf8.sql     |  2 ++
 5 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 28d1715ece..a8cc8ffcb1 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1422,12 +1422,35 @@ lc_ctype_is_c(Oid collation)
 
 struct pg_locale_struct default_locale;
 
+#ifdef USE_ICU
+
 static UCollator *
 pg_ucol_open(const char *locale_str)
 {
 	UCollator  *collator;
 	UErrorCode	status;
 
+	/*
+	 * In ICU versions 55 and earlier, "und" is not a recognized spelling of
+	 * the root locale. If the first component of the locale is "und", replace
+	 * with "root" before opening.
+	 */
+#if U_ICU_VERSION_MAJOR_NUM < 55
+	char *fixed_str = NULL;
+
+	if (strncasecmp(locale_str, "und", strlen("und")) == 0 &&
+		!isalnum(locale_str[strlen("und")]))
+	{
+		const char *remainder = locale_str + strlen("und");
+
+		fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
+		strcpy(fixed_str, "root");
+		strcat(fixed_str, remainder);
+
+		locale_str = fixed_str;
+	}
+#endif
+
 	status = U_ZERO_ERROR;
 	collator = ucol_open(locale_str, &status);
 	if (U_FAILURE(status))
@@ -1439,9 +1462,16 @@ pg_ucol_open(const char *locale_str)
 	icu_set_collation_attributes(collator, locale_str);
 #endif
 
+#if U_ICU_VERSION_MAJOR_NUM < 55
+	if (fixed_str != NULL)
+		pfree(fixed_str);
+#endif
+
 	return collator;
 }
 
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  const char *icurules,
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 68d430ed63..d48b7b6060 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1498,7 +1498,7 @@ setup_collation(FILE *cmdfd)
 	 * that they win if libc defines a locale with the same name.
 	 */
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
-				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, '');\n\n",
+				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, 'und');\n\n",
 				  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU);
 
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, collcollate, collctype)"
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 309aed3703..b2eed22d46 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202303141
+#define CATALOG_VERSION_NO	202303151
 
 #endif
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 6225b575ce..f135200c99 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1312,6 +1312,13 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  t
 (1 row)
 
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 64cbfd0a5b..8105ebc8ae 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -521,6 +521,8 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
 
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
-- 
2.34.1

v5-0004-Add-SQL-function-pg_icu_language_tag.patchtext/x-patch; charset=UTF-8; name=v5-0004-Add-SQL-function-pg_icu_language_tag.patchDownload

From aa3d9164177742e8a8c1ad25a3613d3ade1aa7b1 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 11:27:12 -0700
Subject: [PATCH v5 4/5] Add SQL function pg_icu_language_tag().

Account for locales "C" and "POSIX", which correspond to the
language tag "en-US-u-va-posix".

Also, don't rely on a fixed-size buffer for language tags, as there is
no defined upper limit (cf. RFC5646 section 4.4).
---
 doc/src/sgml/func.sgml                        | 15 +++++
 src/backend/commands/collationcmds.c          | 44 +++++++-------
 src/backend/utils/adt/pg_locale.c             | 59 +++++++++++++++++++
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/catalog/catversion.h              |  2 +-
 src/include/catalog/pg_proc.dat               |  5 ++
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 55 +++++++++++++++++
 src/test/regress/sql/collate.icu.utf8.sql     | 10 ++++
 9 files changed, 171 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 15314aa3ee..dd073b7e26 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27453,6 +27453,21 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         Use of this function is restricted to superusers.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_icu_language_tag</primary>
+        </indexterm>
+        <function>pg_icu_language_tag</function> ( <parameter>locale</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Canonicalizes the given <parameter>locale</parameter> string into a
+        BCP 47 language tag (see <xref
+        linkend="collation-managing-create-icu"/>).
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index b8f2e7059f..c6dea866da 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -576,26 +576,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -957,7 +937,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, false);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
@@ -1014,3 +994,25 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(ncreated);
 }
+
+/*
+ * pg_icu_language_tag
+ *
+ * Return the BCP47 language tag representation of the given locale string.
+ */
+Datum
+pg_icu_language_tag(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+	text	*locale_text = PG_GETARG_TEXT_PP(0);
+	char	*locale_cstr = text_to_cstring(locale_text);
+	char	*langtag	 = icu_language_tag(locale_cstr, false);
+
+	PG_RETURN_TEXT_P(cstring_to_text(langtag));
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ICU is not supported in this build")));
+	PG_RETURN_NULL();
+#endif
+}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index a8cc8ffcb1..0ac7ce9a80 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2810,6 +2810,65 @@ check_icu_locale(const char *icu_locale)
 #endif
 }
 
+#ifdef USE_ICU
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, bool noError)
+{
+	UErrorCode	 status;
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if ((pg_strncasecmp(loc_str, "c", 1) == 0 && !isalnum(loc_str[1])) ||
+		(pg_strncasecmp(loc_str, "posix", 5) == 0 && !isalnum(loc_str[5])))
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+		if (len < buflen || buflen >= MaxAllocSize)
+			break;
+
+		buflen = Min(buflen * 2, MaxAllocSize);
+		langtag = repalloc(langtag, buflen);
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+		if (noError)
+			return NULL;
+
+		ereport(ERROR,
+				(errmsg("could not convert locale name \"%s\" to language tag: %s",
+						loc_str, u_errorName(status))));
+	}
+
+	return langtag;
+}
+
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a22f27f300..0b38c0537b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1837,9 +1837,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index b2eed22d46..8b205ab161 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202303151
+#define CATALOG_VERSION_NO	202303161
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fbc4aade49..063f77f70c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11817,6 +11817,11 @@
   proname => 'pg_database_collation_actual_version', procost => '100',
   provolatile => 'v', prorettype => 'text', proargtypes => 'oid',
   prosrc => 'pg_database_collation_actual_version' },
+{ oid => '6273',
+  descr => 'get BCP47 language tag representation of locale',
+  proname => 'pg_icu_language_tag', procost => '100',
+  provolatile => 's', prorettype => 'text', proargtypes => 'text',
+  prosrc => 'pg_icu_language_tag' },
 
 # system management/monitoring related functions
 { oid => '3353', descr => 'list files in the log directory',
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index f9ce428233..b78fe0d72c 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -119,6 +119,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_language_tag(const char *locale_str, bool noError);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f135200c99..ae608c29be 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -11,6 +11,61 @@ SELECT getdatabaseencoding() <> 'UTF8' OR
 SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
+-- test language tag canonicalization
+SELECT pg_icu_language_tag('en_US');
+ pg_icu_language_tag 
+---------------------
+ en-US
+(1 row)
+
+SELECT pg_icu_language_tag('nonsense');
+ pg_icu_language_tag 
+---------------------
+ nonsense
+(1 row)
+
+SELECT pg_icu_language_tag('C.UTF-8');
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('POSIX');
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('en_US_POSIX');
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('@colStrength=secondary');
+ pg_icu_language_tag 
+---------------------
+ und-u-ks-level2
+(1 row)
+
+SELECT pg_icu_language_tag('');
+ pg_icu_language_tag 
+---------------------
+ und
+(1 row)
+
+SELECT pg_icu_language_tag('fr_CA.UTF-8');
+ pg_icu_language_tag 
+---------------------
+ fr-CA
+(1 row)
+
+SELECT pg_icu_language_tag('en_US@colStrength=primary');
+ pg_icu_language_tag 
+---------------------
+ en-US-u-ks-level1
+(1 row)
+
 CREATE TABLE collate_test1 (
     a int,
     b text COLLATE "en-x-icu" NOT NULL
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8105ebc8ae..1cd823ef37 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -15,6 +15,16 @@ SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
 
+-- test language tag canonicalization
+SELECT pg_icu_language_tag('en_US');
+SELECT pg_icu_language_tag('nonsense');
+SELECT pg_icu_language_tag('C.UTF-8');
+SELECT pg_icu_language_tag('POSIX');
+SELECT pg_icu_language_tag('en_US_POSIX');
+SELECT pg_icu_language_tag('@colStrength=secondary');
+SELECT pg_icu_language_tag('');
+SELECT pg_icu_language_tag('fr_CA.UTF-8');
+SELECT pg_icu_language_tag('en_US@colStrength=primary');
 
 CREATE TABLE collate_test1 (
     a int,
-- 
2.34.1

v5-0005-Canonicalize-ICU-locale-names-to-language-tags.patchtext/x-patch; charset=UTF-8; name=v5-0005-Canonicalize-ICU-locale-names-to-language-tags.patchDownload

From 7abe977948fb4a501a889b14838a71ebbd4f946a Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v5 5/5] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
---
 src/backend/commands/collationcmds.c          | 38 ++++++++++++++
 src/backend/commands/dbcommands.c             | 32 ++++++++++++
 src/backend/utils/adt/pg_locale.c             |  9 ----
 src/bin/initdb/initdb.c                       | 49 +++++++++++++++++++
 .../regress/expected/collate.icu.utf8.out     | 25 +++++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 13 +++++
 6 files changed, 156 insertions(+), 10 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index c6dea866da..bc0e48d507 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -254,10 +259,43 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
+			char	*langtag;
+
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			check_icu_locale(colliculocale);
+
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+				langtag = icu_language_tag(colliculocale, true);
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+				else
+				{
+					ereport(WARNING,
+							(errmsg("could not convert locale \"%s\" to language tag",
+									colliculocale)));
+				}
+			}
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 4d5d5d6866..5935477f44 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
+		char	*langtag;
+
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1059,6 +1062,35 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					 errmsg("ICU locale must be specified")));
 
 		check_icu_locale(dbiculocale);
+
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			langtag = icu_language_tag(dbiculocale, true);
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+			else
+			{
+				ereport(WARNING,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								dbiculocale)));
+			}
+		}
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 0ac7ce9a80..dd2857f1ff 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2790,27 +2790,18 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 }
 #endif
 
-#endif							/* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 
 	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
-#ifdef USE_ICU
 /*
  * Return the BCP47 language tag representation of the requested locale.
  *
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index d48b7b6060..3ce7b5991c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2039,6 +2039,50 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+#ifdef USE_ICU
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if ((pg_strncasecmp(loc_str, "c", 1) == 0 && !isalnum(loc_str[1])) ||
+		(pg_strncasecmp(loc_str, "posix", 5) == 0 && !isalnum(loc_str[5])))
+		return pstrdup("en-US-u-va-posix");
+
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+		if (len < buflen)
+			break;
+
+		buflen = buflen * 2;
+		langtag = pg_realloc(langtag, buflen);
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+}
+#endif
+
 /*
  * Check that ICU accepts the locale name; or if not specified, retrieve the
  * default ICU locale.
@@ -2049,6 +2093,7 @@ check_icu_locale(void)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	char		*langtag;
 
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
@@ -2080,6 +2125,10 @@ check_icu_locale(void)
 	}
 
 	ucol_close(collator);
+
+	langtag = icu_language_tag(icu_locale);
+	pg_free(icu_locale);
+	icu_locale = langtag;
 #endif
 }
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index ae608c29be..9715c12c93 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1074,6 +1074,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1088,9 +1089,11 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1217,14 +1220,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1232,7 +1239,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1240,13 +1249,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1258,6 +1270,7 @@ ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUME
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1266,6 +1279,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1293,10 +1307,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using language tag "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1346,7 +1363,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1361,6 +1380,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+NOTICE:  using language tag "en-u-ks-level1" for locale "en-u-ks-level1"
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  ?column? 
 ----------
@@ -1368,6 +1388,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
 (1 row)
 
 CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"
 SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
  ?column? 
 ----------
@@ -1828,7 +1849,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 1cd823ef37..ddd1ab7e66 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -367,6 +367,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -380,6 +382,9 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
 
@@ -464,10 +469,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -477,7 +486,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -666,7 +677,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1

#27

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#26)

6 attachment(s)

Re: ICU locale validation / canonicalization

On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote:

I left out the validation patch for now, and I'm evaluating a
different
approach that will attempt to match to the locales retrieved with
uloc_countAvailable()/uloc_getAvailable().

I like this approach, attached new patch series with that included as
0006.

The first 3 patches are essentially bugfixes -- should they be
backported?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v6-0001-Support-language-tags-in-older-ICU-versions-53-an.patchtext/x-patch; charset=UTF-8; name=v6-0001-Support-language-tags-in-older-ICU-versions-53-an.patchDownload

From 39cb2f32e087ff95da42c67a7e9b147ceca994af Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 09:58:29 -0700
Subject: [PATCH v6 1/6] Support language tags in older ICU versions (53 and
 earlier).

By calling uloc_canonicalize() before parsing the attributes, the
existing locale attribute parsing logic works on language tags as
well.

Fix a small memory leak, too.

Discussion: http://postgr.es/m/60da0cecfb512a78b8666b31631a636215d8ce73.camel@j-davis.com
---
 src/backend/commands/collationcmds.c          |  8 +++---
 src/backend/utils/adt/pg_locale.c             | 26 ++++++++++++++++---
 .../regress/expected/collate.icu.utf8.out     |  8 ++++++
 src/test/regress/sql/collate.icu.utf8.sql     |  4 +++
 4 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 8949684afe..b8f2e7059f 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -950,7 +950,6 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			const char *name;
 			char	   *langtag;
 			char	   *icucomment;
-			const char *iculocstr;
 			Oid			collid;
 
 			if (i == -1)
@@ -959,20 +958,19 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 				name = uloc_getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
-			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
 			 * pg_collation
 			 */
-			if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+			if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
 				continue;
 
 			collid = CollationCreate(psprintf("%s-x-icu", langtag),
 									 nspid, GetUserId(),
 									 COLLPROVIDER_ICU, true, -1,
-									 NULL, NULL, iculocstr, NULL,
-									 get_collation_actual_version(COLLPROVIDER_ICU, iculocstr),
+									 NULL, NULL, langtag, NULL,
+									 get_collation_actual_version(COLLPROVIDER_ICU, langtag),
 									 true, true);
 			if (OidIsValid(collid))
 			{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 1d3d4d86d3..b9c7fbd511 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2643,9 +2643,28 @@ pg_attribute_unused()
 static void
 icu_set_collation_attributes(UCollator *collator, const char *loc)
 {
-	char	   *str = asc_tolower(loc, strlen(loc));
+	UErrorCode	status;
+	int32_t		len;
+	char	   *icu_locale_id;
+	char	   *lower_str;
+	char	   *str;
 
-	str = strchr(str, '@');
+	/* first, make sure the string is an ICU format locale ID */
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, NULL, 0, &status);
+	icu_locale_id = palloc(len + 1);
+	status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, icu_locale_id, len + 1, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("canonicalization failed for locale string \"%s\": %s",
+						loc, u_errorName(status))));
+
+	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
+
+	pfree(icu_locale_id);
+
+	str = strchr(lower_str, '@');
 	if (!str)
 		return;
 	str++;
@@ -2660,7 +2679,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			char	   *value;
 			UColAttribute uattr;
 			UColAttributeValue uvalue;
-			UErrorCode	status;
 
 			status = U_ZERO_ERROR;
 
@@ -2727,6 +2745,8 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 								loc, u_errorName(status))));
 		}
 	}
+
+	pfree(lower_str);
 }
 
 #endif							/* USE_ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 9a3e12e42d..6225b575ce 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1304,6 +1304,14 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
  t        | t
 (1 row)
 
+-- test language tags
+CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 0790068f31..64cbfd0a5b 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -518,6 +518,10 @@ CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=second
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
 SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_insensitive;
 
+-- test language tags
+CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
-- 
2.34.1

v6-0002-Wrap-ICU-ucol_open.patchtext/x-patch; charset=UTF-8; name=v6-0002-Wrap-ICU-ucol_open.patchDownload

From 93a74fa032fdf6db56e21278cf39dcb0e544cc33 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 21:21:17 -0700
Subject: [PATCH v6 2/6] Wrap ICU ucol_open().

Hide details of supporting older ICU versions in a wrapper
function. The current code only needs to handle
icu_set_collation_attributes(), but a subsequent commit will add
additional version-specific code.
---
 src/backend/utils/adt/pg_locale.c | 54 ++++++++++++++++---------------
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b9c7fbd511..c344f7e2a8 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -143,8 +143,10 @@ static size_t uchar_length(UConverter *converter,
 static int32_t uchar_convert(UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, int32_t srclen);
+#if U_ICU_VERSION_MAJOR_NUM < 54
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
+#endif
 
 /*
  * pg_perm_setlocale
@@ -1420,24 +1422,35 @@ lc_ctype_is_c(Oid collation)
 
 struct pg_locale_struct default_locale;
 
-void
-make_icu_collator(const char *iculocstr,
-				  const char *icurules,
-				  struct pg_locale_struct *resultp)
+static UCollator *
+pg_ucol_open(const char *loc_str)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 	UErrorCode	status;
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
+	collator = ucol_open(loc_str, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
+						loc_str, u_errorName(status))));
+
+#if U_ICU_VERSION_MAJOR_NUM < 54
+	icu_set_collation_attributes(collator, loc_str);
+#endif
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
+	return collator;
+}
+
+void
+make_icu_collator(const char *iculocstr,
+				  const char *icurules,
+				  struct pg_locale_struct *resultp)
+{
+#ifdef USE_ICU
+	UCollator  *collator;
+
+	collator = pg_ucol_open(iculocstr);
 
 	/*
 	 * If rules are specified, we extract the rules of the standard collation,
@@ -1448,6 +1461,7 @@ make_icu_collator(const char *iculocstr,
 		const UChar *default_rules;
 		UChar	   *agg_rules;
 		UChar	   *my_rules;
+		UErrorCode	status;
 		int32_t		length;
 
 		default_rules = ucol_getRules(collator, &length);
@@ -1719,16 +1733,11 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	if (collprovider == COLLPROVIDER_ICU)
 	{
 		UCollator  *collator;
-		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
-		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
-		if (U_FAILURE(status))
-			ereport(ERROR,
-					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
+		collator = pg_ucol_open(collcollate);
+
 		ucol_getVersion(collator, versioninfo);
 		ucol_close(collator);
 
@@ -2639,6 +2648,7 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  * ucol_open(), so this is only necessary for emulating this behavior on older
  * versions.
  */
+#if U_ICU_VERSION_MAJOR_NUM < 54
 pg_attribute_unused()
 static void
 icu_set_collation_attributes(UCollator *collator, const char *loc)
@@ -2748,6 +2758,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 
 	pfree(lower_str);
 }
+#endif
 
 #endif							/* USE_ICU */
 
@@ -2759,17 +2770,8 @@ check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
 	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
+	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
 #else
 	ereport(ERROR,
-- 
2.34.1

v6-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patchtext/x-patch; charset=UTF-8; name=v6-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patchDownload

From a3442ec080ebce15ca6af8ab7284b905aa177164 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 22:28:21 -0700
Subject: [PATCH v6 3/6] Handle the "und" locale in ICU versions 54 and older.

The "und" locale is an alternative spelling of the root locale, but it
was not recognized until ICU 55. To maintain common behavior across
all supported ICU versions, check for "und" and replace with "root"
before opening.

Previously, the lack of support for "und" was dangerous, because
versions 54 and older fall back to the environment when a locale is
not found. If the user specified "und" for the language (which is
expected and documented), it could not only resolve to the wrong
collator, but it could unexpectedly change (which could lead to
corrupt indexes).

This effectively reverts commit d72900bded, which worked around the
problem for the built-in "unicode" collation, and is no longer
necessary.

Discussion: https://postgr.es/m/60da0cecfb512a78b8666b31631a636215d8ce73.camel@j-davis.com
Discussion: https://postgr.es/m/0c6fa66f2753217d2a40480a96bd2ccf023536a1.camel@j-davis.com
---
 src/backend/utils/adt/pg_locale.c             | 39 +++++++++++++++++++
 src/bin/initdb/initdb.c                       |  2 +-
 .../regress/expected/collate.icu.utf8.out     |  7 ++++
 src/test/regress/sql/collate.icu.utf8.sql     |  2 +
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index c344f7e2a8..11c6c25c15 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1422,12 +1422,44 @@ lc_ctype_is_c(Oid collation)
 
 struct pg_locale_struct default_locale;
 
+#ifdef USE_ICU
+
 static UCollator *
 pg_ucol_open(const char *loc_str)
 {
 	UCollator  *collator;
 	UErrorCode	status;
 
+	/*
+	 * In ICU versions 55 and earlier, "und" is not a recognized spelling of
+	 * the root locale. If the first component of the locale is "und", replace
+	 * with "root" before opening.
+	 */
+#if U_ICU_VERSION_MAJOR_NUM < 55
+	char *fixed_str = NULL;
+	char  lang[ULOC_LANG_CAPACITY];
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(ERROR,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+	}
+
+	if (strcmp(lang, "und") == 0)
+	{
+		const char *remainder = loc_str + strlen("und");
+
+		fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
+		strcpy(fixed_str, "root");
+		strcat(fixed_str, remainder);
+
+		loc_str = fixed_str;
+	}
+#endif
+
 	status = U_ZERO_ERROR;
 	collator = ucol_open(loc_str, &status);
 	if (U_FAILURE(status))
@@ -1439,9 +1471,16 @@ pg_ucol_open(const char *loc_str)
 	icu_set_collation_attributes(collator, loc_str);
 #endif
 
+#if U_ICU_VERSION_MAJOR_NUM < 55
+	if (fixed_str != NULL)
+		pfree(fixed_str);
+#endif
+
 	return collator;
 }
 
+#endif
+
 void
 make_icu_collator(const char *iculocstr,
 				  const char *icurules,
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 68d430ed63..d48b7b6060 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1498,7 +1498,7 @@ setup_collation(FILE *cmdfd)
 	 * that they win if libc defines a locale with the same name.
 	 */
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
-				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, '');\n\n",
+				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, 'und');\n\n",
 				  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU);
 
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, collcollate, collctype)"
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 6225b575ce..f135200c99 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1312,6 +1312,13 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  t
 (1 row)
 
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 64cbfd0a5b..8105ebc8ae 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -521,6 +521,8 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
 
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
-- 
2.34.1

v6-0004-Add-SQL-function-pg_icu_language_tag.patchtext/x-patch; charset=UTF-8; name=v6-0004-Add-SQL-function-pg_icu_language_tag.patchDownload

From c9cd8b79eae4fbd6175f1c75ee23fc09ec1a3d99 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 11:27:12 -0700
Subject: [PATCH v6 4/6] Add SQL function pg_icu_language_tag().

Account for locales "C" and "POSIX", which correspond to the
language tag "en-US-u-va-posix".

Also, don't rely on a fixed-size buffer for language tags, as there is
no defined upper limit (cf. RFC5646 section 4.4).
---
 doc/src/sgml/func.sgml                        | 15 ++++
 src/backend/commands/collationcmds.c          | 44 ++++++------
 src/backend/utils/adt/pg_locale.c             | 68 +++++++++++++++++++
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/catalog/catversion.h              |  2 +-
 src/include/catalog/pg_proc.dat               |  5 ++
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 55 +++++++++++++++
 src/test/regress/sql/collate.icu.utf8.sql     | 10 +++
 9 files changed, 180 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 15314aa3ee..dd073b7e26 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27453,6 +27453,21 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         Use of this function is restricted to superusers.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_icu_language_tag</primary>
+        </indexterm>
+        <function>pg_icu_language_tag</function> ( <parameter>locale</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Canonicalizes the given <parameter>locale</parameter> string into a
+        BCP 47 language tag (see <xref
+        linkend="collation-managing-create-icu"/>).
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index b8f2e7059f..c6dea866da 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -576,26 +576,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -957,7 +937,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, false);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
@@ -1014,3 +994,25 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(ncreated);
 }
+
+/*
+ * pg_icu_language_tag
+ *
+ * Return the BCP47 language tag representation of the given locale string.
+ */
+Datum
+pg_icu_language_tag(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+	text	*locale_text = PG_GETARG_TEXT_PP(0);
+	char	*locale_cstr = text_to_cstring(locale_text);
+	char	*langtag	 = icu_language_tag(locale_cstr, false);
+
+	PG_RETURN_TEXT_P(cstring_to_text(langtag));
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ICU is not supported in this build")));
+	PG_RETURN_NULL();
+#endif
+}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 11c6c25c15..d442cc37a0 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2819,6 +2819,74 @@ check_icu_locale(const char *icu_locale)
 #endif
 }
 
+#ifdef USE_ICU
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, bool noError)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(ERROR,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+		if (len < buflen || buflen >= MaxAllocSize)
+			break;
+
+		buflen = Min(buflen * 2, MaxAllocSize);
+		langtag = repalloc(langtag, buflen);
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+		if (noError)
+			return NULL;
+
+		ereport(ERROR,
+				(errmsg("could not convert locale name \"%s\" to language tag: %s",
+						loc_str, u_errorName(status))));
+	}
+
+	return langtag;
+}
+
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a22f27f300..0b38c0537b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1837,9 +1837,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index b2eed22d46..8b205ab161 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202303151
+#define CATALOG_VERSION_NO	202303161
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fbc4aade49..063f77f70c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11817,6 +11817,11 @@
   proname => 'pg_database_collation_actual_version', procost => '100',
   provolatile => 'v', prorettype => 'text', proargtypes => 'oid',
   prosrc => 'pg_database_collation_actual_version' },
+{ oid => '6273',
+  descr => 'get BCP47 language tag representation of locale',
+  proname => 'pg_icu_language_tag', procost => '100',
+  provolatile => 's', prorettype => 'text', proargtypes => 'text',
+  prosrc => 'pg_icu_language_tag' },
 
 # system management/monitoring related functions
 { oid => '3353', descr => 'list files in the log directory',
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index f9ce428233..999122b6f4 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -119,6 +119,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_language_tag(const char *loc_str, bool noError);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f135200c99..ae608c29be 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -11,6 +11,61 @@ SELECT getdatabaseencoding() <> 'UTF8' OR
 SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
+-- test language tag canonicalization
+SELECT pg_icu_language_tag('en_US');
+ pg_icu_language_tag 
+---------------------
+ en-US
+(1 row)
+
+SELECT pg_icu_language_tag('nonsense');
+ pg_icu_language_tag 
+---------------------
+ nonsense
+(1 row)
+
+SELECT pg_icu_language_tag('C.UTF-8');
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('POSIX');
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('en_US_POSIX');
+ pg_icu_language_tag 
+---------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_icu_language_tag('@colStrength=secondary');
+ pg_icu_language_tag 
+---------------------
+ und-u-ks-level2
+(1 row)
+
+SELECT pg_icu_language_tag('');
+ pg_icu_language_tag 
+---------------------
+ und
+(1 row)
+
+SELECT pg_icu_language_tag('fr_CA.UTF-8');
+ pg_icu_language_tag 
+---------------------
+ fr-CA
+(1 row)
+
+SELECT pg_icu_language_tag('en_US@colStrength=primary');
+ pg_icu_language_tag 
+---------------------
+ en-US-u-ks-level1
+(1 row)
+
 CREATE TABLE collate_test1 (
     a int,
     b text COLLATE "en-x-icu" NOT NULL
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8105ebc8ae..1cd823ef37 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -15,6 +15,16 @@ SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
 
+-- test language tag canonicalization
+SELECT pg_icu_language_tag('en_US');
+SELECT pg_icu_language_tag('nonsense');
+SELECT pg_icu_language_tag('C.UTF-8');
+SELECT pg_icu_language_tag('POSIX');
+SELECT pg_icu_language_tag('en_US_POSIX');
+SELECT pg_icu_language_tag('@colStrength=secondary');
+SELECT pg_icu_language_tag('');
+SELECT pg_icu_language_tag('fr_CA.UTF-8');
+SELECT pg_icu_language_tag('en_US@colStrength=primary');
 
 CREATE TABLE collate_test1 (
     a int,
-- 
2.34.1

v6-0005-Canonicalize-ICU-locale-names-to-language-tags.patchtext/x-patch; charset=UTF-8; name=v6-0005-Canonicalize-ICU-locale-names-to-language-tags.patchDownload

From e8e7c032e7b99d292fb9953775620cbebd32f344 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v6 5/6] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
---
 src/backend/commands/collationcmds.c          | 38 +++++++++++
 src/backend/commands/dbcommands.c             | 32 ++++++++++
 src/backend/utils/adt/pg_locale.c             |  9 ---
 src/bin/initdb/initdb.c                       | 63 +++++++++++++++++++
 .../regress/expected/collate.icu.utf8.out     | 25 +++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 13 ++++
 6 files changed, 170 insertions(+), 10 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index c6dea866da..bc0e48d507 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -254,10 +259,43 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
+			char	*langtag;
+
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			check_icu_locale(colliculocale);
+
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+				langtag = icu_language_tag(colliculocale, true);
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+				else
+				{
+					ereport(WARNING,
+							(errmsg("could not convert locale \"%s\" to language tag",
+									colliculocale)));
+				}
+			}
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 4d5d5d6866..5935477f44 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
+		char	*langtag;
+
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1059,6 +1062,35 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					 errmsg("ICU locale must be specified")));
 
 		check_icu_locale(dbiculocale);
+
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			langtag = icu_language_tag(dbiculocale, true);
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+			else
+			{
+				ereport(WARNING,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								dbiculocale)));
+			}
+		}
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index d442cc37a0..45a7f77b2d 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2799,27 +2799,18 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 }
 #endif
 
-#endif							/* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 
 	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
-#ifdef USE_ICU
 /*
  * Return the BCP47 language tag representation of the requested locale.
  *
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index d48b7b6060..ccabfe00b6 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2039,6 +2039,64 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+#ifdef USE_ICU
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+		if (len < buflen)
+			break;
+
+		buflen = buflen * 2;
+		langtag = pg_realloc(langtag, buflen);
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+}
+#endif
+
 /*
  * Check that ICU accepts the locale name; or if not specified, retrieve the
  * default ICU locale.
@@ -2049,6 +2107,7 @@ check_icu_locale(void)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	char		*langtag;
 
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
@@ -2080,6 +2139,10 @@ check_icu_locale(void)
 	}
 
 	ucol_close(collator);
+
+	langtag = icu_language_tag(icu_locale);
+	pg_free(icu_locale);
+	icu_locale = langtag;
 #endif
 }
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index ae608c29be..9715c12c93 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1074,6 +1074,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1088,9 +1089,11 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1217,14 +1220,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1232,7 +1239,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1240,13 +1249,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1258,6 +1270,7 @@ ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUME
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1266,6 +1279,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1293,10 +1307,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using language tag "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1346,7 +1363,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1361,6 +1380,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+NOTICE:  using language tag "en-u-ks-level1" for locale "en-u-ks-level1"
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  ?column? 
 ----------
@@ -1368,6 +1388,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
 (1 row)
 
 CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"
 SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
  ?column? 
 ----------
@@ -1828,7 +1849,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 1cd823ef37..ddd1ab7e66 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -367,6 +367,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -380,6 +382,9 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
 
@@ -464,10 +469,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -477,7 +486,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -666,7 +677,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1

v6-0006-Validate-ICU-locales.patchtext/x-patch; charset=UTF-8; name=v6-0006-Validate-ICU-locales.patchDownload

From 6d0de521a41f32037f13d242f30cac7a2c16ae53 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 17 Mar 2023 09:55:31 -0700
Subject: [PATCH v6 6/6] Validate ICU locales.

Ensure that it can be transformed into a language tag in "strict" mode
(which validates the attributes), and also that the language exists in
ICU.

Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important in ICU versions 54 and earlier, where
the same (misspelled) locale string could fall back to different
locales depending on the environment.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.camel@j-davis.com
---
 doc/src/sgml/config.sgml                      | 17 +++++
 src/backend/commands/collationcmds.c          |  8 +--
 src/backend/commands/dbcommands.c             |  8 +--
 src/backend/utils/adt/pg_locale.c             | 64 +++++++++++++++++++
 src/backend/utils/misc/guc_tables.c           | 10 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 10 ++-
 src/test/regress/sql/collate.icu.utf8.sql     |  6 +-
 9 files changed, 112 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6..44e9924f6b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9776,6 +9776,23 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-icu-locale-validation" xreflabel="icu_locale_validation">
+      <term><varname>icu_locale_validation</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>icu_locale_validation</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Validation is performed on an ICU locale specified for a new collation
+        or database. If this parameter is set to <literal>true</literal>, an
+        error is raised for a validation failure; if set to
+        <literal>false</literal>, a warning is issued. The default is
+        <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index bc0e48d507..592718fb50 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -284,12 +284,8 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 
 					colliculocale = langtag;
 				}
-				else
-				{
-					ereport(WARNING,
-							(errmsg("could not convert locale \"%s\" to language tag",
-									colliculocale)));
-				}
+
+				icu_validate_locale(colliculocale);
 			}
 #else
 			ereport(ERROR,
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5935477f44..600b3a0f61 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1079,12 +1079,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 				dbiculocale = langtag;
 			}
-			else
-			{
-				ereport(WARNING,
-						(errmsg("could not convert locale \"%s\" to language tag",
-								dbiculocale)));
-			}
+
+			icu_validate_locale(dbiculocale);
 		}
 #else
 		ereport(ERROR,
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 45a7f77b2d..b655bdc142 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -88,6 +88,7 @@
 
 #define		MAX_L10N_DATA		80
 
+extern bool icu_locale_validation;
 
 /* GUC settings */
 char	   *locale_messages;
@@ -2876,6 +2877,69 @@ icu_language_tag(const char *loc_str, bool noError)
 	return langtag;
 }
 
+/*
+ * Perform best-effort check that the locale is a valid one.
+ */
+void
+icu_validate_locale(const char *loc_str)
+{
+	UErrorCode	 status;
+	int			 elevel	 = icu_locale_validation ? ERROR : WARNING;
+	char		*langtag = icu_language_tag(loc_str, true);
+	char		 lang[ULOC_LANG_CAPACITY];
+
+	/* check that it can be converted to a language tag */
+	if (langtag == NULL)
+	{
+		ereport(elevel,
+				(errmsg("could not convert locale \"%s\" to language tag",
+						loc_str)));
+		return;
+	}
+	pfree(langtag);
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(elevel,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+		return;
+	}
+
+	/* check for special languages */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; i < uloc_countAvailable(); i++)
+	{
+		const char	*otherloc = uloc_getAvailable(i);
+		char		 otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			return;
+	}
+
+	ereport(elevel,
+			(errmsg("language \"%s\" of locale \"%s\" not found",
+					lang, loc_str)));
+}
+
 #endif							/* USE_ICU */
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..1c63ed0d21 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -481,6 +481,7 @@ char	   *event_source;
 
 bool		row_security;
 bool		check_function_bodies = true;
+bool		icu_locale_validation = false;
 
 /*
  * This GUC exists solely for backward compatibility, check its definition for
@@ -1586,6 +1587,15 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"icu_locale_validation", PGC_USERSET, CLIENT_CONN_LOCALE,
+		 gettext_noop("Raise an error for invalid ICU locale strings."),
+		 NULL
+		},
+		&icu_locale_validation,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"array_nulls", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("Enable input of NULL elements in arrays."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..cff927e8be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,8 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_locale_validation = off		# validate ICU locale strings
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 999122b6f4..125fa7f4a9 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -120,6 +120,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
 extern char *icu_language_tag(const char *loc_str, bool noError);
+extern void icu_validate_locale(const char *loc_str);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 9715c12c93..83253b9666 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1092,8 +1092,16 @@ $$;
 RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
 NOTICE:  using language tag "nonsense" for locale "nonsense"
+ERROR:  language "nonsense" of locale "nonsense" not found
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR:  could not convert locale "@colStrength=primary;nonsense=yes" to language tag
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
+WARNING:  language "nonsense" of locale "nonsense" not found
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index ddd1ab7e66..a8d9917208 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -386,7 +386,11 @@ $$;
 RESET client_min_messages;
 
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); DROP COLLATION testx;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
-- 
2.34.1

#28

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#27)

Re: ICU locale validation / canonicalization

On 17.03.23 18:55, Jeff Davis wrote:

On Wed, 2023-03-15 at 15:18 -0700, Jeff Davis wrote:

I left out the validation patch for now, and I'm evaluating a
different
approach that will attempt to match to the locales retrieved with
uloc_countAvailable()/uloc_getAvailable().

I like this approach, attached new patch series with that included as
0006.

I have looked at the first three patches. I think we want what those
patches do.

[PATCH v6 1/6] Support language tags in older ICU versions (53 and
earlier).

In pg_import_system_collations(), this is now redundant and can be
simplified:

-		if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+		if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))

icu_set_collation_attributes() needs more commenting about what is going
on. My guess is that uloc_canonicalize() converts from language tag to
ICU locale ID, and then the existing logic to parse that apart would
apply. Is that how it works?

[PATCH v6 2/6] Wrap ICU ucol_open().

It makes sense to try to unify some of this. But I find the naming
confusing. If I see pg_ucol_open(), then I would expect that all calls
to ucol_open() would be replaced by this. But here it's only a few,
without explanation. (pg_ucol_open() has no explanation at all AFAICT.)

I have in my notes that check_icu_locale() and make_icu_collator()
should be combined into a single function. I think that would be a
better way to slice it.

Btw., I had intentionally not written code like this

+#if U_ICU_VERSION_MAJOR_NUM < 54
+	icu_set_collation_attributes(collator, loc_str);
+#endif

The disadvantage of doing it that way is that you then need to dig out
an old version of ICU in order to check whether the code compiles at
all. With the current code, you can be sure that that code compiles if
you make changes elsewhere.

[PATCH v6 3/6] Handle the "und" locale in ICU versions 54 and older.

This makes sense, but the same comment about not #if'ing out code for
old ICU versions applies here.

The

+#ifdef USE_ICU
+

before pg_ucol_open() probably belongs in patch 2.

#29

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#28)

6 attachment(s)

Re: ICU locale validation / canonicalization

On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote:

[PATCH v6 1/6] Support language tags in older ICU versions (53 and
earlier).

In pg_import_system_collations(), this is now redundant and can be
simplified:
-               if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+               if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
icu_set_collation_attributes() needs more commenting about what is
going
on. My guess is that uloc_canonicalize() converts from language tag
to
ICU locale ID, and then the existing logic to parse that apart would
apply. Is that how it works?

Fixed the redundancy, added some comments, and committed 0001.

[PATCH v6 2/6] Wrap ICU ucol_open().

It makes sense to try to unify some of this. But I find the naming
confusing. If I see pg_ucol_open(), then I would expect that all
calls
to ucol_open() would be replaced by this. But here it's only a few,
without explanation. (pg_ucol_open() has no explanation at all
AFAICT.)

The remaining callsite which doesn't use the wrapper is in initdb.c,
which can't call into pg_locale.c, and has different intentions. initdb
uses ucol_open to get the default locale if icu_locale is not
specified; and it also uses ucol open to verify that the locale can be
opened (whether specified or the default). (Aside: I created a tiny
0004 patch which makes this difference more clear and adds a nice
comment.)

There's no reason to use a wrapper when getting the default locale,
because it's just passing NULL anyway.

When verifying that the locale can be opened, ucol_open() doesn't catch
many problems anyway, so I'm not sure it's worth a lot of effort to
copy these extra checks that the wrapper does into initdb.c. For
instance, what's the value in replacing "und" with "root" if opening
either will succeed? Parsing the attributes can potentially catch
problems, but the later patch 0006 will check the attributes when
converting to a language tag at initdb time.

So I'm inclined to just leave initdb alone in patches 0002 and 0003.

I have in my notes that check_icu_locale() and make_icu_collator()
should be combined into a single function. I think that would be a
better way to slice it.

That would leave out get_collation_actual_version(), which should
handle the same fixups for attributes and the "und" locale.

Btw., I had intentionally not written code like this
+#if U_ICU_VERSION_MAJOR_NUM < 54
+       icu_set_collation_attributes(collator, loc_str);
+#endif
The disadvantage of doing it that way is that you then need to dig
out
an old version of ICU in order to check whether the code compiles at
all. With the current code, you can be sure that that code compiles
if
you make changes elsewhere.

I was wondering about that -- thank you, I changed it back to use "if"
rather than "#ifdef".

New series attached (starting at 0002 to better correspond to the
previous series).

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v7-0002-Wrap-ICU-ucol_open.patchtext/x-patch; charset=UTF-8; name=v7-0002-Wrap-ICU-ucol_open.patchDownload

From fbe03dc596b5e12f4dda60269e044caa58f8be32 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 21:21:17 -0700
Subject: [PATCH v7 2/7] Wrap ICU ucol_open().

Hide details of supporting older ICU versions in a wrapper
function. The current code only needs to handle
icu_set_collation_attributes(), but a subsequent commit will add
additional version-specific code.
---
 src/backend/utils/adt/pg_locale.c | 70 +++++++++++++++++++------------
 1 file changed, 43 insertions(+), 27 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index c3ede994be..dd0786dff5 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -140,6 +140,7 @@ static char *IsoLocaleName(const char *);
  */
 static UConverter *icu_converter = NULL;
 
+static UCollator *pg_ucol_open(const char *loc_str);
 static void init_icu_converter(void);
 static size_t uchar_length(UConverter *converter,
 						   const char *str, int32_t len);
@@ -1430,17 +1431,8 @@ make_icu_collator(const char *iculocstr,
 {
 #ifdef USE_ICU
 	UCollator  *collator;
-	UErrorCode	status;
 
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
+	collator = pg_ucol_open(iculocstr);
 
 	/*
 	 * If rules are specified, we extract the rules of the standard collation,
@@ -1451,6 +1443,7 @@ make_icu_collator(const char *iculocstr,
 		const UChar *default_rules;
 		UChar	   *agg_rules;
 		UChar	   *my_rules;
+		UErrorCode	status;
 		int32_t		length;
 
 		default_rules = ucol_getRules(collator, &length);
@@ -1722,16 +1715,11 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	if (collprovider == COLLPROVIDER_ICU)
 	{
 		UCollator  *collator;
-		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
 
-		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
-		if (U_FAILURE(status))
-			ereport(ERROR,
-					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
+		collator = pg_ucol_open(collcollate);
+
 		ucol_getVersion(collator, versioninfo);
 		ucol_close(collator);
 
@@ -2505,6 +2493,43 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 }
 
 #ifdef USE_ICU
+
+/*
+ * Wrapper around ucol_open() to handle API differences for older ICU
+ * versions.
+ */
+static UCollator *
+pg_ucol_open(const char *loc_str)
+{
+	UCollator  *collator;
+	UErrorCode	status;
+
+	/*
+	 * Must never open default collator, because it depends on the environment
+	 * and may change at any time.
+	 *
+	 * NB: the default collator is not the same as the collator for the root
+	 * locale. The root locale may be specified as the empty string, "und", or
+	 * "root". The default collator is opened by passing NULL to ucol_open().
+	 */
+	if (loc_str == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("opening default collator is not supported")));
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(loc_str, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("could not open collator for locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+
+	if (U_ICU_VERSION_MAJOR_NUM < 54)
+		icu_set_collation_attributes(collator, loc_str);
+
+	return collator;
+}
+
 static void
 init_icu_converter(void)
 {
@@ -2771,17 +2796,8 @@ check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
 	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
 
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
+	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
 #else
 	ereport(ERROR,
-- 
2.34.1

v7-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patchtext/x-patch; charset=UTF-8; name=v7-0003-Handle-the-und-locale-in-ICU-versions-54-and-olde.patchDownload

From 8027572146571609927815d0fe14f761fc86cf2c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 14 Mar 2023 22:28:21 -0700
Subject: [PATCH v7 3/7] Handle the "und" locale in ICU versions 54 and older.

The "und" locale is an alternative spelling of the root locale, but it
was not recognized until ICU 55. To maintain common behavior across
all supported ICU versions, check for "und" and replace with "root"
before opening.

Previously, the lack of support for "und" was dangerous, because
versions 54 and older fall back to the environment when a locale is
not found. If the user specified "und" for the language (which is
expected and documented), it could not only resolve to the wrong
collator, but it could unexpectedly change (which could lead to
corrupt indexes).

This effectively reverts commit d72900bded, which worked around the
problem for the built-in "unicode" collation, and is no longer
necessary.

Discussion: https://postgr.es/m/60da0cecfb512a78b8666b31631a636215d8ce73.camel@j-davis.com
Discussion: https://postgr.es/m/0c6fa66f2753217d2a40480a96bd2ccf023536a1.camel@j-davis.com
---
 src/backend/utils/adt/pg_locale.c             | 38 ++++++++++++++++++-
 src/bin/initdb/initdb.c                       |  2 +-
 .../regress/expected/collate.icu.utf8.out     |  7 ++++
 src/test/regress/sql/collate.icu.utf8.sql     |  2 +
 4 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index dd0786dff5..052db11413 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2501,8 +2501,9 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 static UCollator *
 pg_ucol_open(const char *loc_str)
 {
-	UCollator  *collator;
-	UErrorCode	status;
+	UCollator	*collator;
+	UErrorCode	 status;
+	char		*fixed_str = NULL;
 
 	/*
 	 * Must never open default collator, because it depends on the environment
@@ -2517,6 +2518,36 @@ pg_ucol_open(const char *loc_str)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("opening default collator is not supported")));
 
+	/*
+	 * In ICU versions 54 and earlier, "und" is not a recognized spelling of
+	 * the root locale. If the first component of the locale is "und", replace
+	 * with "root" before opening.
+	 */
+	if (U_ICU_VERSION_MAJOR_NUM < 55)
+	{
+		char  lang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(ERROR,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+		}
+
+		if (strcmp(lang, "und") == 0)
+		{
+			const char *remainder = loc_str + strlen("und");
+
+			fixed_str = palloc(strlen("root") + strlen(remainder) + 1);
+			strcpy(fixed_str, "root");
+			strcat(fixed_str, remainder);
+
+			loc_str = fixed_str;
+		}
+	}
+
 	status = U_ZERO_ERROR;
 	collator = ucol_open(loc_str, &status);
 	if (U_FAILURE(status))
@@ -2527,6 +2558,9 @@ pg_ucol_open(const char *loc_str)
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
 		icu_set_collation_attributes(collator, loc_str);
 
+	if (fixed_str != NULL)
+		pfree(fixed_str);
+
 	return collator;
 }
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 68d430ed63..d48b7b6060 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1498,7 +1498,7 @@ setup_collation(FILE *cmdfd)
 	 * that they win if libc defines a locale with the same name.
 	 */
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, colliculocale)"
-				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, '');\n\n",
+				  "VALUES (pg_nextoid('pg_catalog.pg_collation', 'oid', 'pg_catalog.pg_collation_oid_index'), 'unicode', 'pg_catalog'::regnamespace, %u, '%c', true, -1, 'und');\n\n",
 				  BOOTSTRAP_SUPERUSERID, COLLPROVIDER_ICU);
 
 	PG_CMD_PRINTF("INSERT INTO pg_collation (oid, collname, collnamespace, collowner, collprovider, collisdeterministic, collencoding, collcollate, collctype)"
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 6225b575ce..f135200c99 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1312,6 +1312,13 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  t
 (1 row)
 
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
+ ?column? 
+----------
+ t
+(1 row)
+
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
 CREATE TABLE test3cs (x text COLLATE case_sensitive);
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 64cbfd0a5b..8105ebc8ae 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -521,6 +521,8 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
+CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
 
 CREATE TABLE test1cs (x text COLLATE case_sensitive);
 CREATE TABLE test2cs (x text COLLATE case_sensitive);
-- 
2.34.1

v7-0004-Accept-C-POSIX-locales-when-converting-to-languag.patchtext/x-patch; charset=UTF-8; name=v7-0004-Accept-C-POSIX-locales-when-converting-to-languag.patchDownload

From 537a53c0504b655a3d91156bfff3d5effcf21e06 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 11:27:12 -0700
Subject: [PATCH v7 4/7] Accept C/POSIX locales when converting to language
 tag.

Account for locales "C" and "POSIX", which correspond to the
language tag "en-US-u-va-posix".

Add a SQL function pg_language_tag() that performs the conversion.

Also, don't rely on a fixed-size buffer for language tags, as there is
no defined upper limit (cf. RFC 5646 section 4.4).
---
 doc/src/sgml/func.sgml                        | 15 ++++
 src/backend/commands/collationcmds.c          | 44 ++++++------
 src/backend/utils/adt/pg_locale.c             | 68 +++++++++++++++++++
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/catalog/catversion.h              |  2 +-
 src/include/catalog/pg_proc.dat               |  5 ++
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 55 +++++++++++++++
 src/test/regress/sql/collate.icu.utf8.sql     | 10 +++
 9 files changed, 180 insertions(+), 24 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a3a13b895f..35cecc24c8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27530,6 +27530,21 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset
         Use of this function is restricted to superusers.
        </para></entry>
       </row>
+
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_language_tag</primary>
+        </indexterm>
+        <function>pg_language_tag</function> ( <parameter>locale</parameter> <type>text</type> )
+        <returnvalue>text</returnvalue>
+       </para>
+       <para>
+        Canonicalizes the given <parameter>locale</parameter> string into a
+        BCP 47 language tag (see <xref
+        linkend="collation-managing-create-icu"/>).
+       </para></entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 3d0aea0568..ca1d46669f 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -576,26 +576,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -957,7 +937,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, false);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
@@ -1014,3 +994,25 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(ncreated);
 }
+
+/*
+ * pg_language_tag
+ *
+ * Return the BCP47 language tag representation of the given locale string.
+ */
+Datum
+pg_language_tag(PG_FUNCTION_ARGS)
+{
+#ifdef USE_ICU
+	text	*locale_text = PG_GETARG_TEXT_PP(0);
+	char	*locale_cstr = text_to_cstring(locale_text);
+	char	*langtag	 = icu_language_tag(locale_cstr, false);
+
+	PG_RETURN_TEXT_P(cstring_to_text(langtag));
+#else
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ICU is not supported in this build")));
+	PG_RETURN_NULL();
+#endif
+}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 052db11413..baafc71a3d 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2840,6 +2840,74 @@ check_icu_locale(const char *icu_locale)
 #endif
 }
 
+#ifdef USE_ICU
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, bool noError)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(ERROR,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+		if (len < buflen || buflen >= MaxAllocSize)
+			break;
+
+		buflen = Min(buflen * 2, MaxAllocSize);
+		langtag = repalloc(langtag, buflen);
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+		if (noError)
+			return NULL;
+
+		ereport(ERROR,
+				(errmsg("could not convert locale name \"%s\" to language tag: %s",
+						loc_str, u_errorName(status))));
+	}
+
+	return langtag;
+}
+
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a22f27f300..0b38c0537b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1837,9 +1837,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index e94528a7c7..d993539dfe 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202303181
+#define CATALOG_VERSION_NO	202303211
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5cf87aeb2c..43db94557d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11838,6 +11838,11 @@
   proname => 'pg_database_collation_actual_version', procost => '100',
   provolatile => 'v', prorettype => 'text', proargtypes => 'oid',
   prosrc => 'pg_database_collation_actual_version' },
+{ oid => '6273',
+  descr => 'get BCP47 language tag representation of locale',
+  proname => 'pg_language_tag', procost => '100',
+  provolatile => 's', prorettype => 'text', proargtypes => 'text',
+  prosrc => 'pg_language_tag' },
 
 # system management/monitoring related functions
 { oid => '3353', descr => 'list files in the log directory',
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index dd822a68be..ae9077c9bc 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -121,6 +121,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_language_tag(const char *loc_str, bool noError);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f135200c99..d8e6240cd7 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -11,6 +11,61 @@ SELECT getdatabaseencoding() <> 'UTF8' OR
 SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
+-- test language tag canonicalization
+SELECT pg_language_tag('en_US');
+ pg_language_tag 
+-----------------
+ en-US
+(1 row)
+
+SELECT pg_language_tag('nonsense');
+ pg_language_tag 
+-----------------
+ nonsense
+(1 row)
+
+SELECT pg_language_tag('C.UTF-8');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('POSIX');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('en_US_POSIX');
+ pg_language_tag  
+------------------
+ en-US-u-va-posix
+(1 row)
+
+SELECT pg_language_tag('@colStrength=secondary');
+ pg_language_tag 
+-----------------
+ und-u-ks-level2
+(1 row)
+
+SELECT pg_language_tag('');
+ pg_language_tag 
+-----------------
+ und
+(1 row)
+
+SELECT pg_language_tag('fr_CA.UTF-8');
+ pg_language_tag 
+-----------------
+ fr-CA
+(1 row)
+
+SELECT pg_language_tag('en_US@colStrength=primary');
+  pg_language_tag  
+-------------------
+ en-US-u-ks-level1
+(1 row)
+
 CREATE TABLE collate_test1 (
     a int,
     b text COLLATE "en-x-icu" NOT NULL
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8105ebc8ae..c7241c739a 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -15,6 +15,16 @@ SET client_encoding TO UTF8;
 CREATE SCHEMA collate_tests;
 SET search_path = collate_tests;
 
+-- test language tag canonicalization
+SELECT pg_language_tag('en_US');
+SELECT pg_language_tag('nonsense');
+SELECT pg_language_tag('C.UTF-8');
+SELECT pg_language_tag('POSIX');
+SELECT pg_language_tag('en_US_POSIX');
+SELECT pg_language_tag('@colStrength=secondary');
+SELECT pg_language_tag('');
+SELECT pg_language_tag('fr_CA.UTF-8');
+SELECT pg_language_tag('en_US@colStrength=primary');
 
 CREATE TABLE collate_test1 (
     a int,
-- 
2.34.1

v7-0005-initdb-emit-message-when-using-default-ICU-locale.patchtext/x-patch; charset=UTF-8; name=v7-0005-initdb-emit-message-when-using-default-ICU-locale.patchDownload

From c39c714fecd2fcf304e2fc8a120b9f6851cd78db Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 22 Mar 2023 10:06:23 -0700
Subject: [PATCH v7 5/7] initdb: emit message when using default ICU locale.

Also, minor cleanup to separate the code that chooses the default ICU
locale from the code that verifies that a specified locale can be
opened with ucol_open(). This cleanup creates a better place for an
important comment.
---
 src/bin/initdb/initdb.c | 77 ++++++++++++++++++++++++++++-------------
 1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index d48b7b6060..7f857f6075 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2039,46 +2039,73 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+#ifdef USE_ICU
+
 /*
- * Check that ICU accepts the locale name; or if not specified, retrieve the
- * default ICU locale.
+ * Determine default ICU locale by opening the default collator and reading
+ * its locale.
+ *
+ * NB: The default collator (opened using NULL) is different from the collator
+ * for the root locale (opened with "", "und", or "root"). The former depends
+ * on the environment (useful at initdb time) and the latter does not.
  */
-static void
-check_icu_locale(void)
+static char *
+default_icu_locale(void)
 {
-#ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	const char	*valid_locale;
+	char		*default_locale;
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = ucol_open(NULL, &status);
+	if (U_FAILURE(status))
+		pg_fatal("could not open collator for default locale: %s",
+				 u_errorName(status));
+
+	status = U_ZERO_ERROR;
+	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+										&status);
 	if (U_FAILURE(status))
 	{
-		if (icu_locale)
-			pg_fatal("could not open collator for locale \"%s\": %s",
-					 icu_locale, u_errorName(status));
-		else
-			pg_fatal("could not open collator for default locale: %s",
-					 u_errorName(status));
+		ucol_close(collator);
+		pg_fatal("could not determine default ICU locale");
 	}
 
-	/* if not specified, get locale from default collator */
-	if (icu_locale == NULL)
-	{
-		const char	*default_locale;
+	default_locale = pg_strdup(valid_locale);
 
-		status = U_ZERO_ERROR;
-		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
-											  &status);
-		if (U_FAILURE(status))
-		{
-			ucol_close(collator);
-			pg_fatal("could not determine default ICU locale");
-		}
+	ucol_close(collator);
 
-		icu_locale = pg_strdup(default_locale);
+	return default_locale;
+}
+
+#endif
+
+/*
+ * If not specified, assign the default locale. Then check that ICU accepts
+ * the locale.
+ */
+static void
+check_icu_locale(void)
+{
+#ifdef USE_ICU
+	UCollator	*collator;
+	UErrorCode   status;
+
+	/* acquire default locale from the environment, if not specified */
+	if (icu_locale == NULL)
+	{
+		icu_locale = default_icu_locale();
+		printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 	}
 
+	/* check that the resulting locale can be opened */
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+		pg_fatal("could not open collator for locale \"%s\": %s",
+				 icu_locale, u_errorName(status));
+
 	ucol_close(collator);
 #endif
 }
-- 
2.34.1

v7-0006-Canonicalize-ICU-locale-names-to-language-tags.patchtext/x-patch; charset=UTF-8; name=v7-0006-Canonicalize-ICU-locale-names-to-language-tags.patchDownload

From ba447ed36dc028e9fb1a0a63d392531223bb8ffc Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v7 6/7] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
---
 src/backend/commands/collationcmds.c          | 38 +++++++++++
 src/backend/commands/dbcommands.c             | 32 +++++++++
 src/backend/utils/adt/pg_locale.c             |  9 ---
 src/bin/initdb/initdb.c                       | 68 ++++++++++++++++++-
 .../regress/expected/collate.icu.utf8.out     | 25 ++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 13 ++++
 6 files changed, 173 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index ca1d46669f..bcddd1d536 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -254,10 +259,43 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
+			char	*langtag;
+
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			check_icu_locale(colliculocale);
+
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+				langtag = icu_language_tag(colliculocale, true);
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+				else
+				{
+					ereport(WARNING,
+							(errmsg("could not convert locale \"%s\" to language tag",
+									colliculocale)));
+				}
+			}
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 4d5d5d6866..5935477f44 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
+		char	*langtag;
+
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1059,6 +1062,35 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					 errmsg("ICU locale must be specified")));
 
 		check_icu_locale(dbiculocale);
+
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			langtag = icu_language_tag(dbiculocale, true);
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+			else
+			{
+				ereport(WARNING,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								dbiculocale)));
+			}
+		}
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index baafc71a3d..7a5376dc75 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2820,27 +2820,18 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 	pfree(lower_str);
 }
 
-#endif							/* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 
 	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
-#ifdef USE_ICU
 /*
  * Return the BCP47 language tag representation of the requested locale.
  *
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7f857f6075..609bd9ea1e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2079,11 +2079,67 @@ default_icu_locale(void)
 	return default_locale;
 }
 
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+		if (len < buflen)
+			break;
+
+		buflen = buflen * 2;
+		langtag = pg_realloc(langtag, buflen);
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+}
+
 #endif
 
 /*
- * If not specified, assign the default locale. Then check that ICU accepts
- * the locale.
+ * If not specified, assign the default locale. Then convert to a language
+ * tag, and check that ICU accepts it.
  */
 static void
 check_icu_locale(void)
@@ -2091,6 +2147,7 @@ check_icu_locale(void)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	char		*langtag;
 
 	/* acquire default locale from the environment, if not specified */
 	if (icu_locale == NULL)
@@ -2099,6 +2156,13 @@ check_icu_locale(void)
 		printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 	}
 
+	/* canonicalize to a language tag */
+	langtag = icu_language_tag(icu_locale);
+	printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+		   langtag, icu_locale);
+	pg_free(icu_locale);
+	icu_locale = langtag;
+
 	/* check that the resulting locale can be opened */
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index d8e6240cd7..730decc4cb 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1074,6 +1074,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1088,9 +1089,11 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1217,14 +1220,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1232,7 +1239,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1240,13 +1249,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1258,6 +1270,7 @@ ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUME
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1266,6 +1279,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1293,10 +1307,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using language tag "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1346,7 +1363,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1361,6 +1380,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+NOTICE:  using language tag "en-u-ks-level1" for locale "en-u-ks-level1"
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  ?column? 
 ----------
@@ -1368,6 +1388,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
 (1 row)
 
 CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"
 SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
  ?column? 
 ----------
@@ -1828,7 +1849,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index c7241c739a..5f3a88a404 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -367,6 +367,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -380,6 +382,9 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
 
@@ -464,10 +469,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -477,7 +486,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -666,7 +677,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1

v7-0007-Validate-ICU-locales.patchtext/x-patch; charset=UTF-8; name=v7-0007-Validate-ICU-locales.patchDownload

From 9c2929ef42f9d3c53eaf00c4390b087594475740 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 17 Mar 2023 09:55:31 -0700
Subject: [PATCH v7 7/7] Validate ICU locales.

Ensure that it can be transformed into a language tag in "strict" mode
(which validates the attributes), and also that the language exists in
ICU.

Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important in ICU versions 54 and earlier, where
the same (misspelled) locale string could fall back to different
locales depending on the environment.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.camel@j-davis.com
---
 doc/src/sgml/config.sgml                      | 17 +++++
 src/backend/commands/collationcmds.c          |  8 +--
 src/backend/commands/dbcommands.c             |  8 +--
 src/backend/utils/adt/pg_locale.c             | 64 +++++++++++++++++++
 src/backend/utils/misc/guc_tables.c           | 10 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 10 ++-
 src/test/regress/sql/collate.icu.utf8.sql     |  6 +-
 9 files changed, 112 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 481f93cea1..78eae3ca65 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9784,6 +9784,23 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-icu-locale-validation" xreflabel="icu_locale_validation">
+      <term><varname>icu_locale_validation</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>icu_locale_validation</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Validation is performed on an ICU locale specified for a new collation
+        or database. If this parameter is set to <literal>true</literal>, an
+        error is raised for a validation failure; if set to
+        <literal>false</literal>, a warning is issued. The default is
+        <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index bcddd1d536..90f7aabc88 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -284,12 +284,8 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 
 					colliculocale = langtag;
 				}
-				else
-				{
-					ereport(WARNING,
-							(errmsg("could not convert locale \"%s\" to language tag",
-									colliculocale)));
-				}
+
+				icu_validate_locale(colliculocale);
 			}
 #else
 			ereport(ERROR,
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5935477f44..600b3a0f61 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1079,12 +1079,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 				dbiculocale = langtag;
 			}
-			else
-			{
-				ereport(WARNING,
-						(errmsg("could not convert locale \"%s\" to language tag",
-								dbiculocale)));
-			}
+
+			icu_validate_locale(dbiculocale);
 		}
 #else
 		ereport(ERROR,
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 7a5376dc75..b651b99707 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -88,6 +88,7 @@
 
 #define		MAX_L10N_DATA		80
 
+extern bool icu_locale_validation;
 
 /* GUC settings */
 char	   *locale_messages;
@@ -2897,6 +2898,69 @@ icu_language_tag(const char *loc_str, bool noError)
 	return langtag;
 }
 
+/*
+ * Perform best-effort check that the locale is a valid one.
+ */
+void
+icu_validate_locale(const char *loc_str)
+{
+	UErrorCode	 status;
+	int			 elevel	 = icu_locale_validation ? ERROR : WARNING;
+	char		*langtag = icu_language_tag(loc_str, true);
+	char		 lang[ULOC_LANG_CAPACITY];
+
+	/* check that it can be converted to a language tag */
+	if (langtag == NULL)
+	{
+		ereport(elevel,
+				(errmsg("could not convert locale \"%s\" to language tag",
+						loc_str)));
+		return;
+	}
+	pfree(langtag);
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(elevel,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+		return;
+	}
+
+	/* check for special languages */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; i < uloc_countAvailable(); i++)
+	{
+		const char	*otherloc = uloc_getAvailable(i);
+		char		 otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			return;
+	}
+
+	ereport(elevel,
+			(errmsg("language \"%s\" of locale \"%s\" not found",
+					lang, loc_str)));
+}
+
 #endif							/* USE_ICU */
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..1c63ed0d21 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -481,6 +481,7 @@ char	   *event_source;
 
 bool		row_security;
 bool		check_function_bodies = true;
+bool		icu_locale_validation = false;
 
 /*
  * This GUC exists solely for backward compatibility, check its definition for
@@ -1586,6 +1587,15 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"icu_locale_validation", PGC_USERSET, CLIENT_CONN_LOCALE,
+		 gettext_noop("Raise an error for invalid ICU locale strings."),
+		 NULL
+		},
+		&icu_locale_validation,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"array_nulls", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("Enable input of NULL elements in arrays."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..cff927e8be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,8 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_locale_validation = off		# validate ICU locale strings
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index ae9077c9bc..076665dfc3 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -122,6 +122,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
 extern char *icu_language_tag(const char *loc_str, bool noError);
+extern void icu_validate_locale(const char *loc_str);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 730decc4cb..5eeceb7e02 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1092,8 +1092,16 @@ $$;
 RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
 NOTICE:  using language tag "nonsense" for locale "nonsense"
+ERROR:  language "nonsense" of locale "nonsense" not found
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR:  could not convert locale "@colStrength=primary;nonsense=yes" to language tag
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
+WARNING:  language "nonsense" of locale "nonsense" not found
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 5f3a88a404..7d2c91252c 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -386,7 +386,11 @@ $$;
 RESET client_min_messages;
 
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense'); DROP COLLATION testx;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
-- 
2.34.1

#30

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#29)

Re: ICU locale validation / canonicalization

On 22.03.23 19:05, Jeff Davis wrote:

On Tue, 2023-03-21 at 10:35 +0100, Peter Eisentraut wrote:
[PATCH v6 1/6] Support language tags in older ICU versions (53 and
earlier).

In pg_import_system_collations(), this is now redundant and can be
simplified:
-               if (!pg_is_ascii(langtag) || !pg_is_ascii(iculocstr))
+               if (!pg_is_ascii(langtag) || !pg_is_ascii(langtag))
icu_set_collation_attributes() needs more commenting about what is
going
on. My guess is that uloc_canonicalize() converts from language tag
to
ICU locale ID, and then the existing logic to parse that apart would
apply. Is that how it works?
Fixed the redundancy, added some comments, and committed 0001.

So, does uloc_canonicalize() always convert to ICU locale IDs? What if
you pass a language tag, does it convert it to ICU locale ID as well?

[PATCH v6 2/6] Wrap ICU ucol_open().

So I'm inclined to just leave initdb alone in patches 0002 and 0003.

0002 and 0003 look ok to me now.

In 0002, the error "opening default collator is not supported", should
that be an assert or an elog? Is it reachable by the user?

You might want to check the declarations at the top of pg_ucol_open().
0003 reformats them after they were just added in 0002. Maybe check
that they are pgindent'ed in 0002 properly.

I don't understand patch 0004. It seems to do two things, handle
C/POSIX locale specifications and add an SQL-callable function. Are
those connected?

#31

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#30)

Re: ICU locale validation / canonicalization

On Thu, 2023-03-23 at 07:27 +0100, Peter Eisentraut wrote:

So, does uloc_canonicalize() always convert to ICU locale IDs? What
if
you pass a language tag, does it convert it to ICU locale ID as well?

Yes.

The documentation is not clear on that point, but my testing shows that
it does. And this is only for old versions of the code, so we don't
need to worry about later versions of ICU changing that.

I thought about using uloc_forLanguageTag(), but the documentation for
that is not clear what formats it accepts as an input, so it doesn't
seem like a win. If wanted to be paranoid we could use
uloc_toLanguageTag() followed by uloc_forLanguageTag(), but that seemed
excessive.

0002 and 0003 look ok to me now.

Thank you, committed 0002 and 0003.

In 0002, the error "opening default collator is not supported",
should
that be an assert or an elog? Is it reachable by the user?

It's not reachable by the user, but could catch a bug if we
accidentally read a NULL field from the catalog or something like that.
It seemed a worthwhile check to leave in production builds.

You might want to check the declarations at the top of
pg_ucol_open().
0003 reformats them after they were just added in 0002. Maybe check
that they are pgindent'ed in 0002 properly.

They seem to be pgindented fine in 0002, it was unnecessarily
reindented in 0003 and I fixed that.

I use emacs "align-current" and generally that does the right thing,
but I'll rely more on pgindent in the future.

I don't understand patch 0004. It seems to do two things, handle
C/POSIX locale specifications and add an SQL-callable function. Are
those connected?

It's hard to test (or even exercise) the former without the latter.

I could get rid of the SQL-callable function and move the rest of the
changes into 0006. I'll see if that arrangement works better, and that
way we can add the SQL-callable function later (or perhaps not at all
if it's not desired).

Regards,
Jeff Davis

#32

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#31)

4 attachment(s)

Re: ICU locale validation / canonicalization

On Thu, 2023-03-23 at 10:16 -0700, Jeff Davis wrote:

I could get rid of the SQL-callable function and move the rest of the
changes into 0006. I'll see if that arrangement works better, and
that
way we can add the SQL-callable function later (or perhaps not at all
if it's not desired).

Attached a new series that doesn't include the SQL-callable function.
It's probably better to just wait and see what functions seem actually
useful to users.

I included a new small patch to fix a potential UCollator leak and make
the errors more consistent.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v8-0001-Avoid-potential-UCollator-leak-for-older-ICU-vers.patchtext/x-patch; charset=UTF-8; name=v8-0001-Avoid-potential-UCollator-leak-for-older-ICU-vers.patchDownload

From 7046e684b89fdae92fd5ea83fab12c5634cf3ba3 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 23 Mar 2023 21:50:47 -0700
Subject: [PATCH v8 1/4] Avoid potential UCollator leak for older ICU versions.

ICU versions 53 and earlier rely on icu_set_collation_attributes() to
process the attributes in the locale string. Refactor slightly to
avoid leaking the already-opened UCollator object if an error is
encountered.

Also centralize error reporting in pg_ucol_open() and make consistent
between versions.
---
 src/backend/utils/adt/pg_locale.c | 59 ++++++++++++++++++-------------
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 42b6ad45cb..f9bc30b85f 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -147,7 +147,8 @@ static size_t uchar_length(UConverter *converter,
 static int32_t uchar_convert(UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, int32_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(UCollator *collator, const char *loc,
+										 UErrorCode *status);
 #endif
 
 /*
@@ -2503,6 +2504,7 @@ pg_ucol_open(const char *loc_str)
 {
 	UCollator  *collator;
 	UErrorCode	status;
+	const char *orig_str = loc_str;
 	char	   *fixed_str = NULL;
 
 	/*
@@ -2552,11 +2554,27 @@ pg_ucol_open(const char *loc_str)
 	collator = ucol_open(loc_str, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
+				/* use original string for error report */
 				(errmsg("could not open collator for locale \"%s\": %s",
-						loc_str, u_errorName(status))));
+						orig_str, u_errorName(status))));
 
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, loc_str);
+	{
+		status = U_ZERO_ERROR;
+		icu_set_collation_attributes(collator, loc_str, &status);
+
+		/*
+		 * Pretend the error came from ucol_open(), for consistent error
+		 * message across ICU versions.
+		 */
+		if (U_FAILURE(status))
+		{
+			ucol_close(collator);
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							orig_str, u_errorName(status))));
+		}
+	}
 
 	if (fixed_str != NULL)
 		pfree(fixed_str);
@@ -2706,9 +2724,9 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(UCollator *collator, const char *loc,
+							 UErrorCode *status)
 {
-	UErrorCode	status;
 	int32_t		len;
 	char	   *icu_locale_id;
 	char	   *lower_str;
@@ -2721,15 +2739,15 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 	 * locale ID, e.g. "und@colcaselevel=yes;colstrength=primary", by
 	 * uloc_canonicalize().
 	 */
-	status = U_ZERO_ERROR;
-	len = uloc_canonicalize(loc, NULL, 0, &status);
+	*status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, NULL, 0, status);
 	icu_locale_id = palloc(len + 1);
-	status = U_ZERO_ERROR;
-	len = uloc_canonicalize(loc, icu_locale_id, len + 1, &status);
-	if (U_FAILURE(status))
+	*status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, icu_locale_id, len + 1, status);
+	if (U_FAILURE(*status))
 		ereport(ERROR,
 				(errmsg("canonicalization failed for locale string \"%s\": %s",
-						loc, u_errorName(status))));
+						loc, u_errorName(*status))));
 
 	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
 
@@ -2751,7 +2769,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			UColAttribute uattr;
 			UColAttributeValue uvalue;
 
-			status = U_ZERO_ERROR;
+			*status = U_ZERO_ERROR;
 
 			*e = '\0';
 			name = token;
@@ -2801,19 +2819,12 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			else if (strcmp(value, "upper") == 0)
 				uvalue = UCOL_UPPER_FIRST;
 			else
-				status = U_ILLEGAL_ARGUMENT_ERROR;
-
-			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
+			{
+				*status = U_ILLEGAL_ARGUMENT_ERROR;
+				break;
+			}
 
-			/*
-			 * Pretend the error came from ucol_open(), for consistent error
-			 * message across ICU versions.
-			 */
-			if (U_FAILURE(status))
-				ereport(ERROR,
-						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+			ucol_setAttribute(collator, uattr, uvalue, status);
 		}
 	}
 
-- 
2.34.1

v8-0002-initdb-emit-message-when-using-default-ICU-locale.patchtext/x-patch; charset=UTF-8; name=v8-0002-initdb-emit-message-when-using-default-ICU-locale.patchDownload

From 90e37d306320c0d4c33d12e89b28039570d05b30 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 22 Mar 2023 10:06:23 -0700
Subject: [PATCH v8 2/4] initdb: emit message when using default ICU locale.

Also, minor cleanup to separate the code that chooses the default ICU
locale from the code that verifies that a specified locale can be
opened with ucol_open(). This cleanup creates a better place for an
important comment.
---
 src/bin/initdb/initdb.c | 77 ++++++++++++++++++++++++++++-------------
 1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index bae97539fc..9bf094500d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2242,46 +2242,73 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+#ifdef USE_ICU
+
 /*
- * Check that ICU accepts the locale name; or if not specified, retrieve the
- * default ICU locale.
+ * Determine default ICU locale by opening the default collator and reading
+ * its locale.
+ *
+ * NB: The default collator (opened using NULL) is different from the collator
+ * for the root locale (opened with "", "und", or "root"). The former depends
+ * on the environment (useful at initdb time) and the latter does not.
  */
-static void
-check_icu_locale(void)
+static char *
+default_icu_locale(void)
 {
-#ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	const char	*valid_locale;
+	char		*default_locale;
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = ucol_open(NULL, &status);
+	if (U_FAILURE(status))
+		pg_fatal("could not open collator for default locale: %s",
+				 u_errorName(status));
+
+	status = U_ZERO_ERROR;
+	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+										&status);
 	if (U_FAILURE(status))
 	{
-		if (icu_locale)
-			pg_fatal("could not open collator for locale \"%s\": %s",
-					 icu_locale, u_errorName(status));
-		else
-			pg_fatal("could not open collator for default locale: %s",
-					 u_errorName(status));
+		ucol_close(collator);
+		pg_fatal("could not determine default ICU locale");
 	}
 
-	/* if not specified, get locale from default collator */
-	if (icu_locale == NULL)
-	{
-		const char	*default_locale;
+	default_locale = pg_strdup(valid_locale);
 
-		status = U_ZERO_ERROR;
-		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
-											  &status);
-		if (U_FAILURE(status))
-		{
-			ucol_close(collator);
-			pg_fatal("could not determine default ICU locale");
-		}
+	ucol_close(collator);
 
-		icu_locale = pg_strdup(default_locale);
+	return default_locale;
+}
+
+#endif
+
+/*
+ * If not specified, assign the default locale. Then check that ICU accepts
+ * the locale.
+ */
+static void
+check_icu_locale(void)
+{
+#ifdef USE_ICU
+	UCollator	*collator;
+	UErrorCode   status;
+
+	/* acquire default locale from the environment, if not specified */
+	if (icu_locale == NULL)
+	{
+		icu_locale = default_icu_locale();
+		printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 	}
 
+	/* check that the resulting locale can be opened */
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+		pg_fatal("could not open collator for locale \"%s\": %s",
+				 icu_locale, u_errorName(status));
+
 	ucol_close(collator);
 #endif
 }
-- 
2.34.1

v8-0003-Canonicalize-ICU-locale-names-to-language-tags.patchtext/x-patch; charset=UTF-8; name=v8-0003-Canonicalize-ICU-locale-names-to-language-tags.patchDownload

From c52d0196aa437ed82a03acf5fb7e94eb6fe41bb6 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v8 3/4] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
---
 src/backend/commands/collationcmds.c          | 60 ++++++++-----
 src/backend/commands/dbcommands.c             | 32 +++++++
 src/backend/utils/adt/pg_locale.c             | 87 +++++++++++++++++--
 src/bin/initdb/initdb.c                       | 79 ++++++++++++++++-
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 28 +++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 13 +++
 8 files changed, 269 insertions(+), 35 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 3d0aea0568..fe811229b3 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -254,10 +259,43 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
+			char	*langtag;
+
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+				langtag = icu_language_tag(colliculocale, WARNING);
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+				else
+				{
+					ereport(WARNING,
+							(errmsg("could not convert locale \"%s\" to language tag",
+									colliculocale)));
+				}
+			}
+
+			check_icu_locale(colliculocale);
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
@@ -576,26 +614,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -957,7 +975,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, ERROR);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 4d5d5d6866..436f6ec60d 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
+		char	*langtag;
+
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1058,7 +1061,36 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			langtag = icu_language_tag(dbiculocale, WARNING);
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+			else
+			{
+				ereport(WARNING,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								dbiculocale)));
+			}
+		}
+
 		check_icu_locale(dbiculocale);
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index f9bc30b85f..366ec71b9f 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2831,26 +2831,97 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 	pfree(lower_str);
 }
 
-#endif							/* USE_ICU */
-
 /*
  * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
  */
 void
 check_icu_locale(const char *icu_locale)
 {
-#ifdef USE_ICU
 	UCollator  *collator;
 
 	collator = pg_ucol_open(icu_locale);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, int elevel)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(elevel,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if ((status == U_BUFFER_OVERFLOW_ERROR ||
+			 (U_SUCCESS(status) && len >= buflen)) &&
+			buflen < MaxAllocSize)
+		{
+			buflen = Min(buflen * 2, MaxAllocSize);
+			langtag = repalloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+
+		ereport(elevel,
+				(errmsg("could not convert locale name \"%s\" to language tag: %s",
+						loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	return langtag;
+}
+
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9bf094500d..5cd146f1d0 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2282,11 +2282,78 @@ default_icu_locale(void)
 	return default_locale;
 }
 
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if (status == U_BUFFER_OVERFLOW_ERROR ||
+			(U_SUCCESS(status) && len >= buflen))
+		{
+			buflen = buflen * 2;
+			langtag = pg_realloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+}
+
 #endif
 
 /*
- * If not specified, assign the default locale. Then check that ICU accepts
- * the locale.
+ * If not specified, assign the default locale. Then convert to a language
+ * tag, and check that ICU accepts it.
  */
 static void
 check_icu_locale(void)
@@ -2294,6 +2361,7 @@ check_icu_locale(void)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	char		*langtag;
 
 	/* acquire default locale from the environment, if not specified */
 	if (icu_locale == NULL)
@@ -2302,6 +2370,13 @@ check_icu_locale(void)
 		printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 	}
 
+	/* canonicalize to a language tag */
+	langtag = icu_language_tag(icu_locale);
+	printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+		   langtag, icu_locale);
+	pg_free(icu_locale);
+	icu_locale = langtag;
+
 	/* check that the resulting locale can be opened */
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a22f27f300..0b38c0537b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1837,9 +1837,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index dd822a68be..471b111650 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -121,6 +121,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_language_tag(const char *loc_str, int elevel);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f135200c99..0d00df165a 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1019,6 +1019,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1033,9 +1034,11 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+NOTICE:  using language tag "nonsense" for locale "nonsense"
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
@@ -1162,14 +1165,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1177,7 +1184,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1185,13 +1194,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1199,10 +1211,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
 (1 row)
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR
+NOTICE:  using language tag "und-u-kn-lower" for locale "@colNumeric=lower"
+ERROR:  could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1211,6 +1225,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1238,10 +1253,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using language tag "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1291,7 +1309,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1306,6 +1326,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+NOTICE:  using language tag "en-u-ks-level1" for locale "en-u-ks-level1"
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  ?column? 
 ----------
@@ -1313,6 +1334,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
 (1 row)
 
 CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"
 SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
  ?column? 
 ----------
@@ -1773,7 +1795,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8105ebc8ae..3bd98868b4 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -357,6 +357,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -370,6 +372,9 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
 
@@ -454,10 +459,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -467,7 +476,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -656,7 +667,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1

v8-0004-Validate-ICU-locales.patchtext/x-patch; charset=UTF-8; name=v8-0004-Validate-ICU-locales.patchDownload

From 1279aa6b7eb92a01cf9a7eaa3f5f15d595f674ed Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 17 Mar 2023 09:55:31 -0700
Subject: [PATCH v8 4/4] Validate ICU locales.

Ensure that it can be transformed into a language tag in "strict" mode
(which validates the attributes), that the language exists in ICU, and
that it can be opened.

Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important to avoid such mistakes in ICU
versions 54 and earlier, where the same (misspelled) locale string
could fall back to different locales depending on the environment.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.camel@j-davis.com
---
 doc/src/sgml/config.sgml                      | 17 ++++
 src/backend/commands/collationcmds.c          | 14 ++--
 src/backend/commands/dbcommands.c             | 14 ++--
 src/backend/utils/adt/pg_locale.c             | 79 ++++++++++++++++---
 src/backend/utils/misc/guc_tables.c           |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/utils/pg_locale.h                 |  2 +
 .../regress/expected/collate.icu.utf8.out     | 12 ++-
 src/test/regress/sql/collate.icu.utf8.sql     |  6 +-
 9 files changed, 121 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 481f93cea1..78eae3ca65 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9784,6 +9784,23 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-icu-locale-validation" xreflabel="icu_locale_validation">
+      <term><varname>icu_locale_validation</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>icu_locale_validation</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Validation is performed on an ICU locale specified for a new collation
+        or database. If this parameter is set to <literal>true</literal>, an
+        error is raised for a validation failure; if set to
+        <literal>false</literal>, a warning is issued. The default is
+        <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index fe811229b3..2505ed559d 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -273,7 +273,9 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 			 */
 			if (!IsBinaryUpgrade)
 			{
-				langtag = icu_language_tag(colliculocale, WARNING);
+				int	elevel = icu_locale_validation ? ERROR : WARNING;
+
+				langtag = icu_language_tag(colliculocale, elevel);
 				if (langtag)
 				{
 					ereport(NOTICE,
@@ -282,15 +284,9 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 
 					colliculocale = langtag;
 				}
-				else
-				{
-					ereport(WARNING,
-							(errmsg("could not convert locale \"%s\" to language tag",
-									colliculocale)));
-				}
-			}
 
-			check_icu_locale(colliculocale);
+				icu_validate_locale(colliculocale);
+			}
 #else
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 436f6ec60d..40276af143 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1068,7 +1068,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		 */
 		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
 		{
-			langtag = icu_language_tag(dbiculocale, WARNING);
+			int	elevel = icu_locale_validation ? ERROR : WARNING;
+
+			langtag = icu_language_tag(dbiculocale, elevel);
 			if (langtag)
 			{
 				ereport(NOTICE,
@@ -1077,15 +1079,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 				dbiculocale = langtag;
 			}
-			else
-			{
-				ereport(WARNING,
-						(errmsg("could not convert locale \"%s\" to language tag",
-								dbiculocale)));
-			}
-		}
 
-		check_icu_locale(dbiculocale);
+			icu_validate_locale(dbiculocale);
+		}
 #else
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 366ec71b9f..7b962d31bc 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -88,13 +88,14 @@
 
 #define		MAX_L10N_DATA		80
 
-
 /* GUC settings */
 char	   *locale_messages;
 char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+bool		icu_locale_validation = false;
+
 /*
  * lc_time localization cache.
  *
@@ -2831,18 +2832,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 	pfree(lower_str);
 }
 
-/*
- * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
- */
-void
-check_icu_locale(const char *icu_locale)
-{
-	UCollator  *collator;
-
-	collator = pg_ucol_open(icu_locale);
-	ucol_close(collator);
-}
-
 /*
  * Return the BCP47 language tag representation of the requested locale.
  *
@@ -2920,6 +2909,70 @@ icu_language_tag(const char *loc_str, int elevel)
 	return langtag;
 }
 
+/*
+ * Perform best-effort check that the locale is a valid one.
+ */
+void
+icu_validate_locale(const char *loc_str)
+{
+	UCollator	*collator;
+	UErrorCode	 status;
+	int			 elevel	 = icu_locale_validation ? ERROR : WARNING;
+	char		*langtag;
+	char		 lang[ULOC_LANG_CAPACITY];
+	bool		 found	 = false;
+
+	/* check that it can be converted to a language tag */
+	langtag = icu_language_tag(loc_str, elevel);
+	pfree(langtag);
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(elevel,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+		return;
+	}
+
+	/* check for special language name */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		found = true;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; !found && i < uloc_countAvailable(); i++)
+	{
+		const char	*otherloc = uloc_getAvailable(i);
+		char		 otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			found = true;
+	}
+
+	if (!found)
+		ereport(elevel,
+				(errmsg("locale \"%s\" has unknown language \"%s\"",
+						loc_str, lang)));
+
+	/* check that it can be opened */
+	collator = pg_ucol_open(loc_str);
+	ucol_close(collator);
+}
+
 #endif							/* USE_ICU */
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..930bbc1877 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1586,6 +1586,15 @@ struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"icu_locale_validation", PGC_USERSET, CLIENT_CONN_LOCALE,
+		 gettext_noop("Raise an error for invalid ICU locale strings."),
+		 NULL
+		},
+		&icu_locale_validation,
+		false,
+		NULL, NULL, NULL
+	},
 	{
 		{"array_nulls", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("Enable input of NULL elements in arrays."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..cff927e8be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,8 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_locale_validation = off		# validate ICU locale strings
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 471b111650..a97735185a 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -40,6 +40,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT bool	 icu_locale_validation;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -122,6 +123,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
 extern char *icu_language_tag(const char *loc_str, int elevel);
+extern void icu_validate_locale(const char *loc_str);
 #endif
 extern void check_icu_locale(const char *icu_locale);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 0d00df165a..b6b806d753 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1037,8 +1037,16 @@ $$;
 RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
-NOTICE:  using language tag "nonsense" for locale "nonsense"
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+NOTICE:  using language tag "nonsense-nowhere" for locale "nonsense-nowhere"
+ERROR:  locale "nonsense-nowhere" has unknown language "nonsense"
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+NOTICE:  using language tag "nonsense-nowhere" for locale "nonsense-nowhere"
+WARNING:  locale "nonsense-nowhere" has unknown language "nonsense"
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 3bd98868b4..7871bf5386 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -376,7 +376,11 @@ $$;
 RESET client_min_messages;
 
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+SET icu_locale_validation = true;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+RESET icu_locale_validation;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
-- 
2.34.1

#33

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#31)

Re: ICU locale validation / canonicalization

On 23.03.23 18:16, Jeff Davis wrote:

In 0002, the error "opening default collator is not supported",
should
that be an assert or an elog? Is it reachable by the user?

It's not reachable by the user, but could catch a bug if we
accidentally read a NULL field from the catalog or something like that.
It seemed a worthwhile check to leave in production builds.

Then it ought to be an elog().

#34

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#32)

Re: ICU locale validation / canonicalization

On 24.03.23 07:39, Jeff Davis wrote:

On Thu, 2023-03-23 at 10:16 -0700, Jeff Davis wrote:

I could get rid of the SQL-callable function and move the rest of the
changes into 0006. I'll see if that arrangement works better, and
that
way we can add the SQL-callable function later (or perhaps not at all
if it's not desired).

Attached a new series that doesn't include the SQL-callable function.
It's probably better to just wait and see what functions seem actually
useful to users.

I included a new small patch to fix a potential UCollator leak and make
the errors more consistent.

[PATCH v8 1/4] Avoid potential UCollator leak for older ICU versions.

Couldn't we do this in a simpler way by just freeing the collator before
the ereport() calls. Or wrap a PG_TRY/PG_FINALLY around the whole thing?

It would be nicer to not make the callers of
icu_set_collation_attributes() responsible for catching and reporting
the errors.

[PATCH v8 2/4] initdb: emit message when using default ICU locale.

I'm not able to make initdb print this message. Under what
circumstances am I supposed to see this? Do you have some examples?

The function check_icu_locale() has now gotten a lot more functionality
than its name suggests. Maybe the part that assigns the default ICU
locale should be moved up one level to setlocales(), which has a better
name and does something similar for the libc locale realm.

[PATCH v8 3/4] Canonicalize ICU locale names to language tags.

I'm still on the fence about whether we actually want to do this, but
I'm warming up to it, now that the issues with pre-54 versions are fixed.

But if we do this, the documentation needs to be updated. There is a
bunch of text there that says, like, you can do this format or that
format, whatever you like. At least the guidance should be changed there.

[PATCH v8 4/4] Validate ICU locales.

I would make icu_locale_validation true by default.

Or maybe it should be a log-level type option, so you can set it to
error, warning, and also completely off?

#35

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#34)

5 attachment(s)

Re: ICU locale validation / canonicalization

On Fri, 2023-03-24 at 10:10 +0100, Peter Eisentraut wrote:

Couldn't we do this in a simpler way by just freeing the collator
before
the ereport() calls.

I committed a tiny patch to do this.

We still need to address the error inconsistency though. The problem is
that, in older ICU versions, if the fixup for "und@colNumeric=lower" ->
"root@colNumeric=lower" is applied, then icu_set_collation_attributes()
will throw an error reporting "root@colNumeric=lower", which is not
what the user typed.

We could fix that directly by passing the original string to
icu_set_collation_attributes() instead, or perhaps as an extra
parameter used only for the ereport().

I like the minor refactoring I did better, though. It puts the
ereports() close to each other, so any differences are more obvious.
And it seems cleaner to me for pg_ucol_open to close the UCollator
because it's the one that opened it. I don't have a strong opinion, but
that's my reasoning.

Or wrap a PG_TRY/PG_FINALLY around the whole thing?

I generally avoid PG_TRY/FINALLY unless it avoids some major
awkwardness or other problem.

It would be nicer to not make the callers of
icu_set_collation_attributes() responsible for catching and reporting
the errors.

There's only one caller now: pg_ucol_open().

[PATCH v8 2/4] initdb: emit message when using default ICU locale.

I'm not able to make initdb print this message. Under what
circumstances am I supposed to see this? Do you have some examples?

It happens when you don't specify --icu-locale. It is slightly
redundant with "ICU locale", but it lets you see that it came from the
environment rather than the command line:

-------------
$ initdb -D data
The files belonging to this database system will be owned by user
"someone".
This user must also own the server process.

Using default ICU locale "en_US_POSIX".
The database cluster will be initialized with this locale
configuration:
provider: icu
ICU locale: en_US_POSIX
...
-------------

That seems fairly useful for testing, etc., where initdb.log doesn't
show the command line options.

The function check_icu_locale() has now gotten a lot more
functionality
than its name suggests. Maybe the part that assigns the default ICU
locale should be moved up one level to setlocales(), which has a
better
name and does something similar for the libc locale realm.

Agreed, done.

In fact, initdb.c:check_icu_locale() is completely unnecessary in that
patch, because as the comment points out, the backend will try to open
it during post-bootstrap initialization. I think it was simply a
mistake to try to do this validation in commit 27b62377b4.

The later validation patch does do some better validation at initdb
time to make sure the language can be found.

[PATCH v8 3/4] Canonicalize ICU locale names to language tags.

I'm still on the fence about whether we actually want to do this, but
I'm warming up to it, now that the issues with pre-54 versions are
fixed.

But if we do this, the documentation needs to be updated. There is a
bunch of text there that says, like, you can do this format or that
format, whatever you like. At least the guidance should be changed
there.

[PATCH v8 4/4] Validate ICU locales.

I would make icu_locale_validation true by default.

Agreed. I considered also not having a GUC, but it seems like some kind
of escape hatch is wise, at least for now.

Or maybe it should be a log-level type option, so you can set it to
error, warning, and also completely off?

As the validation patch seems closer to acceptance, I changed it to be
before the canonicalization patch. New series attached.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v9-0001-Fix-error-inconsistency-in-older-ICU-versions.patchtext/x-patch; charset=UTF-8; name=v9-0001-Fix-error-inconsistency-in-older-ICU-versions.patchDownload

From 4fdcac30accb54725f05895cb5240af0bb98cb64 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 23 Mar 2023 21:50:47 -0700
Subject: [PATCH v9 1/5] Fix error inconsistency in older ICU versions.

To support older ICU versions, we rely on
icu_set_collation_attributes() to do error checking that is handled
directly by ucol_open() in newer ICU versions. Commit 3b50275b12
introduced a slight inconsistency, where the error report includes the
fixed-up locale string, rather than the locale string passed to
pg_ucol_open().

Refactor slightly so that pg_ucol_open() handles the errors from both
ucol_open() and icu_set_collation_attributes(), making it easier to
see any differences between the error reports. It also makes
pg_ucol_open() responsible for closing the UCollator on error, which
seems like the right place.

Discussion: https://postgr.es/m/04182066-7655-344a-b8b7-040b1b2490fb%40enterprisedb.com
Reviewed-by: Peter Eisentraut
---
 src/backend/utils/adt/pg_locale.c | 60 +++++++++++++++++--------------
 1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 386768ee76..3db27b34ba 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -147,7 +147,8 @@ static size_t uchar_length(UConverter *converter,
 static int32_t uchar_convert(UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, int32_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(UCollator *collator, const char *loc,
+										 UErrorCode *status);
 #endif
 
 /*
@@ -2503,6 +2504,7 @@ pg_ucol_open(const char *loc_str)
 {
 	UCollator  *collator;
 	UErrorCode	status;
+	const char *orig_str = loc_str;
 	char	   *fixed_str = NULL;
 
 	/*
@@ -2551,11 +2553,27 @@ pg_ucol_open(const char *loc_str)
 	collator = ucol_open(loc_str, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
+				/* use original string for error report */
 				(errmsg("could not open collator for locale \"%s\": %s",
-						loc_str, u_errorName(status))));
+						orig_str, u_errorName(status))));
 
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, loc_str);
+	{
+		status = U_ZERO_ERROR;
+		icu_set_collation_attributes(collator, loc_str, &status);
+
+		/*
+		 * Pretend the error came from ucol_open(), for consistent error
+		 * message across ICU versions.
+		 */
+		if (U_FAILURE(status))
+		{
+			ucol_close(collator);
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							orig_str, u_errorName(status))));
+		}
+	}
 
 	if (fixed_str != NULL)
 		pfree(fixed_str);
@@ -2705,9 +2723,9 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(UCollator *collator, const char *loc,
+							 UErrorCode *status)
 {
-	UErrorCode	status;
 	int32_t		len;
 	char	   *icu_locale_id;
 	char	   *lower_str;
@@ -2720,15 +2738,15 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 	 * locale ID, e.g. "und@colcaselevel=yes;colstrength=primary", by
 	 * uloc_canonicalize().
 	 */
-	status = U_ZERO_ERROR;
-	len = uloc_canonicalize(loc, NULL, 0, &status);
+	*status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, NULL, 0, status);
 	icu_locale_id = palloc(len + 1);
-	status = U_ZERO_ERROR;
-	len = uloc_canonicalize(loc, icu_locale_id, len + 1, &status);
-	if (U_FAILURE(status))
+	*status = U_ZERO_ERROR;
+	len = uloc_canonicalize(loc, icu_locale_id, len + 1, status);
+	if (U_FAILURE(*status))
 		ereport(ERROR,
 				(errmsg("canonicalization failed for locale string \"%s\": %s",
-						loc, u_errorName(status))));
+						loc, u_errorName(*status))));
 
 	lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
 
@@ -2750,7 +2768,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			UColAttribute uattr;
 			UColAttributeValue uvalue;
 
-			status = U_ZERO_ERROR;
+			*status = U_ZERO_ERROR;
 
 			*e = '\0';
 			name = token;
@@ -2800,22 +2818,12 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			else if (strcmp(value, "upper") == 0)
 				uvalue = UCOL_UPPER_FIRST;
 			else
-				status = U_ILLEGAL_ARGUMENT_ERROR;
-
-			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
-
-			/*
-			 * Pretend the error came from ucol_open(), for consistent error
-			 * message across ICU versions.
-			 */
-			if (U_FAILURE(status))
 			{
-				ucol_close(collator);
-				ereport(ERROR,
-						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+				*status = U_ILLEGAL_ARGUMENT_ERROR;
+				break;
 			}
+
+			ucol_setAttribute(collator, uattr, uvalue, status);
 		}
 	}
 
-- 
2.34.1

v9-0002-initdb-replace-check_icu_locale-with-default_icu_.patchtext/x-patch; charset=UTF-8; name=v9-0002-initdb-replace-check_icu_locale-with-default_icu_.patchDownload

From a9a899eb97231625e6b36d5d8d45fab4720852bc Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 24 Mar 2023 16:09:40 -0700
Subject: [PATCH v9 2/5] initdb: replace check_icu_locale() with
 default_icu_locale().

The extra checks done in check_icu_locale() are not necessary. An
existing comment already pointed out that the checks would be done
during post-bootstrap initialization, when the locale is opened by the
backend. This was a mistake in commit 27b62377b4.

This commit creates a simpler function default_icu_locale() to just
return the locale of the default collator.

Discussion: https://postgr.es/m/04182066-7655-344a-b8b7-040b1b2490fb%40enterprisedb.com
---
 src/bin/initdb/initdb.c | 61 +++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 30 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index bae97539fc..6c1641e77e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2242,49 +2242,47 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+#ifdef USE_ICU
+
 /*
- * Check that ICU accepts the locale name; or if not specified, retrieve the
- * default ICU locale.
+ * Determine default ICU locale by opening the default collator and reading
+ * its locale.
+ *
+ * NB: The default collator (opened using NULL) is different from the collator
+ * for the root locale (opened with "", "und", or "root"). The former depends
+ * on the environment (useful at initdb time) and the latter does not.
  */
-static void
-check_icu_locale(void)
+static char *
+default_icu_locale(void)
 {
-#ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	const char	*valid_locale;
+	char		*default_locale;
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = ucol_open(NULL, &status);
+	if (U_FAILURE(status))
+		pg_fatal("could not open collator for default locale: %s",
+				 u_errorName(status));
+
+	status = U_ZERO_ERROR;
+	valid_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+										&status);
 	if (U_FAILURE(status))
 	{
-		if (icu_locale)
-			pg_fatal("could not open collator for locale \"%s\": %s",
-					 icu_locale, u_errorName(status));
-		else
-			pg_fatal("could not open collator for default locale: %s",
-					 u_errorName(status));
+		ucol_close(collator);
+		pg_fatal("could not determine default ICU locale");
 	}
 
-	/* if not specified, get locale from default collator */
-	if (icu_locale == NULL)
-	{
-		const char	*default_locale;
+	default_locale = pg_strdup(valid_locale);
 
-		status = U_ZERO_ERROR;
-		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
-											  &status);
-		if (U_FAILURE(status))
-		{
-			ucol_close(collator);
-			pg_fatal("could not determine default ICU locale");
-		}
+	ucol_close(collator);
 
-		icu_locale = pg_strdup(default_locale);
-	}
+	return default_locale;
+}
 
-	ucol_close(collator);
 #endif
-}
 
 /*
  * set up the locale variables
@@ -2339,13 +2337,16 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
-		check_icu_locale();
+#ifdef USE_ICU
+		/* acquire default locale from the environment, if not specified */
+		if (icu_locale == NULL)
+			icu_locale = default_icu_locale();
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
 		 * backend during post-bootstrap initialization.
 		 */
-#ifndef USE_ICU
+#else
 		pg_fatal("ICU is not supported in this build");
 #endif
 	}
-- 
2.34.1

v9-0003-initdb-emit-message-when-using-default-ICU-locale.patchtext/x-patch; charset=UTF-8; name=v9-0003-initdb-emit-message-when-using-default-ICU-locale.patchDownload

From c5a2c2ce0f1dc3dc8d83f3d8d257c077b0fa4f96 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 24 Mar 2023 16:09:44 -0700
Subject: [PATCH v9 3/5] initdb: emit message when using default ICU locale.

Helpful to determine from test logs whether the locale came from the
environment or a command-line option.

Discussion: https://postgr.es/m/04182066-7655-344a-b8b7-040b1b2490fb%40enterprisedb.com
---
 src/bin/initdb/initdb.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 6c1641e77e..508c522fb7 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2340,7 +2340,10 @@ setlocales(void)
 #ifdef USE_ICU
 		/* acquire default locale from the environment, if not specified */
 		if (icu_locale == NULL)
+		{
 			icu_locale = default_icu_locale();
+			printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
+		}
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
-- 
2.34.1

v9-0004-Validate-ICU-locales.patchtext/x-patch; charset=UTF-8; name=v9-0004-Validate-ICU-locales.patchDownload

From b27a1a51ea381a5b37cc3b47a93c4d4bb850b727 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 17 Mar 2023 09:55:31 -0700
Subject: [PATCH v9 4/5] Validate ICU locales.

Ensure that it can be transformed into a language tag in "strict" mode
(which validates the attributes), that the language exists in ICU, and
that it can be opened.

Basic validation helps avoid minor mistakes and misspellings, which
often fall back to the root locale instead of the intended
locale. It's even more important to avoid such mistakes in ICU
versions 54 and earlier, where the same (misspelled) locale string
could fall back to different locales depending on the environment.

Discussion: https://postgr.es/m/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com
Discussion: https://postgr.es/m/df2efad0cae7c65180df8e5ebb709e5eb4f2a82b.camel@j-davis.com
---
 doc/src/sgml/config.sgml                      | 26 +++++++
 src/backend/commands/collationcmds.c          | 10 +++
 src/backend/commands/dbcommands.c             | 10 ++-
 src/backend/utils/adt/pg_locale.c             | 70 +++++++++++++++----
 src/backend/utils/misc/guc_tables.c           | 26 +++++++
 src/backend/utils/misc/postgresql.conf.sample |  3 +
 src/bin/initdb/initdb.c                       | 58 ++++++++++++++-
 src/bin/initdb/t/001_initdb.pl                | 18 +++++
 src/include/utils/pg_locale.h                 |  3 +-
 .../regress/expected/collate.icu.utf8.out     |  7 +-
 src/test/regress/sql/collate.icu.utf8.sql     |  5 +-
 11 files changed, 218 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 481f93cea1..56042a0da8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9784,6 +9784,32 @@ SET XML OPTION { DOCUMENT | CONTENT };
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-icu-validation-level" xreflabel="icu_validation_level">
+      <term><varname>icu_validation_level</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>icu_validation_level</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        When ICU locale validation problems are encountered, controls which
+        <link linkend="runtime-config-severity-levels">message level</link> is
+        used to report the problem. Valid values are
+        <literal>DISABLED</literal>, <literal>DEBUG5</literal>,
+        <literal>DEBUG4</literal>, <literal>DEBUG3</literal>,
+        <literal>DEBUG2</literal>, <literal>DEBUG1</literal>,
+        <literal>INFO</literal>, <literal>NOTICE</literal>,
+        <literal>WARNING</literal>, <literal>ERROR</literal>, and
+        <literal>LOG</literal>.
+       </para>
+       <para>
+        If set to <literal>DISABLED</literal>, does not report validation
+        problems at all. Otherwise reports problems at the given message
+        level. The default is <literal>ERROR</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-default-text-search-config" xreflabel="default_text_search_config">
       <term><varname>default_text_search_config</varname> (<type>string</type>)
       <indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 3d0aea0568..2b734807fb 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -254,10 +254,20 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		}
 		else if (collprovider == COLLPROVIDER_ICU)
 		{
+#ifdef USE_ICU
 			if (!colliculocale)
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
+
+			if (!IsBinaryUpgrade)
+				icu_validate_locale(colliculocale);
+
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
 		}
 
 		/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 4d5d5d6866..9cac2df70e 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 
 	if (dblocprovider == COLLPROVIDER_ICU)
 	{
+#ifdef USE_ICU
 		if (!(is_encoding_supported_by_icu(encoding)))
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -1058,7 +1059,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
-		check_icu_locale(dbiculocale);
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+			icu_validate_locale(dbiculocale);
+
+#else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ICU is not supported in this build")));
+#endif
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 3db27b34ba..251476ac89 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -95,6 +95,8 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+int			icu_validation_level = ERROR;
+
 /*
  * lc_time localization cache.
  *
@@ -2830,26 +2832,70 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 	pfree(lower_str);
 }
 
-#endif							/* USE_ICU */
-
 /*
- * Check if the given locale ID is valid, and ereport(ERROR) if it isn't.
+ * Perform best-effort check that the locale is a valid one.
  */
 void
-check_icu_locale(const char *icu_locale)
+icu_validate_locale(const char *loc_str)
 {
-#ifdef USE_ICU
-	UCollator  *collator;
+	UCollator	*collator;
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	bool		 found	 = false;
+
+	/* no validation */
+	if (icu_validation_level < 0)
+		return;
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		ereport(icu_validation_level,
+				(errmsg("could not get language from locale \"%s\": %s",
+						loc_str, u_errorName(status))));
+		return;
+	}
+
+	/* check for special language name */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		found = true;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; !found && i < uloc_countAvailable(); i++)
+	{
+		const char	*otherloc = uloc_getAvailable(i);
+		char		 otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			ereport(icu_validation_level,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			found = true;
+	}
 
-	collator = pg_ucol_open(icu_locale);
+	if (!found)
+		ereport(icu_validation_level,
+				(errmsg("locale \"%s\" has unknown language \"%s\"",
+						loc_str, lang)));
+
+	/* check that it can be opened */
+	collator = pg_ucol_open(loc_str);
 	ucol_close(collator);
-#else
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif
 }
 
+#endif							/* USE_ICU */
+
 /*
  * These functions convert from/to libc's wchar_t, *not* pg_wchar_t.
  * Therefore we keep them here rather than with the mbutils code.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..e9dd333f73 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -164,6 +164,22 @@ static const struct config_enum_entry intervalstyle_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry icu_validation_level_options[] = {
+	{"disabled", -1, false},
+	{"debug5", DEBUG5, false},
+	{"debug4", DEBUG4, false},
+	{"debug3", DEBUG3, false},
+	{"debug2", DEBUG2, false},
+	{"debug1", DEBUG1, false},
+	{"debug", DEBUG2, true},
+	{"log", LOG, false},
+	{"info", INFO, true},
+	{"notice", NOTICE, false},
+	{"warning", WARNING, false},
+	{"error", ERROR, false},
+	{NULL, 0, false}
+};
+
 StaticAssertDecl(lengthof(intervalstyle_options) == (INTSTYLE_ISO_8601 + 2),
 				 "array length mismatch");
 
@@ -4630,6 +4646,16 @@ struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"icu_validation_level", PGC_USERSET, CLIENT_CONN_LOCALE,
+		 gettext_noop("Log level for reporting invalid ICU locale strings."),
+		 NULL
+		},
+		&icu_validation_level,
+		ERROR, icu_validation_level_options,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"log_error_verbosity", PGC_SUSET, LOGGING_WHAT,
 			gettext_noop("Sets the verbosity of logged messages."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86f..d661b93f8c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -730,6 +730,9 @@
 #lc_numeric = 'C'			# locale for number formatting
 #lc_time = 'C'				# locale for time formatting
 
+#icu_validation_level = ERROR		# report ICU locale validation
+					# errors at the given level
+
 # default configuration for text search
 #default_text_search_config = 'pg_catalog.simple'
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 508c522fb7..4db9c310c8 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2244,6 +2244,58 @@ check_icu_locale_encoding(int user_enc)
 
 #ifdef USE_ICU
 
+/*
+ * Perform best-effort check that the locale is a valid one. Should be
+ * consistent with pg_locale.c, except that it doesn't need to open the
+ * collator (that will happen during post-bootstrap initialization).
+ */
+static void
+icu_validate_locale(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	bool		 found	 = false;
+
+	/* validate that we can extract the language */
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+		return;
+	}
+
+	/* check for special language name */
+	if (strcmp(lang, "") == 0 ||
+		strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
+		strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		found = true;
+
+	/* search for matching language within ICU */
+	for (int32_t i = 0; !found && i < uloc_countAvailable(); i++)
+	{
+		const char	*otherloc = uloc_getAvailable(i);
+		char		 otherlang[ULOC_LANG_CAPACITY];
+
+		status = U_ZERO_ERROR;
+		uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
+		if (U_FAILURE(status))
+		{
+			pg_fatal("could not get language from locale \"%s\": %s",
+					 loc_str, u_errorName(status));
+			continue;
+		}
+
+		if (strcmp(lang, otherlang) == 0)
+			found = true;
+	}
+
+	if (!found)
+		pg_fatal("locale \"%s\" has unknown language \"%s\"",
+				 loc_str, lang);
+}
+
 /*
  * Determine default ICU locale by opening the default collator and reading
  * its locale.
@@ -2345,9 +2397,11 @@ setlocales(void)
 			printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 		}
 
+		icu_validate_locale(icu_locale);
+
 		/*
-		 * In supported builds, the ICU locale ID will be checked by the
-		 * backend during post-bootstrap initialization.
+		 * In supported builds, the ICU locale ID will be opened during
+		 * post-bootstrap initialization, which will perform extra checks.
 		 */
 #else
 		pg_fatal("ICU is not supported in this build");
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index b97420f7e8..db7995fe28 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -128,6 +128,24 @@ if ($ENV{with_icu} eq 'yes')
 		],
 		qr/error: encoding mismatch/,
 		'fails for encoding not supported by ICU');
+
+	command_fails_like(
+		[
+			'initdb',                '--no-sync',
+			'--locale-provider=icu',
+			'--icu-locale=nonsense-nowhere', "$tempdir/dataX"
+		],
+		qr/error: locale "nonsense-nowhere" has unknown language "nonsense"/,
+		'fails for nonsense language');
+
+	command_fails_like(
+		[
+			'initdb',                '--no-sync',
+			'--locale-provider=icu',
+			'--icu-locale=@colNumeric=lower', "$tempdir/dataX"
+		],
+		qr/could not open collator for locale "\@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR/,
+		'fails for invalid collation argument');
 }
 else
 {
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index dd822a68be..d2252dc95e 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -40,6 +40,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT int	 icu_validation_level;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -121,8 +122,8 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern void icu_validate_locale(const char *loc_str);
 #endif
-extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
 extern size_t wchar2char(char *to, const wchar_t *from, size_t tolen,
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f135200c99..4158b3c15a 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1035,7 +1035,12 @@ END
 $$;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+ERROR:  locale "nonsense-nowhere" has unknown language "nonsense"
+SET icu_validation_level = WARNING;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+WARNING:  locale "nonsense-nowhere" has unknown language "nonsense"
+RESET icu_validation_level;
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8105ebc8ae..95d96f2eb8 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -371,7 +371,10 @@ BEGIN
 END
 $$;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
-CREATE COLLATION testx (provider = icu, locale = 'nonsense'); /* never fails with ICU */  DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+SET icu_validation_level = WARNING;
+CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+RESET icu_validation_level;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
-- 
2.34.1

v9-0005-Canonicalize-ICU-locale-names-to-language-tags.patchtext/x-patch; charset=UTF-8; name=v9-0005-Canonicalize-ICU-locale-names-to-language-tags.patchDownload

From d7959ebbf28e9d4054765c86525005f0d7707078 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v9 5/5] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
---
 src/backend/commands/collationcmds.c          | 51 ++++++------
 src/backend/commands/dbcommands.c             | 25 ++++++
 src/backend/utils/adt/pg_locale.c             | 79 +++++++++++++++++++
 src/bin/initdb/initdb.c                       | 77 ++++++++++++++++++
 src/bin/initdb/t/001_initdb.pl                |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 31 +++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 14 ++++
 9 files changed, 257 insertions(+), 27 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 2b734807fb..f34d888208 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -260,9 +265,31 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
 
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
 			if (!IsBinaryUpgrade)
-				icu_validate_locale(colliculocale);
+			{
+				char *langtag = icu_language_tag(colliculocale,
+												 icu_validation_level);
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+				else
+				{
+					ereport(WARNING,
+							(errmsg("could not convert locale \"%s\" to language tag",
+									colliculocale)));
+				}
 
+				icu_validate_locale(colliculocale);
+			}
 #else
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -586,26 +613,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -967,7 +974,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, ERROR);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9cac2df70e..e91b14f722 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1059,8 +1059,33 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
 		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			char *langtag = icu_language_tag(dbiculocale,
+											 icu_validation_level);
+
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+			else
+			{
+				ereport(WARNING,
+						(errmsg("could not convert locale \"%s\" to language tag",
+								dbiculocale)));
+			}
+
 			icu_validate_locale(dbiculocale);
+		}
 
 #else
 		ereport(ERROR,
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 251476ac89..a337b7a9c2 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2832,6 +2832,85 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 	pfree(lower_str);
 }
 
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, int elevel)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		if (elevel > 0)
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if ((status == U_BUFFER_OVERFLOW_ERROR ||
+			 (U_SUCCESS(status) && len >= buflen)) &&
+			buflen < MaxAllocSize)
+		{
+			buflen = Min(buflen * 2, MaxAllocSize);
+			langtag = repalloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+
+		if (elevel > 0)
+			ereport(elevel,
+					(errmsg("could not convert locale name \"%s\" to language tag: %s",
+							loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	return langtag;
+}
+
 /*
  * Perform best-effort check that the locale is a valid one.
  */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 4db9c310c8..cd19420c3b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2244,6 +2244,74 @@ check_icu_locale_encoding(int user_enc)
 
 #ifdef USE_ICU
 
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if (status == U_BUFFER_OVERFLOW_ERROR ||
+			(U_SUCCESS(status) && len >= buflen))
+		{
+			buflen = buflen * 2;
+			langtag = pg_realloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+}
+
 /*
  * Perform best-effort check that the locale is a valid one. Should be
  * consistent with pg_locale.c, except that it doesn't need to open the
@@ -2390,6 +2458,8 @@ setlocales(void)
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
 #ifdef USE_ICU
+		char *langtag;
+
 		/* acquire default locale from the environment, if not specified */
 		if (icu_locale == NULL)
 		{
@@ -2397,6 +2467,13 @@ setlocales(void)
 			printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 		}
 
+		/* canonicalize to a language tag */
+		langtag = icu_language_tag(icu_locale);
+		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+			   langtag, icu_locale);
+		pg_free(icu_locale);
+		icu_locale = langtag;
+
 		icu_validate_locale(icu_locale);
 
 		/*
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index db7995fe28..17a444d80c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -144,7 +144,7 @@ if ($ENV{with_icu} eq 'yes')
 			'--locale-provider=icu',
 			'--icu-locale=@colNumeric=lower', "$tempdir/dataX"
 		],
-		qr/could not open collator for locale "\@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR/,
+		qr/could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR/,
 		'fails for invalid collation argument');
 }
 else
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index a22f27f300..0b38c0537b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1837,9 +1837,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index d2252dc95e..78f7d5d0d9 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -122,6 +122,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
 extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+extern char *icu_language_tag(const char *loc_str, int elevel);
 extern void icu_validate_locale(const char *loc_str);
 #endif
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 4158b3c15a..bb75aa1af6 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1019,6 +1019,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1033,12 +1034,17 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+NOTICE:  using language tag "nonsense-nowhere" for locale "nonsense-nowhere"
 ERROR:  locale "nonsense-nowhere" has unknown language "nonsense"
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 SET icu_validation_level = WARNING;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+NOTICE:  using language tag "nonsense-nowhere" for locale "nonsense-nowhere"
 WARNING:  locale "nonsense-nowhere" has unknown language "nonsense"
 RESET icu_validation_level;
 CREATE COLLATION test4 FROM nonsense;
@@ -1167,14 +1173,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1182,7 +1192,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1190,13 +1202,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1204,10 +1219,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
 (1 row)
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR
+NOTICE:  using language tag "und-u-kn-lower" for locale "@colNumeric=lower"
+ERROR:  could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1216,6 +1233,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1243,10 +1261,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using language tag "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1296,7 +1317,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1311,6 +1334,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+NOTICE:  using language tag "en-u-ks-level1" for locale "en-u-ks-level1"
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  ?column? 
 ----------
@@ -1318,6 +1342,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
 (1 row)
 
 CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"
 SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
  ?column? 
 ----------
@@ -1778,7 +1803,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 95d96f2eb8..ed8a4b90ff 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -357,6 +357,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -370,8 +372,12 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
 SET icu_validation_level = WARNING;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 RESET icu_validation_level;
@@ -457,10 +463,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -470,7 +480,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -659,7 +671,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1

#36

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#35)

Re: ICU locale validation / canonicalization

[PATCH v9 1/5] Fix error inconsistency in older ICU versions.

[PATCH v9 2/5] initdb: replace check_icu_locale() with
default_icu_locale().

I would keep the #ifdef USE_ICU inside the lower-level function
default_icu_locale(), like it was before, so that the higher-level
setlocales() doesn't need to know about it.

Otherwise ok.

[PATCH v9 3/5] initdb: emit message when using default ICU locale.

[PATCH v9 4/5] Validate ICU locales.

Also here, let's keep the #ifdef USE_ICU in the lower-level function and
move more logic in there. Otherwise you have to repeat various things
in DefineCollation() and createdb().

I'm not sure we need the IsBinaryUpgrade checks. Users can set
icu_validation_level on the target instance if they don't want that.

#37

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#36)

1 attachment(s)

Re: ICU locale validation / canonicalization

On Tue, 2023-03-28 at 08:41 +0200, Peter Eisentraut wrote:

[PATCH v9 1/5] Fix error inconsistency in older ICU versions.

ok

Committed 0001.

[PATCH v9 2/5] initdb: replace check_icu_locale() with
default_icu_locale().

I would keep the #ifdef USE_ICU inside the lower-level function
default_icu_locale(), like it was before, so that the higher-level
setlocales() doesn't need to know about it.

Otherwise ok.

Done and committed 0002.

[PATCH v9 3/5] initdb: emit message when using default ICU locale.

Done and committed 0003.

[PATCH v9 4/5] Validate ICU locales.

Also here, let's keep the #ifdef USE_ICU in the lower-level function
and
move more logic in there. Otherwise you have to repeat various
things
in DefineCollation() and createdb().

Done.

I'm not sure we need the IsBinaryUpgrade checks. Users can set
icu_validation_level on the target instance if they don't want that.

I committed a version that still performs the checks during binary
upgrade, but degrades the message to a WARNING if it's set higher than
that. I tried some upgrades with invalid locales, and getting an error
deep in the logs after the upgrade actually starts is not very user-
friendly. We could add something during the --check phase, which would
be more helpful, but I didn't do that for this patch.

Attached is a new version of the final patch, which performs
canonicalization. I'm not 100% sure that it's wanted, but it still
seems like a good idea to get the locales into a standard format in the
catalogs, and if a lot more people start using ICU in v16 (because it's
the default), then it would be a good time to do it. But perhaps there
are risks?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v11-0001-Canonicalize-ICU-locale-names-to-language-tags.patchtext/x-patch; charset=UTF-8; name=v11-0001-Canonicalize-ICU-locale-names-to-language-tags.patchDownload

From 42900f1dc31d0fe171000265338a5325412e47cf Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v11] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
---
 src/backend/commands/collationcmds.c          | 46 +++++-----
 src/backend/commands/dbcommands.c             | 20 +++++
 src/backend/utils/adt/pg_locale.c             | 85 +++++++++++++++++++
 src/bin/initdb/initdb.c                       | 81 ++++++++++++++++++
 src/bin/initdb/t/001_initdb.pl                |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 33 ++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 15 ++++
 9 files changed, 261 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 45de78352c..6b891e8a47 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -259,6 +264,25 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
 
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+				char *langtag = icu_language_tag(colliculocale,
+												 icu_validation_level);
+
+				if (langtag)
+				{
+					ereport(NOTICE,
+							(errmsg("using language tag \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+			}
+
 			icu_validate_locale(colliculocale);
 		}
 
@@ -569,26 +593,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -950,7 +954,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, ERROR);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 24bcc5adfe..ca69d4d6dc 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1058,6 +1058,26 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			char *langtag = icu_language_tag(dbiculocale,
+											 icu_validation_level);
+
+			if (langtag)
+			{
+				ereport(NOTICE,
+						(errmsg("using language tag \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+		}
+
 		icu_validate_locale(dbiculocale);
 	}
 	else
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 9497c20d12..06e73aa012 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2826,6 +2826,91 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 
 #endif
 
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, int elevel)
+{
+#ifdef USE_ICU
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		if (elevel > 0)
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if ((status == U_BUFFER_OVERFLOW_ERROR ||
+			 (U_SUCCESS(status) && len >= buflen)) &&
+			buflen < MaxAllocSize)
+		{
+			buflen = Min(buflen * 2, MaxAllocSize);
+			langtag = repalloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+
+		if (elevel > 0)
+			ereport(elevel,
+					(errmsg("could not convert locale name \"%s\" to language tag: %s",
+							loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	return langtag;
+#else							/* not USE_ICU */
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ICU is not supported in this build")));
+#endif							/* not USE_ICU */
+}
+
 /*
  * Perform best-effort check that the locale is a valid one.
  */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 208ddc9b30..4814c1c405 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2229,6 +2229,78 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+#ifdef USE_ICU
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if (status == U_BUFFER_OVERFLOW_ERROR ||
+			(U_SUCCESS(status) && len >= buflen))
+		{
+			buflen = buflen * 2;
+			langtag = pg_realloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+#else
+	pg_fatal("ICU is not supported in this build");
+#endif
+}
+
 /*
  * Perform best-effort check that the locale is a valid one. Should be
  * consistent with pg_locale.c, except that it doesn't need to open the
@@ -2376,6 +2448,8 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
+		char *langtag;
+
 		/* acquire default locale from the environment, if not specified */
 		if (icu_locale == NULL)
 		{
@@ -2383,6 +2457,13 @@ setlocales(void)
 			printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 		}
 
+		/* canonicalize to a language tag */
+		langtag = icu_language_tag(icu_locale);
+		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+			   langtag, icu_locale);
+		pg_free(icu_locale);
+		icu_locale = langtag;
+
 		icu_validate_locale(icu_locale);
 
 		/*
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index db7995fe28..17a444d80c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -144,7 +144,7 @@ if ($ENV{with_icu} eq 'yes')
 			'--locale-provider=icu',
 			'--icu-locale=@colNumeric=lower', "$tempdir/dataX"
 		],
-		qr/could not open collator for locale "\@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR/,
+		qr/could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR/,
 		'fails for invalid collation argument');
 }
 else
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 42215f82f7..df26ba42d6 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1860,9 +1860,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index c275427976..8c095abc52 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -120,6 +120,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
 extern void icu_validate_locale(const char *loc_str);
+extern char *icu_language_tag(const char *loc_str, int elevel);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 5e480d45cd..a2db34e163 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1019,6 +1019,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1033,13 +1034,20 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+NOTICE:  using language tag "nonsense-nowhere" for locale "nonsense-nowhere"
 ERROR:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 SET icu_validation_level = WARNING;
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
+WARNING:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+NOTICE:  using language tag "nonsense-nowhere" for locale "nonsense-nowhere"
 WARNING:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
 RESET icu_validation_level;
@@ -1169,14 +1177,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1184,7 +1196,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using language tag "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1192,13 +1206,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using language tag "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1206,10 +1223,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
 (1 row)
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR
+NOTICE:  using language tag "und-u-kn-lower" for locale "@colNumeric=lower"
+ERROR:  could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using language tag "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1218,6 +1237,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1245,10 +1265,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using language tag "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using language tag "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1298,7 +1321,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using language tag "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using language tag "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1313,6 +1338,7 @@ SELECT 'abc' <= 'ABC' COLLATE case_insensitive, 'abc' >= 'ABC' COLLATE case_inse
 
 -- test language tags
 CREATE COLLATION lt_insensitive (provider = icu, locale = 'en-u-ks-level1', deterministic = false);
+NOTICE:  using language tag "en-u-ks-level1" for locale "en-u-ks-level1"
 SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
  ?column? 
 ----------
@@ -1320,6 +1346,7 @@ SELECT 'aBcD' COLLATE lt_insensitive = 'AbCd' COLLATE lt_insensitive;
 (1 row)
 
 CREATE COLLATION lt_upperfirst (provider = icu, locale = 'und-u-kf-upper');
+NOTICE:  using language tag "und-u-kf-upper" for locale "und-u-kf-upper"
 SELECT 'Z' COLLATE lt_upperfirst < 'z' COLLATE lt_upperfirst;
  ?column? 
 ----------
@@ -1780,7 +1807,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 95d96f2eb8..85e26951b6 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -357,6 +357,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -370,9 +372,14 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
 SET icu_validation_level = WARNING;
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 RESET icu_validation_level;
 
@@ -457,10 +464,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -470,7 +481,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -659,7 +672,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1

#38

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#37)

Re: ICU locale validation / canonicalization

On 30.03.23 04:33, Jeff Davis wrote:

Attached is a new version of the final patch, which performs
canonicalization. I'm not 100% sure that it's wanted, but it still
seems like a good idea to get the locales into a standard format in the
catalogs, and if a lot more people start using ICU in v16 (because it's
the default), then it would be a good time to do it. But perhaps there
are risks?

I say, let's do it.

I don't think we should show the notice when the canonicalization
doesn't change anything. This is not useful:

+NOTICE: using language tag "und-u-kf-upper" for locale "und-u-kf-upper"

Also, the message should be phrased more from the perspective of the
user instead of using ICU jargon, like

NOTICE: using canonicalized form "%s" for locale specification "%s"

(Still too many big words?)

I don't think the special handling of IsBinaryUpgrade is needed or
wanted. I would hope that with this feature, all old-style locale IDs
would go away, but this way we would keep them forever. If we believe
that canonicalization is safe, then I don't see why we cannot apply it
during binary upgrade.

Needs documentation updates in doc/src/sgml/charset.sgml.

#39

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#38)

Re: ICU locale validation / canonicalization

On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote:

I don't think the special handling of IsBinaryUpgrade is needed or
wanted. I would hope that with this feature, all old-style locale
IDs
would go away, but this way we would keep them forever. If we
believe
that canonicalization is safe, then I don't see why we cannot apply
it
during binary upgrade.

There are two issues:

1. Failures can occur. For instance, if an invalid attribute is used,
like '@collStrength=primary', then we can't canonicalize it (or if we
do, it could end up being not what the user intended).

2. Version 15 and earlier have a subtle bug: it passes the raw locale
straight to ucol_open(), and if the locale is "fr_CA.UTF-8" ucol_open()
mis-parses it to have language "fr" with no region. If you canonicalize
first, it properly parses the locale and produces "fr-CA", which
results in a different collator. The 15 behavior is wrong, and this
canonicalization patch will fix it, but it doesn't do so during
pg_upgrade because that could change the collator and corrupt an index.

The current patch deals with these problems by simply preserving the
locale (valid or not) during pg_upgrade, and only canonicalizing new
collations and databases (so #2 is only fixed for new
collations/databases). I think that's a good trade-off because a lot
more users will be on ICU now that it's the default, so let's avoid
creating more of the problem cases for those new users.

To get to perfectly-canonicalized catalogs for upgrades from earlier
versions:

* We need a way to detect #2, which I posted some code for in an
uncommitted revision[1]See check_equivalent_icu_locales() and calling code here: /messages/by-id/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com of this patch series.

* We need a way to detect #1 and #2 during the pg_upgrade --check
phase.

* We need actions that the user can take to correct the problems. I
have some ideas but they could use some dicsussion.

I'm not sure all of those will be ready for v16, though.

Regards,
Jeff Davis

[1]: See check_equivalent_icu_locales() and calling code here: /messages/by-id/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
/messages/by-id/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com

#40

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#38)

1 attachment(s)

Re: ICU locale validation / canonicalization

On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote:

I don't think we should show the notice when the canonicalization
doesn't change anything. This is not useful:

+NOTICE: using language tag "und-u-kf-upper" for locale "und-u-kf-
upper"

Done.

Also, the message should be phrased more from the perspective of the
user instead of using ICU jargon, like

NOTICE: using canonicalized form "%s" for locale specification "%s"

(Still too many big words?)

Changed to:

NOTICE: using standard form "%s" for locale "%s"

Needs documentation updates in doc/src/sgml/charset.sgml.

I made a very minor update. Do you have something more specific in
mind?

Regards,
Jeff Davis

Attachments:

v11-0001-Canonicalize-ICU-locale-names-to-language-tags.patchtext/x-patch; charset=UTF-8; name=v11-0001-Canonicalize-ICU-locale-names-to-language-tags.patchDownload

From 0069b0a6e6972c91445f8b07b42aff519159c0e8 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 15 Mar 2023 12:37:06 -0700
Subject: [PATCH v11] Canonicalize ICU locale names to language tags.

Convert to BCP47 language tags before storing in the catalog, except
during binary upgrade or when the locale comes from an existing
collation or template database.

Canonicalization is important, because it's able to handle more kinds
of locale strings than ucol_open(). Without canonicalizing first, a
locale string like "fr_CA.UTF-8" will be misinterpreted by
ucol_open().

The resulting language tags can vary slightly between ICU
versions. For instance, "@colBackwards=yes" is converted to
"und-u-kb-true" in older versions of ICU, and to the simpler (but
equivalent) "und-u-kb" in newer versions.

Discussion: https://postgr.es/m/8c7af6820aed94dc7bc259d2aa7f9663518e6137.camel@j-davis.com
---
 doc/src/sgml/charset.sgml                     |  2 +-
 src/backend/commands/collationcmds.c          | 46 +++++-----
 src/backend/commands/dbcommands.c             | 20 +++++
 src/backend/utils/adt/pg_locale.c             | 85 +++++++++++++++++++
 src/bin/initdb/initdb.c                       | 81 ++++++++++++++++++
 src/bin/initdb/t/001_initdb.pl                |  2 +-
 src/bin/pg_dump/t/002_pg_dump.pl              |  4 +-
 src/include/utils/pg_locale.h                 |  1 +
 .../regress/expected/collate.icu.utf8.out     | 29 ++++++-
 src/test/regress/sql/collate.icu.utf8.sql     | 15 ++++
 10 files changed, 258 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 12fabb7372..6dd95b8966 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -893,7 +893,7 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
         The first example selects the ICU locale using a <quote>language
         tag</quote> per BCP 47.  The second example uses the traditional
         ICU-specific locale syntax.  The first style is preferred going
-        forward, but it is not supported by older ICU versions.
+        forward, and is used internally to store locales.
        </para>
        <para>
         Note that you can name the collation objects in the SQL environment
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 45de78352c..c91fe66d9b 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -165,6 +165,11 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		else
 			colliculocale = NULL;
 
+		/*
+		 * When the ICU locale comes from an existing collation, do not
+		 * canonicalize to a language tag.
+		 */
+
 		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 		if (!isnull)
 			collicurules = TextDatumGetCString(datum);
@@ -259,6 +264,25 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
 
+			/*
+			 * During binary upgrade, preserve the locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade)
+			{
+				char *langtag = icu_language_tag(colliculocale,
+												 icu_validation_level);
+
+				if (langtag && strcmp(colliculocale, langtag) != 0)
+				{
+					ereport(NOTICE,
+							(errmsg("using standard form \"%s\" for locale \"%s\"",
+									langtag, colliculocale)));
+
+					colliculocale = langtag;
+				}
+			}
+
 			icu_validate_locale(colliculocale);
 		}
 
@@ -569,26 +593,6 @@ cmpaliases(const void *a, const void *b)
 
 
 #ifdef USE_ICU
-/*
- * Get the ICU language tag for a locale name.
- * The result is a palloc'd string.
- */
-static char *
-get_icu_language_tag(const char *localename)
-{
-	char		buf[ULOC_FULLNAME_CAPACITY];
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
-
-	return pstrdup(buf);
-}
-
 /*
  * Get a comment (specifically, the display name) for an ICU locale.
  * The result is a palloc'd string, or NULL if we can't get a comment
@@ -950,7 +954,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			else
 				name = uloc_getAvailable(i);
 
-			langtag = get_icu_language_tag(name);
+			langtag = icu_language_tag(name, ERROR);
 
 			/*
 			 * Be paranoid about not allowing any non-ASCII strings into
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 24bcc5adfe..2e242eeff2 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1058,6 +1058,26 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
+		/*
+		 * During binary upgrade, or when the locale came from the template
+		 * database, preserve locale string. Otherwise, canonicalize to a
+		 * language tag.
+		 */
+		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		{
+			char *langtag = icu_language_tag(dbiculocale,
+											 icu_validation_level);
+
+			if (langtag && strcmp(dbiculocale, langtag) != 0)
+			{
+				ereport(NOTICE,
+						(errmsg("using standard form \"%s\" for locale \"%s\"",
+								langtag, dbiculocale)));
+
+				dbiculocale = langtag;
+			}
+		}
+
 		icu_validate_locale(dbiculocale);
 	}
 	else
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 9497c20d12..06e73aa012 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2826,6 +2826,91 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
 
 #endif
 
+/*
+ * Return the BCP47 language tag representation of the requested locale.
+ *
+ * This function should be called before passing the string to ucol_open(),
+ * because conversion to a language tag also performs "level 2
+ * canonicalization". In addition to producing a consistent format, level 2
+ * canonicalization is able to more accurately interpret different input
+ * locale string formats, such as POSIX and .NET IDs.
+ */
+char *
+icu_language_tag(const char *loc_str, int elevel)
+{
+#ifdef USE_ICU
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		if (elevel > 0)
+			ereport(elevel,
+					(errmsg("could not get language from locale \"%s\": %s",
+							loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = palloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if ((status == U_BUFFER_OVERFLOW_ERROR ||
+			 (U_SUCCESS(status) && len >= buflen)) &&
+			buflen < MaxAllocSize)
+		{
+			buflen = Min(buflen * 2, MaxAllocSize);
+			langtag = repalloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pfree(langtag);
+
+		if (elevel > 0)
+			ereport(elevel,
+					(errmsg("could not convert locale name \"%s\" to language tag: %s",
+							loc_str, u_errorName(status))));
+		return NULL;
+	}
+
+	return langtag;
+#else							/* not USE_ICU */
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("ICU is not supported in this build")));
+#endif							/* not USE_ICU */
+}
+
 /*
  * Perform best-effort check that the locale is a valid one.
  */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 208ddc9b30..4814c1c405 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2229,6 +2229,78 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+/*
+ * Convert to canonical BCP47 language tag. Must be consistent with
+ * icu_language_tag().
+ */
+static char *
+icu_language_tag(const char *loc_str)
+{
+#ifdef USE_ICU
+	UErrorCode	 status;
+	char		 lang[ULOC_LANG_CAPACITY];
+	char		*langtag;
+	size_t		 buflen = 32;	/* arbitrary starting buffer size */
+	const bool	 strict = true;
+
+	status = U_ZERO_ERROR;
+	uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
+	if (U_FAILURE(status))
+	{
+		pg_fatal("could not get language from locale \"%s\": %s",
+				 loc_str, u_errorName(status));
+		return NULL;
+	}
+
+	/* C/POSIX locales aren't handled by uloc_getLanguageTag() */
+	if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+		return pstrdup("en-US-u-va-posix");
+
+	/*
+	 * A BCP47 language tag doesn't have a clearly-defined upper limit
+	 * (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
+	 * uloc_toLanguageTag() doesn't always return the ultimate length on the
+	 * first call, necessitating a loop.
+	 */
+	langtag = pg_malloc(buflen);
+	while (true)
+	{
+		int32_t		len;
+
+		status = U_ZERO_ERROR;
+		len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+		/*
+		 * If the result fits in the buffer exactly (len == buflen),
+		 * uloc_toLanguageTag() will return success without nul-terminating
+		 * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
+		 * buflen and try again.
+		 */
+		if (status == U_BUFFER_OVERFLOW_ERROR ||
+			(U_SUCCESS(status) && len >= buflen))
+		{
+			buflen = buflen * 2;
+			langtag = pg_realloc(langtag, buflen);
+			continue;
+		}
+
+		break;
+	}
+
+	if (U_FAILURE(status))
+	{
+		pg_free(langtag);
+
+		pg_fatal("could not convert locale name \"%s\" to language tag: %s",
+				 loc_str, u_errorName(status));
+	}
+
+	return langtag;
+#else
+	pg_fatal("ICU is not supported in this build");
+#endif
+}
+
 /*
  * Perform best-effort check that the locale is a valid one. Should be
  * consistent with pg_locale.c, except that it doesn't need to open the
@@ -2376,6 +2448,8 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
+		char *langtag;
+
 		/* acquire default locale from the environment, if not specified */
 		if (icu_locale == NULL)
 		{
@@ -2383,6 +2457,13 @@ setlocales(void)
 			printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
 		}
 
+		/* canonicalize to a language tag */
+		langtag = icu_language_tag(icu_locale);
+		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+			   langtag, icu_locale);
+		pg_free(icu_locale);
+		icu_locale = langtag;
+
 		icu_validate_locale(icu_locale);
 
 		/*
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index db7995fe28..17a444d80c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -144,7 +144,7 @@ if ($ENV{with_icu} eq 'yes')
 			'--locale-provider=icu',
 			'--icu-locale=@colNumeric=lower', "$tempdir/dataX"
 		],
-		qr/could not open collator for locale "\@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR/,
+		qr/could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR/,
 		'fails for invalid collation argument');
 }
 else
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 42215f82f7..df26ba42d6 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1860,9 +1860,9 @@ my %tests = (
 
 	'CREATE COLLATION icu_collation' => {
 		create_order => 76,
-		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'C');",
+		create_sql   => "CREATE COLLATION icu_collation (PROVIDER = icu, LOCALE = 'en-US-u-va-posix');",
 		regexp =>
-		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'C'(, version = '[^']*')?\);/m,
+		  qr/CREATE COLLATION public.icu_collation \(provider = icu, locale = 'en-US-u-va-posix'(, version = '[^']*')?\);/m,
 		icu => 1,
 		like      => { %full_runs, section_pre_data => 1, },
 	},
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index c275427976..8c095abc52 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -120,6 +120,7 @@ extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
 extern void icu_validate_locale(const char *loc_str);
+extern char *icu_language_tag(const char *loc_str, int elevel);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 5e480d45cd..b5a221b030 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1019,6 +1019,7 @@ reset enable_seqscan;
 CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1033,12 +1034,17 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+RESET client_min_messages;
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 ERROR:  parameter "locale" must be specified
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
 ERROR:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 SET icu_validation_level = WARNING;
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
+WARNING:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 WARNING:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
@@ -1169,14 +1175,18 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test ICU collation customization
 -- test the attributes handled by icu_set_collation_attributes()
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
  ?column? | ?column? 
 ----------+----------
@@ -1184,7 +1194,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
 (1 row)
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
+NOTICE:  using standard form "und-u-kf-lower" for locale "@colCaseFirst=lower"
 CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
+NOTICE:  using standard form "und-u-kf-upper" for locale "@colCaseFirst=upper"
 SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
  ?column? | ?column? 
 ----------+----------
@@ -1192,13 +1204,16 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 (1 row)
 
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
+NOTICE:  using standard form "und-u-ka-shifted" for locale "@colAlternate=shifted"
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
  ?column? | ?column? 
 ----------+----------
  t        | t
 (1 row)
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
  ?column? | ?column? 
 ----------+----------
@@ -1206,10 +1221,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
 (1 row)
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-ERROR:  could not open collator for locale "@colNumeric=lower": U_ILLEGAL_ARGUMENT_ERROR
+NOTICE:  using standard form "und-u-kn-lower" for locale "@colNumeric=lower"
+ERROR:  could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
 -- test that attributes not handled by icu_set_collation_attributes()
 -- (handled by ucol_open() directly) also work
 CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
+NOTICE:  using standard form "de-u-co-phonebk" for locale "de@collation=phonebook"
 SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
  ?column? | ?column? 
 ----------+----------
@@ -1218,6 +1235,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
 
 -- rules
 CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
+NOTICE:  using standard form "und" for locale ""
 CREATE TABLE test7 (a text);
 -- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
 INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1245,10 +1263,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
 
 DROP TABLE test7;
 CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-ERROR:  could not open collator for locale "" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
+NOTICE:  using standard form "und" for locale ""
+ERROR:  could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
 -- nondeterministic collations
 CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
+NOTICE:  using standard form "und" for locale ""
 CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
+NOTICE:  using standard form "und" for locale ""
 CREATE TABLE test6 (a int, b text);
 -- same string in different normal forms
 INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1298,7 +1319,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
 (2 rows)
 
 CREATE COLLATION case_sensitive (provider = icu, locale = '');
+NOTICE:  using standard form "und" for locale ""
 CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
+NOTICE:  using standard form "und-u-ks-level2" for locale "@colStrength=secondary"
 SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
  ?column? | ?column? 
 ----------+----------
@@ -1780,7 +1803,9 @@ SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 (2 rows)
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
 SELECT * FROM test4 WHERE b = 'cote';
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 95d96f2eb8..85e26951b6 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -357,6 +357,8 @@ CREATE ROLE regress_test_role;
 CREATE SCHEMA test_schema;
 
 -- We need to do this this way to cope with varying names for encodings:
+SET client_min_messages TO WARNING;
+
 do $$
 BEGIN
   EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -370,9 +372,14 @@ BEGIN
           quote_literal(current_setting('lc_collate')) || ');';
 END
 $$;
+
+RESET client_min_messages;
+
 CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
 SET icu_validation_level = WARNING;
+CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 RESET icu_validation_level;
 
@@ -457,10 +464,14 @@ SELECT * FROM collate_test2 ORDER BY b COLLATE UNICODE;
 
 -- test the attributes handled by icu_set_collation_attributes()
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes');
+RESET client_min_messages;
 SELECT 'aaá' > 'AAA' COLLATE "und-x-icu", 'aaá' < 'AAA' COLLATE testcoll_ignore_accents;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_backwards (provider = icu, locale = '@colBackwards=yes');
+RESET client_min_messages;
 SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll_backwards;
 
 CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
@@ -470,7 +481,9 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
 CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
 SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
 
+SET client_min_messages=WARNING;
 CREATE COLLATION testcoll_numeric (provider = icu, locale = '@colNumeric=yes');
+RESET client_min_messages;
 SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_numeric;
 
 CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
@@ -659,7 +672,9 @@ INSERT INTO inner_text VALUES ('a', NULL);
 SELECT * FROM outer_text WHERE (f1, f2) NOT IN (SELECT * FROM inner_text);
 
 -- accents
+SET client_min_messages=WARNING;
 CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false);
+RESET client_min_messages;
 
 CREATE TABLE test4 (a int, b text);
 INSERT INTO test4 VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
-- 
2.34.1

#41

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#40)

Re: ICU locale validation / canonicalization

On 31.03.23 12:11, Jeff Davis wrote:

On Thu, 2023-03-30 at 08:59 +0200, Peter Eisentraut wrote:

I don't think we should show the notice when the canonicalization
doesn't change anything. This is not useful:

+NOTICE: using language tag "und-u-kf-upper" for locale "und-u-kf-
upper"

Done.

Also, the message should be phrased more from the perspective of the
user instead of using ICU jargon, like

NOTICE: using canonicalized form "%s" for locale specification "%s"

(Still too many big words?)

Changed to:

NOTICE: using standard form "%s" for locale "%s"

Needs documentation updates in doc/src/sgml/charset.sgml.

I made a very minor update. Do you have something more specific in
mind?

This all looks good to me.

#42

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Peter Eisentraut (#41)

Re: ICU locale validation / canonicalization

MSVC now says this on master:

[17:48:12.446] c:\cirrus\src\backend\utils\adt\pg_locale.c(2912) :
warning C4715: 'icu_language_tag': not all control paths return a
value

CI doesn't currently fail for MSVC warnings, so it's a bit hidden.
FWIW cfbot does show this with a ⚠ sign with its new system for
grovelling through logs, which will now show up on every entry now
that this warning is in master.

#43

Noah Misch

noah@leadboat.com

over 2 years ago

In reply to: Peter Eisentraut (#38)

Re: ICU locale validation / canonicalization

On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote:

On 30.03.23 04:33, Jeff Davis wrote:

Attached is a new version of the final patch, which performs
canonicalization. I'm not 100% sure that it's wanted, but it still
seems like a good idea to get the locales into a standard format in the
catalogs, and if a lot more people start using ICU in v16 (because it's
the default), then it would be a good time to do it. But perhaps there
are risks?

I say, let's do it.

The following is not cause for postgresql.git changes at this time, but I'm
sharing it in case it saves someone else the study effort. Commit ea1db8a
("Canonicalize ICU locale names to language tags.") slowed buildfarm member
hoverfly, but that disappears if I drop debug_parallel_query from its config.
Typical end-to-end duration rose from 2h5m to 2h55m. Most-affected were
installcheck runs, which rose from 11m to 19m. (The "check" stage uses
NO_LOCALE=1, so it changed less.) From profiles, my theory is that each of
the many parallel workers burns notable CPU and I/O opening its ICU collator
for the first time. debug_parallel_query, by design, pursues parallelism
independent of cost, so this is working as intended. If it ever matters in
non-debug configurations, we might raise the default parallel_setup_cost or
pre-load ICU collators in the postmaster.

#44

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Noah Misch (#43)

Re: ICU locale validation / canonicalization

On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote:

On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote:

On 30.03.23 04:33, Jeff Davis wrote:

Attached is a new version of the final patch, which performs
canonicalization. I'm not 100% sure that it's wanted, but it
still
seems like a good idea to get the locales into a standard format
in the
catalogs, and if a lot more people start using ICU in v16
(because it's
the default), then it would be a good time to do it. But perhaps
there
are risks?

I say, let's do it.

The following is not cause for postgresql.git changes at this time,
but I'm
sharing it in case it saves someone else the study effort. Commit
ea1db8a
("Canonicalize ICU locale names to language tags.") slowed buildfarm
member
hoverfly, but that disappears if I drop debug_parallel_query from its
config.
Typical end-to-end duration rose from 2h5m to 2h55m. Most-affected
were
installcheck runs, which rose from 11m to 19m. (The "check" stage
uses
NO_LOCALE=1, so it changed less.) From profiles, my theory is that
each of
the many parallel workers burns notable CPU and I/O opening its ICU
collator
for the first time.

I didn't repro the overall test timings (mine is ~1m40s compared to
~11-19m on hoverfly) but I think a microbenchmark on the ICU calls
showed a possible cause.

I ran open in a loop 10M times on the requested locale. The root locale
("und"[1]It appears that "und" is also slow to open in ICU < 64. Hoverfly is on v58, so it's possible that's the problem if daticulocale=und., "root" and "") take about 1.3s to open 10M times; simple
locales like 'en' and 'fr-CA' and 'de-DE' are all a little shower at
3.3s.

Unrecognized locales like "xyz" take about 10 times as long: 13s to
open 10M times, presumably to perform the fallback logic that
ultimately opens the root locale. Not sure if 10X slower in the open
path is enough to explain the overall test slowdown.

My guess is that the ICU locale for these tests is not recognized, or
is some other locale that opens slowly. Can you tell me the actual
daticulocale?

Regards,
Jeff Davis

[1]: It appears that "und" is also slow to open in ICU < 64. Hoverfly is on v58, so it's possible that's the problem if daticulocale=und.
on v58, so it's possible that's the problem if daticulocale=und.

#45

Noah Misch

noah@leadboat.com

over 2 years ago

In reply to: Jeff Davis (#44)

Re: ICU locale validation / canonicalization

On Sat, May 20, 2023 at 10:19:30AM -0700, Jeff Davis wrote:

On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote:

On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote:

On 30.03.23 04:33, Jeff Davis wrote:

Attached is a new version of the final patch, which performs
canonicalization. I'm not 100% sure that it's wanted, but it
still
seems like a good idea to get the locales into a standard format
in the
catalogs, and if a lot more people start using ICU in v16
(because it's
the default), then it would be a good time to do it. But perhaps
there
are risks?

I say, let's do it.

The following is not cause for postgresql.git changes at this time,
but I'm
sharing it in case it saves someone else the study effort.ï¿½ Commit
ea1db8a
("Canonicalize ICU locale names to language tags.") slowed buildfarm
member
hoverfly, but that disappears if I drop debug_parallel_query from its
config.
Typical end-to-end duration rose from 2h5m to 2h55m.ï¿½ Most-affected
were
installcheck runs, which rose from 11m to 19m.ï¿½ (The "check" stage
uses
NO_LOCALE=1, so it changed less.)ï¿½ From profiles, my theory is that
each of
the many parallel workers burns notable CPU and I/O opening its ICU
collator
for the first time.

I didn't repro the overall test timings (mine is ~1m40s compared to
~11-19m on hoverfly) but I think a microbenchmark on the ICU calls
showed a possible cause.

I ran open in a loop 10M times on the requested locale. The root locale
("und"[1], "root" and "") take about 1.3s to open 10M times; simple
locales like 'en' and 'fr-CA' and 'de-DE' are all a little shower at
3.3s.

Unrecognized locales like "xyz" take about 10 times as long: 13s to
open 10M times, presumably to perform the fallback logic that
ultimately opens the root locale. Not sure if 10X slower in the open
path is enough to explain the overall test slowdown.

My guess is that the ICU locale for these tests is not recognized, or
is some other locale that opens slowly. Can you tell me the actual
daticulocale?

As of commit b8c3f6d, InstallCheck-C got daticulocale=en-US-u-va-posix. Check
got daticulocale=NULL.

(The machine in question was unusable for PostgreSQL from 2023-05-12 to
2023-06-30, due to https://stackoverflow.com/q/76369660/16371536. That
delayed my response.)

#46

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Noah Misch (#45)

Re: ICU locale validation / canonicalization

On Sat, 2023-07-01 at 10:31 -0700, Noah Misch wrote:

As of commit b8c3f6d, InstallCheck-C got daticulocale=en-US-u-va-
posix. Check
got daticulocale=NULL.

With the same test setup, that locale takes about 8.6 seconds (opening
it 10M times), about 2.5X slower than "en-US" and about 7X slower than
"und". I think that explains it.

The locale "en-US-u-va-posix" normally happens when passing a locale
beginning with "C" to ICU. After 2535c74b1a we don't get ICU locales
from the environment anywhere, so that should be rare (and probably
indicates a user mistake). I don't think this is a practical problem
any more.

Regards,
Jeff Davis