pg_collation.collversion for C.UTF-8

Started by Daniel Veriteover 2 years ago22 messages

daniel@manitou-mail.org

over 2 years ago

1 attachment(s)

Hi,

get_collation_actual_version() in pg_locale.c currently
excludes C.UTF-8 (and more generally C.*) from versioning,
which makes pg_collation.collversion being empty for these
collations.

char *
get_collation_actual_version(char collprovider, const char *collcollate)
{
....
if (collprovider == COLLPROVIDER_LIBC &&
pg_strcasecmp("C", collcollate) != 0 &&
pg_strncasecmp("C.", collcollate, 2) != 0 &&
pg_strcasecmp("POSIX", collcollate) != 0)

This seems to be based on the idea that C.* collations provide an
immutable sort like "C", but it appears that it's not the case.

For instance, consider how these C.UTF-8 comparisons differ between
recent linux systems:

U+1D400 = Mathematical Bold Capital A

Debian 9.13 (glibc 2.24)
=> select 'A' < E'\U0001D400' collate "C.UTF-8";
?column?
----------
t

Debian 10.13 (glibc 2.28)
=> select 'A' < E'\U0001D400' collate "C.UTF-8";
?column?
----------
f

Debian 11.6 (glibc 2.31)
=> select 'A' < E'\U0001D400' collate "C.UTF-8";
?column?
----------
f

Ubuntu 22.04 (glibc 2.35)
=> select 'A' < E'\U0001D400' collate "C.UTF-8";
?column?
----------
t

So I suggest the attached patch to no longer exclude these collations
from the generic versioning.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachments:

versioning-c-dot-something-collations.patchtext/plainDownload

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 092b620673..dd9c3a0c10 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1727,7 +1727,6 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 #endif
 		if (collprovider == COLLPROVIDER_LIBC &&
 			pg_strcasecmp("C", collcollate) != 0 &&
-			pg_strncasecmp("C.", collcollate, 2) != 0 &&
 			pg_strcasecmp("POSIX", collcollate) != 0)
 	{
 #if defined(__GLIBC__)

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Daniel Verite (#1)

Re: pg_collation.collversion for C.UTF-8

On Wed, Apr 19, 2023 at 12:36 AM Daniel Verite <daniel@manitou-mail.org> wrote:

This seems to be based on the idea that C.* collations provide an
immutable sort like "C", but it appears that it's not the case.

Hmm. It seems I added that exemption initially for FreeBSD only in
ca051d8b101, and then merged the cases for several OSes in
beb4480c853.

It's extremely surprising to me that the sort order changed. I
expected the sort order to be code point order:

https://sourceware.org/glibc/wiki/Proposals/C.UTF-8

One interesting thing is that it seems that it might have been
independently invented by Debian (?) and then harmonised with glibc
2.35:

https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1871363.html

Was the earlier Debian version buggy, or did it simply have a
different idea of what the sort order should be, intentionally? Ugh.
From your examples, we can see that the older Debian system did not
have A < [some 4 digit code point], while the later version did (as
expected). If so then it might be tempting to *not* do what you're
suggesting, since the stated goal of the thing is to be stable from
now on. But it changed once in the early years of its existence!
Annoying.

Many OSes have a locale with this name. I don't know this history,
who did it first etc, but now I am wondering if they all took the
"obvious" interpretation, that it should be code-point based,
extrapolating from "C" (really memcmp order):

https://unix.stackexchange.com/questions/597962/how-widespread-is-the-c-utf-8-locale

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Thomas Munro (#2)

Re: pg_collation.collversion for C.UTF-8

On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote:

Many OSes have a locale with this name. I don't know this history,
who did it first etc, but now I am wondering if they all took the
"obvious" interpretation, that it should be code-point based,
extrapolating from "C" (really memcmp order):

memcmp() is not the same as code-point order in all encodings, right?

I've been thinking that we should have a "provider=none" for the
special cases that use memcmp(). It's not using libc as a collation
provider; it's really postgres in control of the semantics.

That would clean up the documentation and the code a bit, and make it
more clear which locales are being passed to the provider and which
ones aren't.

If we are passing it to a provider (e.g. "C.UTF-8"), we shouldn't make
unnecessary assumptions about what the provider will do with it.

For what it's worth, in my recent ICU language tag work, I special-
cased ICU locales with language "C" or "POSIX" to map to "en-US-u-va-
posix", disregarding everything else (collation attributes, etc.). I
believe that's the right thing based on the behavior I observed: for
the POSIX variant of en-US, ICU seems to disregard other things such as
case insensitivity. But it still ultimately goes to the provider and
ICU has particular rules for that locale -- I don't assume memcpy-like
semantics or code point order.

Regards,
Jeff Davis

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Jeff Davis (#3)

Re: pg_collation.collversion for C.UTF-8

On Wed, Apr 19, 2023 at 1:30 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote:

Many OSes have a locale with this name. I don't know this history,
who did it first etc, but now I am wondering if they all took the
"obvious" interpretation, that it should be code-point based,
extrapolating from "C" (really memcmp order):

memcmp() is not the same as code-point order in all encodings, right?

Right. I wasn't trying to suggest that *we* should assume that, I was
just thinking out loud about how a libc implementor would surely think
that a "C.encoding" should work, in the spirit of "C", given that the
standard doesn't tell us IIUC. It looks like for technical reasons
inside glibc, that couldn't be done before 2.35:

https://sourceware.org/bugzilla/show_bug.cgi?id=17318

That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied by
the glibc project) isn't supposed to be versioned, but it's extremely
unfortunate that a bunch of OSes (Debian and maybe more) have been
sorting text in some other order under that name for years.

I've been thinking that we should have a "provider=none" for the
special cases that use memcmp(). It's not using libc as a collation
provider; it's really postgres in control of the semantics.

Yeah, interesting idea.

Daniel Verite

daniel@manitou-mail.org

over 2 years ago

In reply to: Thomas Munro (#4)

Re: pg_collation.collversion for C.UTF-8

Thomas Munro wrote:

It looks like for technical reasons
inside glibc, that couldn't be done before 2.35:

https://sourceware.org/bugzilla/show_bug.cgi?id=17318

That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
by the glibc project) isn't supposed to be versioned, but it's
extremely unfortunate that a bunch of OSes (Debian and maybe more)
have been sorting text in some other order under that name for
years.

Yes. This is consistent with Debian/Ubuntu patches in
glibc/localedata/locales/C

glibc-2.35 is not patched, and upstream has this:
LC_COLLATE
% The keyword 'codepoint_collation' in any part of any LC_COLLATE
% immediately discards all collation information and causes the
% locale to use strcmp/wcscmp for collation comparison. This is
% exactly what is needed for C (ASCII) or C.UTF-8.
codepoint_collation
END LC_COLLATE

But in older versions, glibc doesn't have the locales/C data file.
Debian adds it in debian/patches/localedata/C with that kind of
content:

* glibc 2.31 Debian 11
LC_COLLATE
order_start forward
<U0000>
..
<U007F>
<U0080>
..
<U00FF>
etc...

But as explained in the above-linked bugzilla entry, that did not
result in true byte-comparison semantics, for several reasons
that got fixed in 2.35.

So this looks like a solved problem for anyone starting to use these
collation with glibc 2.35 or newer (or other OSes that don't have a
compatibility issue with them in the first place).
But Debian/Ubuntu users upgrading from the older C.* to 2.35+ will not
be having the normal warning about the need to reindex.

I understand that my proposal to version C.* like any other collation
might be erring on the side of caution, but ignoring these collation
changes on at least one major OS does not feel right either.
Maybe we should consider doing platform-dependent checks?

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Thomas Munro (#4)

Re: pg_collation.collversion for C.UTF-8

On Wed, 2023-04-19 at 14:07 +1200, Thomas Munro wrote:

That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
by
the glibc project) isn't supposed to be versioned, but it's extremely
unfortunate that a bunch of OSes (Debian and maybe more) have been
sorting text in some other order under that name for years.

What should we do with locales like C.UTF-8 in both libc and ICU?

We either need to capture it and use the memcmp/pg_ascii code paths so
it doesn't use the provider at all (like C); or if we send it to the
provider, we can't have too many expectations about what will be done
with it (even if we know what "should" happen).

If we capture it like the C locale, then where do we draw the line? Any
locale that begins with "C."? What if the language part is C but there
is some other part to the locale? What about lower case? Should all of
these apply the same way except with POSIX? What about backwards
compatibility?

If we pass it to the provider:

* ICU: Recent versions of ICU don't recognize C.UTF-8 at all, and if
you try to open it, you'll get the root collator (with warning or
error, which is not great for such a common locale name). ICU versions
63 and earlier recognize C.UTF-8 as en-US-u-va-posix (a.k.a.
en_US_POSIX), which has some adjustments to match expectations of C
sorting (e.g. upper case first).

* libc: problems as raised in this thread.

Regards,
Jeff Davis

Tom Lane

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Jeff Davis (#6)

Re: pg_collation.collversion for C.UTF-8

Jeff Davis <pgsql@j-davis.com> writes:

What should we do with locales like C.UTF-8 in both libc and ICU?

I vote for passing those to the existing C-specific code paths,
whereever we have any (not sure that we do for <ctype.h> functionality).
The semantics are quite well-defined and I can see no good coming of
allowing either provider to mess with them.

If we capture it like the C locale, then where do we draw the line? Any
locale that begins with "C."? What if the language part is C but there
is some other part to the locale? What about lower case? Should all of
these apply the same way except with POSIX? What about backwards
compatibility?

Probably "C", or "C.anything", or "POSIX", or "POSIX.anything".
Case-independent might be good, but we haven't accepted such in
the past, so I don't feel strongly about it. (Arguably, passing
lower case "c" to the provider would provide an "out" to anybody
who dislikes our choices here.)

regards, tom lane

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Tom Lane (#7)

2 attachment(s)

Re: pg_collation.collversion for C.UTF-8

On Thu, 2023-05-25 at 14:48 -0400, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

What should we do with locales like C.UTF-8 in both libc and ICU?

I vote for passing those to the existing C-specific code paths,

Great, this would be a big step toward solving the ICU usability issues
in this thread:

/messages/by-id/000b01d97465$c34bbd60$49e33820$@pcorp.us

Probably "C", or "C.anything", or "POSIX", or "POSIX.anything".
Case-independent might be good, but we haven't accepted such in
the past, so I don't feel strongly about it. (Arguably, passing
lower case "c" to the provider would provide an "out" to anybody
who dislikes our choices here.)

Patch attached with your suggestions. It's based on the first patch in
the series I posted here:

/messages/by-id/a4388fa3acabf7794ac39fdb471ad97eebdfbe11.camel@j-davis.com

We still need to consider backwards compatibility. If someone has a
collation with locale name C.UTF-8 in an earlier version, any change to
the interpretation of that locale name after an upgrade carries a
corruption risk. The risks are different in ICU vs libc:

For ICU: iculocale=C in an earlier version was a mistake that must
have been explicitly requested by the user. However, if such a mistake
was made, the indexes would have been created using the ICU root
locale, which is very different from the C locale. So reinterpreting
iculocale=C as memcmp() would be likely to result in index corruption.
Patch 0002 (also based on a patch from the series linked above) solves
this with a pg_upgrade check for iculocale=C in versions 15 and
earlier. The upgrade check is not likely to affect many users, and
those it does affect have a mis-defined collation and would benefit
from the check.

For libc: this change may affect any user who happened to have
LANG=C.UTF-8 in their environment at initdb time, which is probably a
lot of users, and some buildfarm members. However, the average risk
seems to be much lower, because we've gone a long time with the
assumption that C.UTF-8 has the same behavior as C, and this only
recently came up. Also, I'm not sure how obscure the cases are even if
there is a difference; perhaps they don't often occur in practice? It's
not clear to me how we mitigate this risk further, though.

Regards,
Jeff Davis

Attachments:

0001-Interpret-C-locales-consistently-between-ICU-and-lib.patchtext/x-patch; charset=UTF-8; name=0001-Interpret-C-locales-consistently-between-ICU-and-lib.patchDownload

From c6721272200c14931ad757185a3aaeb615c432ed Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 24 Apr 2023 15:46:17 -0700
Subject: [PATCH 1/2] Interpret C locales consistently between ICU and libc.

Treat a locale named C, C.anything, POSIX, or POSIX.anything as
equivalent to the C locale; implemented with built-in semantics
(memcmp() for collation and pg_ascii_*() for ctype).

Such locales are not passed to the provider at all, so have identical
behavior regardless of whether it's declared with provider ICU or
libc.

Previously, only C and POSIX locales had this behavior (not
e.g. "C.UTF-8"), and only if the provider was declared as libc. That
caused problems on libc for locales like C.UTF-8, which may have
subtly different behavior in some versions of libc; and it caused
problems on ICU because newer versions don't recognize C locales.

Discussion: https://postgr.es/m/1559006.1685040536@sss.pgh.pa.us
Discussion: https://postgr.es/m/c840107b-4cb9-c8e9-abb7-1d8c5e0d51df%40enterprisedb.com
Discussion: https://postgr.es/m/87v8hoexdv.fsf@news-spur.riddles.org.uk
---
 doc/src/sgml/charset.sgml                     |   3 +-
 src/backend/commands/collationcmds.c          |  42 +++---
 src/backend/commands/dbcommands.c             |  41 +++---
 src/backend/utils/adt/pg_locale.c             | 126 +++++++++++++-----
 src/backend/utils/init/postinit.c             |   4 +-
 src/backend/utils/mb/mbutils.c                |   3 +-
 src/include/utils/pg_locale.h                 |   1 +
 .../regress/expected/collate.icu.utf8.out     |   6 +
 src/test/regress/expected/collate.out         |   5 +
 src/test/regress/sql/collate.icu.utf8.sql     |   4 +
 src/test/regress/sql/collate.sql              |   5 +
 11 files changed, 167 insertions(+), 73 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index ed84465996..8ba3117557 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -136,7 +136,8 @@ initdb --locale=sv_SE
    <para>
     If you want the system to behave as if it had no locale support,
     use the special locale name <literal>C</literal>, or equivalently
-    <literal>POSIX</literal>.
+    <literal>POSIX</literal>. An encoding may also be appended, for
+    example <literal>C.UTF-8</literal>.
    </para>
 
    <para>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 2969a2bb21..a451ae8843 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -264,26 +264,38 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 						(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 						 errmsg("parameter \"locale\" must be specified")));
 
-			/*
-			 * During binary upgrade, preserve the locale string. Otherwise,
-			 * canonicalize to a language tag.
-			 */
-			if (!IsBinaryUpgrade)
+			if (locale_name_is_c(colliculocale))
 			{
-				char	   *langtag = icu_language_tag(colliculocale,
-													   icu_validation_level);
-
-				if (langtag && strcmp(colliculocale, langtag) != 0)
+				if (!collisdeterministic)
+					ereport(ERROR,
+							(errmsg("nondeterministic collations not supported for C or POSIX locale")));
+				if (collicurules != NULL)
+					ereport(ERROR,
+							(errmsg("RULES not supported for C or POSIX locale")));
+			}
+			else
+			{
+				/*
+				 * During binary upgrade, preserve the locale
+				 * string. Otherwise, canonicalize to a language tag.
+				 */
+				if (!IsBinaryUpgrade)
 				{
-					ereport(NOTICE,
-							(errmsg("using standard form \"%s\" for locale \"%s\"",
-									langtag, colliculocale)));
+					char	   *langtag = icu_language_tag(colliculocale,
+														   icu_validation_level);
+
+					if (langtag && strcmp(colliculocale, langtag) != 0)
+					{
+						ereport(NOTICE,
+								(errmsg("using standard form \"%s\" for locale \"%s\"",
+										langtag, colliculocale)));
 
-					colliculocale = langtag;
+						colliculocale = langtag;
+					}
 				}
-			}
 
-			icu_validate_locale(colliculocale);
+				icu_validate_locale(colliculocale);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 99d4080ea9..601a08ef11 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1058,27 +1058,36 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 					 errmsg("ICU locale must be specified")));
 
-		/*
-		 * During binary upgrade, or when the locale came from the template
-		 * database, preserve locale string. Otherwise, canonicalize to a
-		 * language tag.
-		 */
-		if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+		if (locale_name_is_c(dbiculocale))
 		{
-			char	   *langtag = icu_language_tag(dbiculocale,
-												   icu_validation_level);
-
-			if (langtag && strcmp(dbiculocale, langtag) != 0)
+			if (dbicurules != NULL)
+				ereport(ERROR,
+						(errmsg("ICU_RULES not supported for C or POSIX locale")));
+		}
+		else
+		{
+			/*
+			 * During binary upgrade, or when the locale came from the
+			 * template database, preserve locale string. Otherwise,
+			 * canonicalize to a language tag.
+			 */
+			if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
 			{
-				ereport(NOTICE,
-						(errmsg("using standard form \"%s\" for locale \"%s\"",
-								langtag, dbiculocale)));
+				char	   *langtag = icu_language_tag(dbiculocale,
+													   icu_validation_level);
+
+				if (langtag && strcmp(dbiculocale, langtag) != 0)
+				{
+					ereport(NOTICE,
+							(errmsg("using standard form \"%s\" for locale \"%s\"",
+									langtag, dbiculocale)));
 
-				dbiculocale = langtag;
+					dbiculocale = langtag;
+				}
 			}
-		}
 
-		icu_validate_locale(dbiculocale);
+			icu_validate_locale(dbiculocale);
+		}
 	}
 	else
 	{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 31e3b16ae0..2f2734a405 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1239,15 +1239,19 @@ lookup_collation_cache(Oid collation, bool set_flags)
 			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_collctype);
 			collctype = TextDatumGetCString(datum);
 
-			cache_entry->collate_is_c = ((strcmp(collcollate, "C") == 0) ||
-										 (strcmp(collcollate, "POSIX") == 0));
-			cache_entry->ctype_is_c = ((strcmp(collctype, "C") == 0) ||
-									   (strcmp(collctype, "POSIX") == 0));
+			cache_entry->collate_is_c = locale_name_is_c(collcollate);
+			cache_entry->ctype_is_c = locale_name_is_c(collctype);
 		}
 		else
 		{
-			cache_entry->collate_is_c = false;
-			cache_entry->ctype_is_c = false;
+			Datum		datum;
+			const char *colliculocale;
+
+			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_colliculocale);
+			colliculocale = TextDatumGetCString(datum);
+
+			cache_entry->collate_is_c = locale_name_is_c(colliculocale);
+			cache_entry->ctype_is_c = cache_entry->collate_is_c;
 		}
 
 		cache_entry->flags_valid = true;
@@ -1258,6 +1262,22 @@ lookup_collation_cache(Oid collation, bool set_flags)
 	return cache_entry;
 }
 
+/*
+ * Check if the locale name should be handled like the C locale.
+ *
+ * If so, the locale should be handled with built-in memcmp() and
+ * pg_ascii_*(); otherwise, the locale should be handled by the collation
+ * provider.
+ */
+bool
+locale_name_is_c(const char *locale)
+{
+	if (strcmp(locale, "C") == 0 || strncmp(locale, "C.", 2) == 0 ||
+		strcmp(locale, "POSIX") == 0 || strncmp(locale, "POSIX.", 6) == 0)
+		return true;
+
+	return false;
+}
 
 /*
  * Detect whether collation's LC_COLLATE property is C
@@ -1279,23 +1299,30 @@ lc_collate_is_c(Oid collation)
 	if (collation == DEFAULT_COLLATION_OID)
 	{
 		static int	result = -1;
-		char	   *localeptr;
-
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return false;
+		const char *localeptr;
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_COLLATE, NULL);
-		if (!localeptr)
-			elog(ERROR, "invalid LC_COLLATE setting");
-
-		if (strcmp(localeptr, "C") == 0)
-			result = true;
-		else if (strcmp(localeptr, "POSIX") == 0)
-			result = true;
+
+		if (default_locale.provider == COLLPROVIDER_ICU)
+		{
+#ifdef USE_ICU
+			localeptr = default_locale.info.icu.locale;
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
+		}
 		else
-			result = false;
+		{
+			localeptr = setlocale(LC_COLLATE, NULL);
+			if (!localeptr)
+				elog(ERROR, "invalid LC_COLLATE setting");
+		}
+
+		result = locale_name_is_c(localeptr);
+
 		return (bool) result;
 	}
 
@@ -1332,23 +1359,30 @@ lc_ctype_is_c(Oid collation)
 	if (collation == DEFAULT_COLLATION_OID)
 	{
 		static int	result = -1;
-		char	   *localeptr;
-
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return false;
+		const char *localeptr;
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_CTYPE, NULL);
-		if (!localeptr)
-			elog(ERROR, "invalid LC_CTYPE setting");
-
-		if (strcmp(localeptr, "C") == 0)
-			result = true;
-		else if (strcmp(localeptr, "POSIX") == 0)
-			result = true;
+
+		if (default_locale.provider == COLLPROVIDER_ICU)
+		{
+#ifdef USE_ICU
+			localeptr = default_locale.info.icu.locale;
+#else
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("ICU is not supported in this build")));
+#endif
+		}
 		else
-			result = false;
+		{
+			localeptr = setlocale(LC_CTYPE, NULL);
+			if (!localeptr)
+				elog(ERROR, "invalid LC_CTYPE setting");
+		}
+
+		result = locale_name_is_c(localeptr);
+
 		return (bool) result;
 	}
 
@@ -1375,7 +1409,13 @@ make_icu_collator(const char *iculocstr,
 #ifdef USE_ICU
 	UCollator  *collator;
 
-	collator = pg_ucol_open(iculocstr);
+	if (locale_name_is_c(iculocstr))
+	{
+		Assert(icurules == NULL);
+		collator = NULL;
+	}
+	else
+		collator = pg_ucol_open(iculocstr);
 
 	/*
 	 * If rules are specified, we extract the rules of the standard collation,
@@ -1525,6 +1565,9 @@ pg_newlocale_from_collation(Oid collid)
 			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_collctype);
 			collctype = TextDatumGetCString(datum);
 
+			Assert(!locale_name_is_c(collcollate));
+			Assert(!locale_name_is_c(collctype));
+
 			if (strcmp(collcollate, collctype) == 0)
 			{
 				/* Normal case where they're the same */
@@ -1581,6 +1624,8 @@ pg_newlocale_from_collation(Oid collid)
 			datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_colliculocale);
 			iculocstr = TextDatumGetCString(datum);
 
+			Assert(!locale_name_is_c(iculocstr));
+
 			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collicurules, &isnull);
 			if (!isnull)
 				icurules = TextDatumGetCString(datum);
@@ -1650,6 +1695,9 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 {
 	char	   *collversion = NULL;
 
+	if (locale_name_is_c(collcollate))
+		return NULL;
+
 #ifdef USE_ICU
 	if (collprovider == COLLPROVIDER_ICU)
 	{
@@ -1667,10 +1715,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	}
 	else
 #endif
-		if (collprovider == COLLPROVIDER_LIBC &&
-			pg_strcasecmp("C", collcollate) != 0 &&
-			pg_strncasecmp("C.", collcollate, 2) != 0 &&
-			pg_strcasecmp("POSIX", collcollate) != 0)
+		if (collprovider == COLLPROVIDER_LIBC)
 	{
 #if defined(__GLIBC__)
 		/* Use the glibc version because we don't have anything better. */
@@ -2457,6 +2502,13 @@ pg_ucol_open(const char *loc_str)
 	if (loc_str == NULL)
 		elog(ERROR, "opening default collator is not supported");
 
+	/*
+	 * Must never open special values C or POSIX, which are treated specially
+	 * and not passed to the provider.
+	 */
+	if (locale_name_is_c(loc_str))
+		elog(ERROR, "unexpected ICU locale string: %s", loc_str);
+
 	/*
 	 * In ICU versions 54 and earlier, "und" is not a recognized spelling of
 	 * the root locale. If the first component of the locale is "und", replace
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 6856ed99e7..92928133c0 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -419,9 +419,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 						   " which is not recognized by setlocale().", ctype),
 				 errhint("Recreate the database with another locale or install the missing locale.")));
 
-	if (strcmp(ctype, "C") == 0 ||
-		strcmp(ctype, "POSIX") == 0)
-		database_ctype_is_c = true;
+	database_ctype_is_c = locale_name_is_c(ctype);
 
 	if (dbform->datlocprovider == COLLPROVIDER_ICU)
 	{
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index 67a1ab2ab2..9a54a952e0 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -39,6 +39,7 @@
 #include "mb/pg_wchar.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/pg_locale.h"
 #include "utils/syscache.h"
 #include "varatt.h"
 
@@ -1239,7 +1240,7 @@ pg_bind_textdomain_codeset(const char *domainname)
 #ifndef WIN32
 	const char *ctype = setlocale(LC_CTYPE, NULL);
 
-	if (pg_strcasecmp(ctype, "C") == 0 || pg_strcasecmp(ctype, "POSIX") == 0)
+	if (locale_name_is_c(ctype))
 #endif
 		if (encoding != PG_SQL_ASCII &&
 			raw_pg_bind_textdomain_codeset(domainname, encoding))
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index e2a7243542..0e26346546 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT bool database_ctype_is_c;
 extern bool check_locale(int category, const char *locale, char **canonname);
 extern char *pg_perm_setlocale(int category, const char *locale);
 
+extern bool locale_name_is_c(const char *locale);
 extern bool lc_collate_is_c(Oid collation);
 extern bool lc_ctype_is_c(Oid collation);
 
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index c658ee1404..79ce33abbd 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1043,12 +1043,18 @@ ERROR:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
 ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'C', deterministic = false); -- fails
+ERROR:  nondeterministic collations not supported for C or POSIX locale
+CREATE COLLATION testx (provider = icu, locale = 'C', rules = '&V << w <<< W'); -- fails
+ERROR:  RULES not supported for C or POSIX locale
 RESET icu_validation_level;
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
 WARNING:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 WARNING:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = 'C.UTF-8'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'POSIX'); DROP COLLATION testx;
 CREATE COLLATION test4 FROM nonsense;
 ERROR:  collation "nonsense" for encoding "UTF8" does not exist
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index 0649564485..e2d0a39732 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -649,6 +649,11 @@ EXPLAIN (COSTS OFF)
    ->  Seq Scan on collate_test10
 (3 rows)
 
+-- test alternate spellings of special locale C
+CREATE COLLATION coll_c_locale ( LOCALE = "C.something" );
+DROP COLLATION coll_c_locale;
+CREATE COLLATION coll_c_locale ( LOCALE = "POSIX.something" );
+DROP COLLATION coll_c_locale;
 -- CREATE/DROP COLLATION
 CREATE COLLATION mycoll1 FROM "C";
 CREATE COLLATION mycoll2 ( LC_COLLATE = "POSIX", LC_CTYPE = "POSIX" );
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 7bd0901281..adc6b7deec 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -379,9 +379,13 @@ CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, nee
 SET icu_validation_level = ERROR;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'C', deterministic = false); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'C', rules = '&V << w <<< W'); -- fails
 RESET icu_validation_level;
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'C.UTF-8'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'POSIX'); DROP COLLATION testx;
 
 CREATE COLLATION test4 FROM nonsense;
 CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index c3d40fc195..10ff532169 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -241,6 +241,11 @@ EXPLAIN (COSTS OFF)
 EXPLAIN (COSTS OFF)
   SELECT * FROM collate_test10 ORDER BY x DESC, y COLLATE "C" ASC NULLS FIRST;
 
+-- test alternate spellings of special locale C
+CREATE COLLATION coll_c_locale ( LOCALE = "C.something" );
+DROP COLLATION coll_c_locale;
+CREATE COLLATION coll_c_locale ( LOCALE = "POSIX.something" );
+DROP COLLATION coll_c_locale;
 
 -- CREATE/DROP COLLATION
 
-- 
2.34.1

0002-pg_upgrade-check-for-ICU-locale-C-in-versions-15-and.patchtext/x-patch; charset=UTF-8; name=0002-pg_upgrade-check-for-ICU-locale-C-in-versions-15-and.patchDownload

From aaa116db0e1409b95bf8eb3612d8af9dc8de8abb Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 24 May 2023 12:15:33 -0700
Subject: [PATCH 2/2] pg_upgrade: check for ICU locale C in versions 15 and
 earlier.

ICU collations with locale "C" (and equivalently "POSIX") were not
supported before version 16, but could still be created. Users with
such collations would not get expected behavior.

Reject upgrading clusters with such collations, so that we can support
"C" and "POSIX" locales for the ICU provider without risk of
corrupting indexes during pg_upgrade.
---
 src/bin/pg_upgrade/check.c | 100 +++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 64024e3b9e..e7163c4c7e 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -26,6 +26,7 @@ static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_composite_data_type_usage(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_aclitem_data_type_usage(ClusterInfo *cluster);
+static void check_icu_c_before_16(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
@@ -164,6 +165,9 @@ check_and_dump_old_cluster(bool live_check)
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 905)
 		check_for_pg_role_prefix(&old_cluster);
 
+	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1500)
+		check_icu_c_before_16(&old_cluster);
+
 	if (GET_MAJOR_VERSION(old_cluster.major_version) == 904 &&
 		old_cluster.controldata.cat_ver < JSONB_FORMAT_CHANGE_CAT_VER)
 		check_for_jsonb_9_4_usage(&old_cluster);
@@ -1233,6 +1237,102 @@ check_for_aclitem_data_type_usage(ClusterInfo *cluster)
 		check_ok();
 }
 
+/*
+ * check_icu_c_before_16
+ *
+ *  Version 16 adds support for the ICU C locale, but it was possible to
+ *  (incorrectly) create it in prior versions. Check for this invalid ICU
+ *  locale name, or variant spellings like "C.UTF-8", in version 15 and
+ *  earlier.
+ */
+static void
+check_icu_c_before_16(ClusterInfo *cluster)
+{
+	PGresult   *dat_res;
+	PGconn	   *template1_conn = connectToServer(cluster, "template1");
+	int			dat_ntups;
+	int			i_datname;
+	FILE	   *script = NULL;
+	char		output_path[MAXPGPATH];
+
+	prep_status("Checking for ICU collations with locale \"C\"");
+
+	snprintf(output_path, sizeof(output_path), "%s/%s",
+			 log_opts.basedir,
+			 "icu_c_before_16.txt");
+
+	/* check pg_database */
+	dat_res = executeQueryOrDie(template1_conn,
+								"SELECT datname "
+								"FROM pg_catalog.pg_database "
+								"WHERE datlocprovider='i' "
+								"AND daticulocale ~ '^C(\\..*)?$' "
+								"OR daticulocale ~ '^POSIX(\\..*)?$'");
+
+	i_datname = PQfnumber(dat_res, "datname");
+
+	dat_ntups = PQntuples(dat_res);
+
+	for (int rowno = 0; rowno < dat_ntups; rowno++)
+	{
+		if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+			pg_fatal("could not open file \"%s\": %s",
+					 output_path, strerror(errno));
+		fprintf(script, "default collation for database %s\n",
+				PQgetvalue(dat_res, rowno, i_datname));
+	}
+
+	PQclear(dat_res);
+	PQfinish(template1_conn);
+
+	/* check pg_collation in each database */
+	for (int dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+	{
+		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+		PGconn	   *db_conn = connectToServer(cluster, active_db->db_name);
+		PGresult   *coll_res;
+		int			coll_ntups;
+		int			i_collid;
+		int			i_collname;
+
+		coll_res = executeQueryOrDie(db_conn,
+									 "SELECT oid as collid, collname "
+									 "FROM pg_catalog.pg_collation "
+									 "WHERE collprovider='i' "
+									 "AND colliculocale ~ '^C(\\..*)?$' "
+									 "OR colliculocale ~ '^POSIX(\\..*)?$'");
+
+		i_collid = PQfnumber(coll_res, "collid");
+		i_collname = PQfnumber(coll_res, "collname");
+
+		coll_ntups = PQntuples(coll_res);
+
+		for (int rowno = 0; rowno < coll_ntups; rowno++)
+		{
+			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+				pg_fatal("could not open file \"%s\": %s",
+						 output_path, strerror(errno));
+			fprintf(script, "database %s collation %s (oid=%s)\n",
+					active_db->db_name,
+					PQgetvalue(coll_res, rowno, i_collname),
+					PQgetvalue(coll_res, rowno, i_collid));
+		}
+
+		PQclear(coll_res);
+		PQfinish(db_conn);
+	}
+
+	if (script)
+	{
+		pg_log(PG_REPORT, "fatal");
+		pg_fatal("Your installation contains ICU collations with the locale \"C\" or \"POSIX\",\n"
+				 "which are not supported until version 16. Earlier versions using ICU \n"
+				 "collations with the \"C\" or \"POSIX\" locales cannot be upgraded.");
+	}
+
+	check_ok();
+}
+
 /*
  * check_for_jsonb_9_4_usage()
  *
-- 
2.34.1

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Jeff Davis (#8)

Re: pg_collation.collversion for C.UTF-8

On Fri, 2023-05-26 at 10:43 -0700, Jeff Davis wrote:

We still need to consider backwards compatibility. If someone has a
collation with locale name C.UTF-8 in an earlier version, any change
to
the interpretation of that locale name after an upgrade carries a
corruption risk. The risks are different in ICU vs libc:

...

For libc: this change may affect any user who happened to have
LANG=C.UTF-8 in their environment at initdb time, which is probably a
lot of users, and some buildfarm members. However, the average risk
seems to be much lower, because we've gone a long time with the
assumption that C.UTF-8 has the same behavior as C, and this only
recently came up. Also, I'm not sure how obscure the cases are even
if
there is a difference; perhaps they don't often occur in practice?
It's
not clear to me how we mitigate this risk further, though.

We can avoid this risk by converting C.anything or POSIX.anything to
plain "C" or "POSIX", respectively, for new collations before storing
the string in the catalog. For upgraded collations, we can preserve the
existing locale name. When opening the locale, we would still only
recognize plain "C" and "POSIX" as the C locale.

That would be more consistent behavior for new users, without creating
a backwards compatibility problem for existing users who happened to
create a collation with C.UTF-8.

For ICU users, we'd still need the upgrade check, because even the "C"
locale was not implemented with memcmp in prior versions. But I think
that's fine and should be done anyway, as the behavior in that case was
incorrect and was almost certainly a mistake by the user.

Regards,
Jeff Davis

#10

Daniel Verite

daniel@manitou-mail.org

over 2 years ago

In reply to: Jeff Davis (#9)

Re: pg_collation.collversion for C.UTF-8

Jeff Davis wrote:

For libc: this change may affect any user who happened to have
LANG=C.UTF-8 in their environment at initdb time, which is probably a
lot of users, and some buildfarm members. However, the average risk
seems to be much lower, because we've gone a long time with the
assumption that C.UTF-8 has the same behavior as C, and this only
recently came up.

Currently, neither lc_collate_is_c() nor lookup_collation_cache()
think that C.UTF-8 is a C collation, since they do that kind of test:

if (strcmp(localeptr, "C") == 0)
result = true;
else if (strcmp(localeptr, "POSIX") == 0)
result = true;
else
result = false;

What is relatively new (v15) is that we compute a version for libc
collations in get_collation_actual_version(), with code that assumes
that C.* does not need a version, implying that it's immune to
Unicode changes. What came up in this thread is that this assumption
is not true for at least one major platform: Debian/Ubuntu for
releases occurring before 2022 (glibc < 2.35).

We can avoid this risk by converting C.anything or POSIX.anything to
plain "C" or "POSIX", respectively, for new collations before storing
the string in the catalog. For upgraded collations, we can preserve the
existing locale name. When opening the locale, we would still only
recognize plain "C" and "POSIX" as the C locale.

Then Postgres would not sort the same as the operating system with the
same locale, at least on some OS. Concerning glibc, after waiting a
few years, glibc<2.35 will be obsolete, and C.UTF-8 sorting like C
will happen by itself.
But in the meantime, personally I don't quite see why Postgres should
start forcing C.UTF-8 to sort differently in the database than in the
OS.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#11

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Daniel Verite (#10)

Re: pg_collation.collversion for C.UTF-8

On Mon, 2023-06-05 at 19:43 +0200, Daniel Verite wrote:

But in the meantime, personally I don't quite see why Postgres should
start forcing C.UTF-8 to sort differently in the database than in the
OS.

I can see both points of view. It could be surprising to users if
C.UTF-8 does not sort like C/memcmp, or surprising if it changes out
from under them. It could also be surprising that it wouldn't sort like
the current OS's libc interpretation of C.UTF-8.

What about ICU? How should provider=icu locale=C.UTF-8 behave? We
could:

a. Just pass it to the provider and see what happens (older versions of
ICU would interpret it as en-US-u-va-posix; newer versions would give
the root locale).

b. Consistently interpret it as en-US-u-va-posix.

c. Don't pass it to the provider at all and treat it with memcmp
semantics.

Regards,
Jeff Davis

#12

Joe Conway

mail@joeconway.com

over 2 years ago

In reply to: Jeff Davis (#11)

Re: pg_collation.collversion for C.UTF-8

On 6/6/23 15:23, Jeff Davis wrote:

On Mon, 2023-06-05 at 19:43 +0200, Daniel Verite wrote:

But in the meantime, personally I don't quite see why Postgres should
start forcing C.UTF-8 to sort differently in the database than in the
OS.

I can see both points of view. It could be surprising to users if
C.UTF-8 does not sort like C/memcmp, or surprising if it changes out
from under them. It could also be surprising that it wouldn't sort like
the current OS's libc interpretation of C.UTF-8.

What about ICU? How should provider=icu locale=C.UTF-8 behave? We
could:

a. Just pass it to the provider and see what happens (older versions of
ICU would interpret it as en-US-u-va-posix; newer versions would give
the root locale).

b. Consistently interpret it as en-US-u-va-posix.

c. Don't pass it to the provider at all and treat it with memcmp
semantics.

Personally I think this should be (a). However we should also clearly
document that the semantics of such is provider/OS dependent and
therefore may not be what is expected/desired.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#13

Daniel Verite

daniel@manitou-mail.org

over 2 years ago

In reply to: Jeff Davis (#11)

Re: pg_collation.collversion for C.UTF-8

Jeff Davis wrote:

What about ICU? How should provider=icu locale=C.UTF-8 behave? We
could:

a. Just pass it to the provider and see what happens (older versions of
ICU would interpret it as en-US-u-va-posix; newer versions would give
the root locale).

b. Consistently interpret it as en-US-u-va-posix.

c. Don't pass it to the provider at all and treat it with memcmp
semantics.

I think b) and c) are quite problematic.

First, en-US-u-va-posix does not sort like C.UTF-8 in glibc.
For one thing it seems that en-US-u-va-posix assigns zero weights to
some codepoints, which makes it semantically definitely different.
For instance consider ZERO WIDTH SPACE (U+200B):

postgres=# select 'ab' < E'a\u200Ba' COLLATE "C.utf8";
?column?
----------
t

postgres=# select 'ab' < E'a\u200Ba' COLLATE "en-US-u-va-posix-x-icu";
?column?
----------
f

Even if ICU folks refer to u-va-posix as approximating POSIX (as in [1]https://sourceforge.net/p/icu/mailman/icu-support/thread/CAN49p6pvQKP93j8LMn3zBWhpk-T0qYD0TCuiHMv6Z3UPGFh3QQ%40mail.gmail.com/#msg35638356),
for our purpose, either it sorts by codepoints or it does not,
and it clearly does not. One consequence is that
en-US-u-va-posix-x-icu needs to be versioned and indexes
depending on it need to be rebuilt on upgrades.
OTOH the goal with C.UTF-8, that is achieved in glibc>=2.35,
is to not need that.

Also it's not just about sorting. The semantics for the ctype-kind
functions are also different.

Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we only match
ASCII characters 0-9, or 10 codepoints.
With "en-US-u-va-posix-x-icu" we match 660 codepoints comprising
all the digit characters in all languages, plus a bunch of variants
for mathematical symbols.

For instance consider U+FF10 (Fullwidth Digit Zero):

postgres=# select E'\uff10' collate "C.utf8" ~ '\d';
?column?
----------
f

postgres=# select E'\uff10' collate "en-US-u-va-posix-x-icu" ~ '\d';
?column?
----------
t

If someone dumps their C.UTF-8 database to reload into an
ICU/en-US-u-va-posix database, there is no guarantee that it
even reloads because of semantic differences occuring
in constraints. In general it will surely reload, but the apps
might not behave the same with the new database
in a way that might be problematic.
It's fine if that's what they want and they explicitly ask for this
conversion, but it's not fine if it's postgres that has quietly
decided that for them.

About c) "don't pass it to the operators", it would be doable for
sorting (ignoring the "glibc before 2.35 does not sort like that" issue)
but not for the ctype-kind functions, where postgres' own code
doesn't have the Unicode knowledge.

About a) "just pass it to the provider", that seems better than b) or
c), but still, when a user asks for provider=icu locale=C.UTF-8,
it's a very probably a pilot error.

To me the user would be best served by a warning, if not an error,
informing them that it's quite probably not the combination they want.

[1]: https://sourceforge.net/p/icu/mailman/icu-support/thread/CAN49p6pvQKP93j8LMn3zBWhpk-T0qYD0TCuiHMv6Z3UPGFh3QQ%40mail.gmail.com/#msg35638356
https://sourceforge.net/p/icu/mailman/icu-support/thread/CAN49p6pvQKP93j8LMn3zBWhpk-T0qYD0TCuiHMv6Z3UPGFh3QQ%40mail.gmail.com/#msg35638356

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#14

Daniel Verite

daniel@manitou-mail.org

over 2 years ago

In reply to: Daniel Verite (#13)

Re: pg_collation.collversion for C.UTF-8

I wrote:

Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we
only match ASCII characters 0-9, or 10 codepoints. With
"en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the
digit characters in all languages, plus a bunch of variants for
mathematical symbols.

BTW this not specifically a C.UTF-8 versus "en-US-u-va-posix-x-icu"
difference.
If think that any glibc-based locale will consider that \d
in a regexp means [0-9], and that any ICU locale
will make \d match a much larger variety of characters.

While moving to ICU by default, we should expect that
differences like that will affect apps in a way that might be
more or less disruptive.

Another known difference it that upper() with ICU does not do a
character-by-character conversion, for instance:

WITH words(w) as (values('muß'),('ﬁnal'))
SELECT
w,
length(w),
upper(w collate "C.utf8") as "upper (libc)",
length(upper(w collate "C.utf8")),
upper(w collate "en-x-icu") as "upper (ICU)",
length(upper(w collate "en-x-icu"))
FROM words;

w | length | upper libc | length | upper ICU | length
------+--------+------------+--------+-----------+--------
muß | 3 | MUß | 3 | MUSS | 4
ﬁnal | 4 | ﬁNAL | 4 | FINAL | 5

The fact that the resulting string is larger that the original
might cause problems.

In general, we can't abstract from the fact that ICU semantics
are different.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#15

Peter Eisentraut

peter@eisentraut.org

over 2 years ago

In reply to: Jeff Davis (#11)

Re: pg_collation.collversion for C.UTF-8

On 06.06.23 21:23, Jeff Davis wrote:

What about ICU? How should provider=icu locale=C.UTF-8 behave? We
could:

It should be an error.

a. Just pass it to the provider and see what happens (older versions of
ICU would interpret it as en-US-u-va-posix; newer versions would give
the root locale).

This, but with an error instead of a warning.

#16

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Peter Eisentraut (#15)

Re: pg_collation.collversion for C.UTF-8

On Wed, 2023-06-07 at 23:28 +0200, Peter Eisentraut wrote:

On 06.06.23 21:23, Jeff Davis wrote:

What about ICU? How should provider=icu locale=C.UTF-8 behave? We
could:

It should be an error.

a. Just pass it to the provider and see what happens (older
versions of
ICU would interpret it as en-US-u-va-posix; newer versions would
give
the root locale).

This, but with an error instead of a warning.

If we do that, a plain "initdb -D data" will fail if LANG=C.UTF-8.

Perhaps that's fine, but certainly some buildfarm members would
complain. I'm not sure how many users would be affected.

Regards,
Jeff Davis

#17

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Daniel Verite (#5)

Re: pg_collation.collversion for C.UTF-8

On Sun, Apr 23, 2023 at 5:22 AM Daniel Verite <daniel@manitou-mail.org> wrote:

I understand that my proposal to version C.* like any other collation
might be erring on the side of caution, but ignoring these collation
changes on at least one major OS does not feel right either.
Maybe we should consider doing platform-dependent checks?

Hmm, OK let's explore that. What could we do that would be helpful
here, without affecting users of the "true" C.UTF-8 for the rest of
time? This is a Debian (+ downstream distro) only problem as far as
we know so far, and only for Debian 11 and older. Debian 12 was just
released the other day as the new stable, so the window of opportunity
to actually help anyone with our actions in this area is now beginning
to close, because we're really talking about new databases initdb'd
and new COLLATIONs CREATEd on Debian old stable after our next
(August) release. The window may be much wider for long term Ubuntu
releases. You're right that we could do something platform-specific
to help with that: we could change that code so that it captures the
version for C.* #if __GLIBC_MAJOR__ == 2 && __GLIBC_MINOR__ < 35 (or
we could parse the string returned by the runtime version function).
I don't recall immediately what our warning code does if it sees a
change from versioned to not versioned, but we could make sure it does
what you want here. That way we wouldn't burden all future users of
C.* with version warnings (because it'll be empty), but we'd catch
those doing Debian 11 -> 12 upgrades, and whatever Ubuntu upgrades
that corresponds to, etc. Is it worth it?

#18

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Thomas Munro (#17)

Re: pg_collation.collversion for C.UTF-8

On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote:

Hmm, OK let's explore that. What could we do that would be helpful
here, without affecting users of the "true" C.UTF-8 for the rest of
time?

Where is the "true" C.UTF-8 defined?

I assume you mean that the collation order can't (shouldn't, anyway)
change. But what about the ctype (upper/lower/initcap) behavior? Is
that also locked down for all time, or could it change if some new
unicode characters are added?

Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C,
but leave LC_CTYPE=C.UTF-8 as-is?

Regards,
Jeff Davis

#19

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Jeff Davis (#18)

Re: pg_collation.collversion for C.UTF-8

On Sat, Jun 17, 2023 at 10:03 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2023-06-15 at 19:15 +1200, Thomas Munro wrote:

Hmm, OK let's explore that. What could we do that would be helpful
here, without affecting users of the "true" C.UTF-8 for the rest of
time?

Where is the "true" C.UTF-8 defined?

By "true" I just meant glibc's official one, in contrast to the
imposter from Debian oldstable's patches. It's not defined by any
standard, but we only know how to record versions for glibc, FreeBSD
and Windows, and we know what the first two of those do for that
locale because they tell us (see below). For Windows, the manual's
BNF-style description of acceptable strings doesn't appear to accept
C.UTF-8 (but I haven't tried it).

I assume you mean that the collation order can't (shouldn't, anyway)
change. But what about the ctype (upper/lower/initcap) behavior? Is
that also locked down for all time, or could it change if some new
unicode characters are added?

Fair point. Considering that our collversion effectively functions as
a proxy for ctype version too, Daniel's patch makes a certain amount
of sense.

Our versioning is nominally based only on the collation category, not
locales more generally or any other category they contain (nominally,
as in: we named it collversion, and our code and comments and
discussions so far only contemplated collations in this context).
But, clearly, changes to underlying ctype data could also cause a
constraint CHECK (x ~ '[[:digit:]]') or a partial index with WHERE
(upper(x) <> 'ẞ') to be corrupted, which I'd considered to be a
separate topic, but Daniel's patch would cover with the same
mechanism. (Actually I just learned that [[:digit:]] is a bad example
on a glibc system, because they appear to have hardcoded a test for
[0-9] into their iswdigit_l() implementation, but FreeBSD gives the
Unicode answer, which is subject to change, and other classes may work
better on glibc.)

Would it be correct to interpret LC_COLLATE=C.UTF-8 as LC_COLLATE=C,
but leave LC_CTYPE=C.UTF-8 as-is?

Yes. The basic idea, at least for these two OSes, is that every
category behaves as if set to C, except LC_CTYPE. For implementation
reasons the glibc people don't quite describe it that way[1]https://sourceware.org/glibc/wiki/Proposals/C.UTF-8: for
LC_COLLATE, they decode to codepoints first and then compare those
using a new codepath they had to write for release 2.35, while FreeBSD
skips that useless step and compares raw UTF-8 bytes like
LC_COLLATE=C[2]https://reviews.freebsd.org/D17833. Which is the same, as RFC 3692 tells us:

o The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

It is interesting to note that LC_COLLATE=C, LC_CTYPE=C.UTF-8 is
equivalent, but would not get version warnings with Daniel's patch,
revealing that it's only a proxy. But recording ctype version
separately would be excessive.

For completeness, Solaris also has C.UTF-8. I can't read about what
it does, the release notes are behind a signup thing. *shrug* I
can't find any other systems that have it.

[1]: https://sourceware.org/glibc/wiki/Proposals/C.UTF-8
[2]: https://reviews.freebsd.org/D17833

#20

Jeff Davis

pgsql@j-davis.com

over 2 years ago

In reply to: Thomas Munro (#19)

Re: pg_collation.collversion for C.UTF-8

On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote:

Would it be correct to interpret LC_COLLATE=C.UTF-8 as
LC_COLLATE=C,
but leave LC_CTYPE=C.UTF-8 as-is?

Yes. The basic idea, at least for these two OSes, is that every
category behaves as if set to C, except LC_CTYPE.

If that's true, and we version C.UTF-8, then users could still get the
behavior they want, a stable collation order, and benefit from the
optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8.

The only caveat is to be careful with things that depend on ctype in
indexes and constraints. While still a problem, it's a smaller problem
than unversioned collation. We should think a little more about solving
it, because I think there's a strong case to be made that a default
collation of C and a database ctype of something else is a good
combination (it makes less sense for a case-insensitive collation, but
those aren't allowed as a default collation).

In any case, we're better off following the rule "version anything that
goes to any external provider, period". And by "version", I really mean
a best effort, because we don't always have great information, but I
think it's better to record what we do have than not. We have just seen
too many examples of weird behavior. On top of that, it's simply
inconsistent to assume that C=C.UTF-8 for collation version, but not
for the collation implementation.

Users might get frustrated that the collation for C.UTF-8 is versioned,
of course. But I don't think it will affect anyone for quite some time,
because existing users will have a datcollversion=NULL; so they won't
get the warnings until they refresh the versions (or create new
collations/databases), and then after that upgrade libc. Right? So they
should have time to adjust to use LC_COLLATE=C if that's what they
want.

An alternative would be to define lc_collate_is_c("C.UTF-8") == true
while leaving lc_ctype_is_c("C.UTF-8") == false and
get_collation_actual_version("C.UTF-8") == NULL. In that case we would
not be passing it to an external provider, so we don't have to version
it. But that might be a little too magical and I'm not inclined to do
that.

Another alternative would be to implement C.UTF-8 internally according
to the "true" semantics, if they are truly simple and well-defined and
stable. But I don't think ctype=C.UTF-8 is actually stable because new
characters can be added, right?

Regards,
Jeff Davis

#21

Daniel Verite

daniel@manitou-mail.org

over 2 years ago

In reply to: Thomas Munro (#17)

Re: pg_collation.collversion for C.UTF-8

Thomas Munro wrote:

What could we do that would be helpful here, without affecting users
of the "true" C.UTF-8 for the rest of time? This is a Debian (+
downstream distro) only problem as far as we know so far, and only
for Debian 11 and older.

It seems to include RedHat-based distros as well.

According to https://bugzilla.redhat.com/show_bug.cgi?id=902094
C.utf8 was added in 2015 and backported down to Fedora 22.
RHEL8 / CentOS 8 / Rocky8 provide glibc 2.28 with a C.utf8
locale. We can reasonably suspect that they've been using the same
kind of patches as Debian before version 12, with not all codepoints
being sorted bytewise.

RHEL9 comes with glibc 2.34 according to distrowatch [1]https://distrowatch.com/table.php?distribution=redhat and the
announcement [2]https://developers.redhat.com/articles/2022/05/18/whats-new-red-hat-enterprise-linux-9, so presumably it also lacks the "real" C.utf8
with bytewise sorting that glibc 2.35 upstream added.

[1]: https://distrowatch.com/table.php?distribution=redhat
[2]: https://developers.redhat.com/articles/2022/05/18/whats-new-red-hat-enterprise-linux-9
https://developers.redhat.com/articles/2022/05/18/whats-new-red-hat-enterprise-linux-9

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#22

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Jeff Davis (#20)

Re: pg_collation.collversion for C.UTF-8

On Tue, Jun 20, 2023 at 6:48 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote:

Would it be correct to interpret LC_COLLATE=C.UTF-8 as
LC_COLLATE=C,
but leave LC_CTYPE=C.UTF-8 as-is?

Yes. The basic idea, at least for these two OSes, is that every
category behaves as if set to C, except LC_CTYPE.

If that's true, and we version C.UTF-8, then users could still get the
behavior they want, a stable collation order, and benefit from the
optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8.

The only caveat is to be careful with things that depend on ctype in
indexes and constraints. While still a problem, it's a smaller problem
than unversioned collation. We should think a little more about solving
it, because I think there's a strong case to be made that a default
collation of C and a database ctype of something else is a good
combination (it makes less sense for a case-insensitive collation, but
those aren't allowed as a default collation).

In any case, we're better off following the rule "version anything that
goes to any external provider, period". And by "version", I really mean
a best effort, because we don't always have great information, but I
think it's better to record what we do have than not. We have just seen
too many examples of weird behavior. On top of that, it's simply
inconsistent to assume that C=C.UTF-8 for collation version, but not
for the collation implementation.

Yeah, OK, you're convincing me. It's hard to decide because our model
is basically wrong so it's only warning you about potential ctype
changes by happy coincidence, but even in respect of sort order it was
probably a mistake to start second-guessing what libc is doing, and
with that observation about the C/C.UTF-8 combination, at least an
end-user has a way to opt in/out of this choice. I'll try to write a
concise commit message for Daniel's patch explaining all this and we
can see about squeaking it into beta2.

Use rs might get frustrated that the collation for C.UTF-8 is versioned,
of course. But I don't think it will affect anyone for quite some time,
because existing users will have a datcollversion=NULL; so they won't
get the warnings until they refresh the versions (or create new
collations/databases), and then after that upgrade libc. Right? So they
should have time to adjust to use LC_COLLATE=C if that's what they
want.

Yeah.

An alternative would be to define lc_collate_is_c("C.UTF-8") == true
while leaving lc_ctype_is_c("C.UTF-8") == false and
get_collation_actual_version("C.UTF-8") == NULL. In that case we would
not be passing it to an external provider, so we don't have to version
it. But that might be a little too magical and I'm not inclined to do
that.

Agreed, let's not do any more of that sort of thing.

Another alternative would be to implement C.UTF-8 internally according
to the "true" semantics, if they are truly simple and well-defined and
stable. But I don't think ctype=C.UTF-8 is actually stable because new
characters can be added, right?

Correct.