Rework of collation code, extensibility

Started by Jeff Davis about 3 years ago, 24 messages

pgsql@j-davis.com

about 3 years ago

6 attachment(s)

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code than just a
refactoring. This is a continuation of some prior work [1][2], moved to
a new thread given its new scope.

Benefits:

1. Clearer division of responsibilities.
2. More consistency between the libc and ICU providers.
3. Hooks that allow extensions to replace collation provider libraries.
4. New tests for the collation provider library hooks.

There are a lot of changes, and still some loose ends, but I believe a
few of these patches are close to ready.

This set of changes does not express an opinion on how we might want to
support multiple provider libraries in core; but whatever we choose, it
should be easier to accomplish. Right now, the hooks have limited
information on which to make the choice for a specific version of a
collation provider library, but that's because there's limited
information in the catalog. If the discussion here [3] concludes with
adding collation provider library or library version information to the
catalog, we can add additional parameters to the hooks.

[1]: /messages/by-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com
[2]: /messages/by-id/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com
[3]: /messages/by-id/CA+hUKGLEqMhnpZrgAcisoUeYFGz8W6EWdhtK2h-4QN0iOSFRqw@mail.gmail.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS
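
To illustrate the length-based comparison that the first attached patch
centralizes in pg_strncoll(), here is a minimal standalone sketch using
only libc. It is not the patch's code: sketch_strncoll() and
SKETCH_BUFLEN are hypothetical names, malloc() stands in for palloc(),
and the tie-break is applied unconditionally, as it would be for a
deterministic collation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical; plays the role of TEXTBUFLEN in the patch. */
#define SKETCH_BUFLEN 1024

/*
 * Compare two length-delimited byte strings with strcoll(), then break
 * ties bytewise and by length. Returns <0, 0, or >0.
 */
static int
sketch_strncoll(const char *arg1, size_t len1,
				const char *arg2, size_t len2)
{
	char		sbuf[SKETCH_BUFLEN];
	char	   *buf = sbuf;
	size_t		bufsize1 = len1 + 1;
	size_t		bufsize2 = len2 + 1;
	char	   *arg1n;
	char	   *arg2n;
	int			result;

	assert(arg1 != NULL && arg2 != NULL);

	/* One buffer holds both copies; use the heap only if it is too big. */
	if (bufsize1 + bufsize2 > SKETCH_BUFLEN)
	{
		buf = malloc(bufsize1 + bufsize2);
		if (buf == NULL)
			abort();
	}

	arg1n = buf;
	arg2n = buf + bufsize1;

	/* strcoll() needs NUL-terminated input, so copy and terminate. */
	memcpy(arg1n, arg1, len1);
	arg1n[len1] = '\0';
	memcpy(arg2n, arg2, len2);
	arg2n[len2] = '\0';

	result = strcoll(arg1n, arg2n);

	if (buf != sbuf)
		free(buf);

	/* Tie-break on the original bytes first, then on length. */
	if (result == 0)
	{
		size_t		minlen = (len1 < len2) ? len1 : len2;

		result = memcmp(arg1, arg2, minlen);
		if (result == 0 && len1 != len2)
			result = (len1 < len2) ? -1 : 1;
	}

	return result;
}
```

A caller that already has NUL-terminated strings would skip the copy
entirely, which is the reason the patch offers pg_strcoll() alongside
pg_strncoll().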

Attachments:

v4-0001-Add-pg_strcoll-and-pg_strncoll.patch (text/x-patch; charset=UTF-8)

From f608d0ff0e8cad3f87c1b71cdd8796a6853c1936 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v4 1/6] Add pg_strcoll() and pg_strncoll().

Callers with NUL-terminated strings should call the former; callers
with strings and their length should call the latter.
---
 src/backend/utils/adt/pg_locale.c | 420 ++++++++++++++++++++++++++++--
 src/backend/utils/adt/varlena.c   | 230 +---------------
 src/include/utils/pg_locale.h     |   3 +
 3 files changed, 406 insertions(+), 247 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..6cd629ecb4 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -123,6 +129,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1731,15 +1750,356 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * pg_strncoll_libc_win32_utf8
+ *
+ * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
+ * invoke wcscoll() or wcscoll_l().
+ */
+#ifdef WIN32
+static int
+pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
+							size_t len2, pg_locale_t locale)
+{
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
+	char	   *a1p,
+			   *a2p;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
+	int			r;
+	int			result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+#ifndef WIN32
+	Assert(false);
+#endif
+
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * pg_strcoll_libc
+ *
+ * Call strcoll(), strcoll_l(), wcscoll(), or wcscoll_l() as appropriate for
+ * the given locale, platform, and database encoding. If the locale is NULL,
+ * use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ */
+static int
+pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+#ifdef WIN32
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_libc
+ *
+ * Null-terminate the arguments and call pg_strcoll_libc().
+ */
+static int
+pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	  = sbuf;
+	size_t	 bufsize1 = len1 + 1;
+	size_t	 bufsize2 = len2 + 1;
+	char	*arg1n;
+	char	*arg2n;
+	int		 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef WIN32
+	/* check for this case before doing the work for nul-termination */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+#endif							/* WIN32 */
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	arg1n = buf;
+	arg2n = buf + bufsize1;
+
+	/* nul-terminate arguments */
+	memcpy(arg1n, arg1, len1);
+	arg1n[len1] = '\0';
+	memcpy(arg2n, arg2, len2);
+	arg2n[len2] = '\0';
+
+	result = pg_strcoll_libc(arg1n, arg2n, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
 
 #ifdef USE_ICU
+
 /*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
+ * pg_strncoll_icu_no_utf8
+ *
+ * Convert the arguments from the database encoding to UChar strings, then
+ * call ucol_strcoll().
+ *
+ * When the database encoding is UTF-8, and ICU supports ucol_strcollUTF8(),
+ * caller should call that instead.
  */
-static UConverter *icu_converter = NULL;
+static int
+pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
+						const char *arg2, size_t len2, pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+#endif
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
 
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_icu
+ *
+ * Call ucol_strcollUTF8() or ucol_strcoll() as appropriate for the given
+ * database encoding.
+ *
+ * Arguments must be encoded in the database encoding.
+ */
+static int
+pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				pg_locale_t locale)
+{
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		result = pg_strncoll_icu_no_utf8(arg1, len1, arg2, len2, locale);
+	}
+
+	return result;
+}
+
+/*
+ * pg_strcoll_icu
+ *
+ * Calculate the string lengths and call pg_strncoll_icu().
+ */
+static int
+pg_strcoll_icu(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strncoll_icu(arg1, -1, arg2, -1, locale);
+}
+
+#endif							/* USE_ICU */
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * If the collation is deterministic, break ties with strcmp().
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strcoll_libc(arg1, arg2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strcoll_icu(arg1, arg2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+		result = strcmp(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * If the collation is deterministic, break ties with memcmp(), and then with
+ * the string length.
+ *
+ * This function may need to nul-terminate the arguments for libc functions;
+ * so if the caller already has nul-terminated strings, it should call
+ * pg_strcoll() instead.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strncoll_libc(arg1, len1, arg2, len2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strncoll_icu(arg1, len1, arg2, len2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+	{
+		result = memcmp(arg1, arg2, Min(len1, len2));
+		if ((result == 0) && (len1 != len2))
+			result = (len1 < len2) ? -1 : 1;
+	}
+
+	return result;
+}
+
+
+#ifdef USE_ICU
 static void
 init_icu_converter(void)
 {
@@ -1767,6 +2127,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	status = U_ZERO_ERROR;
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -1782,26 +2175,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 1c52deec55..52dd0bbba8 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1537,10 +1537,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1557,171 +1553,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
-
-		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 	}
 
 	return result;
@@ -2379,65 +2211,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
-
-	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
-		result = strcmp(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
 	sss->cache_blob = false;
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index a875942123..bf70ae08ca 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -100,6 +100,9 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
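
The next patch sizes a sort-key buffer by calling pg_strnxfrm() with a
zero-length destination and then calling it again with a buffer of the
reported size. A minimal sketch of the same two-call pattern with plain
libc strxfrm() follows. sketch_sort_key() is a hypothetical helper, not
part of the series. Note one difference that is specific to libc:
strxfrm() returns the length excluding the terminating NUL, so the
buffer needs one extra byte.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd sort key for the NUL-terminated string 'src' and
 * store its length (excluding the terminating NUL) in *keylen.
 */
static char *
sketch_sort_key(const char *src, size_t *keylen)
{
	size_t		needed;
	size_t		written;
	char	   *key;

	assert(src != NULL && keylen != NULL);

	/* First call: with destsize 0, dest may be NULL; only the size is returned. */
	needed = strxfrm(NULL, src, 0);

	key = malloc(needed + 1);
	if (key == NULL)
		abort();

	/* Second call: fill the buffer and check that the size is unchanged. */
	written = strxfrm(key, src, needed + 1);
	if (written != needed)
		abort();

	*keylen = written;
	return key;
}
```

Comparing two such keys with strcmp() should order them the same way
strcoll() orders the original strings, which is the property the
abbreviated-key code relies on.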

v4-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch (text/x-patch; charset=UTF-8)

From fab2a5c30b560c59f032c468343027a8ca69ac6b Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:41:38 -0800
Subject: [PATCH v4 2/6] Add pg_strxfrm() and pg_strxfrm_prefix().

Callers with a NUL-terminated string should call the former; callers
with a string and length should call the latter.

Also remove the TRUST_STRXFRM define, and replace with a developer GUC
for easier testing.
---
 src/backend/access/hash/hashfunc.c  |  45 ++--
 src/backend/utils/adt/pg_locale.c   | 381 ++++++++++++++++++++++++++++
 src/backend/utils/adt/varchar.c     |  41 ++-
 src/backend/utils/adt/varlena.c     | 142 +++--------
 src/backend/utils/misc/guc_tables.c |  11 +
 src/include/utils/pg_locale.h       |  11 +
 6 files changed, 481 insertions(+), 150 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index f890f79ee1..b8136e496f 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -291,21 +291,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -349,21 +347,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 6cd629ecb4..663286163e 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -94,6 +94,9 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool trust_strxfrm = false;
+
 /*
  * lc_time localization cache.
  *
@@ -2099,6 +2102,384 @@ pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 }
 
 
+static size_t
+pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
+				pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef TRUST_STRXFRM
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		return strxfrm_l(dest, src, destsize, locale->info.lt);
+	else
+#endif
+		return strxfrm(dest, src, destsize);
+#else
+	/* shouldn't happen */
+	elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+}
+
+static size_t
+pg_strnxfrm_libc(char *dest, const char *src, size_t srclen, size_t destsize,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	 = sbuf;
+	size_t	 bufsize = srclen + 1;
+	size_t	 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+	if (bufsize > TEXTBUFLEN)
+		buf = palloc(bufsize);
+
+	/* nul-terminate arguments */
+	memcpy(buf, src, srclen);
+	buf[srclen] = '\0';
+
+	result = pg_strxfrm_libc(dest, buf, destsize, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_libc(char *dest, const char *src, size_t destsize,
+					   pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strxfrm_prefix()",
+		 locale->provider);
+}
+
+static size_t
+pg_strnxfrm_prefix_libc(char *dest, const char *src, size_t srclen,
+						size_t destsize, pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strnxfrm_prefix()",
+		 locale->provider);
+}
+
+#ifdef USE_ICU
+
+static size_t
+pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
+				pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	= sbuf;
+	UChar	*uchar;
+	int32_t	 ulen;
+	size_t   uchar_bsize;
+	Size	 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
+								   uchar, ulen,
+								   (uint8_t *) dest, destsize);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result_bsize;
+}
+
+static size_t
+pg_strxfrm_icu(char *dest, const char *src, size_t destsize,
+			   pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_icu(dest, src, -1, destsize, locale);
+}
+
+static size_t
+pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
+							   size_t destsize, pg_locale_t locale)
+{
+	char			 sbuf[TEXTBUFLEN];
+	char			*buf   = sbuf;
+	UCharIterator	 iter;
+	uint32_t		 state[2];
+	UErrorCode		 status;
+	int32_t			 ulen  = -1;
+	UChar			*uchar = NULL;
+	size_t			 uchar_bsize;
+	Size			 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	uiter_setString(&iter, uchar, ulen);
+	state[0] = state[1] = 0;	/* won't need that again */
+	status = U_ZERO_ERROR;
+	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
+										&iter,
+										state,
+										(uint8_t *) dest,
+										destsize,
+										&status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("sort key generation failed: %s",
+						u_errorName(status))));
+
+	return result_bsize;
+}
+
+static size_t
+pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
+					   size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UCharIterator iter;
+		uint32_t	state[2];
+		UErrorCode	status;
+
+		uiter_setUTF8(&iter, src, srclen);
+		state[0] = state[1] = 0;	/* won't need that again */
+		status = U_ZERO_ERROR;
+		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
+									  &iter,
+									  state,
+									  (uint8_t *) dest,
+									  destsize,
+									  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("sort key generation failed: %s",
+							u_errorName(status))));
+	}
+	else
+		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
+												locale);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
+					  pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_prefix_icu(dest, src, -1, destsize, locale);
+}
+
+#endif
+
+/*
+ * Return true if the collation provider supports pg_strxfrm() and
+ * pg_strnxfrm(); otherwise false.
+ *
+ * Unfortunately, it seems that strxfrm() for non-C collations is broken on
+ * many common platforms; testing of multiple versions of glibc reveals that,
+ * for many locales, strcoll() and strxfrm() do not return consistent
+ * results. While no libc other than Cygwin has so far been shown to
+ * have a problem, we take the conservative course of action for right now and
+ * disable this categorically.  (Users who are certain this isn't a problem on
+ * their system can set the developer GUC "trust_strxfrm".)
+ *
+ * No similar problem is known for the ICU provider.
+ */
+bool
+pg_strxfrm_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return trust_strxfrm;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ */
+size_t
+pg_strxfrm(char *dest, const char *src, size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm() instead.
+ */
+size_t
+pg_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen,
+			pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * Return true if the collation provider supports pg_strxfrm_prefix() and
+ * pg_strnxfrm_prefix(); otherwise false.
+ */
+bool
+pg_strxfrm_prefix_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return false;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ */
+size_t
+pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+				  pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_prefix_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_prefix_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * 'src' need not be nul-terminated; its length is given by 'srclen'.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm_prefix() instead.
+ */
+size_t
+pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+				   size_t srclen, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_prefix_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_prefix_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
 #ifdef USE_ICU
 static void
 init_icu_converter(void)
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 01a2db6b23..1963d4dc4e 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1024,21 +1024,17 @@ hashbpchar(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -1086,21 +1082,18 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 52dd0bbba8..8da545841f 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1889,20 +1889,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
@@ -1918,13 +1904,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 
 	/*
 	 * Unfortunately, it seems that abbreviation for non-C collations is
-	 * broken on many common platforms; testing of multiple versions of glibc
-	 * reveals that, for many locales, strcoll() and strxfrm() do not return
-	 * consistent results, which is fatal to this optimization.  While no
-	 * other libc other than Cygwin has so far been shown to have a problem,
-	 * we take the conservative course of action for right now and disable
-	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * broken on many common platforms; see pg_strxfrm_enabled().
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -1937,10 +1917,8 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!collate_c && !pg_strxfrm_enabled(locale))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
@@ -2229,6 +2207,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 static Datum
 varstr_abbrev_convert(Datum original, SortSupport ssup)
 {
+	const size_t max_prefix_bytes = sizeof(Datum);
 	VarStringSortSupport *sss = (VarStringSortSupport *) ssup->ssup_extra;
 	VarString  *authoritative = DatumGetVarStringPP(original);
 	char	   *authoritative_data = VARDATA_ANY(authoritative);
@@ -2241,7 +2220,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
-	memset(pres, 0, sizeof(Datum));
+	memset(pres, 0, max_prefix_bytes);
 	len = VARSIZE_ANY_EXHDR(authoritative);
 
 	/* Get number of bytes, ignoring trailing spaces */
@@ -2276,14 +2255,10 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 	 * thing: explicitly consider string length.
 	 */
 	if (sss->collate_c)
-		memcpy(pres, authoritative_data, Min(len, sizeof(Datum)));
+		memcpy(pres, authoritative_data, Min(len, max_prefix_bytes));
 	else
 	{
 		Size		bsize;
-#ifdef USE_ICU
-		int32_t		ulen = -1;
-		UChar	   *uchar = NULL;
-#endif
 
 		/*
 		 * We're not using the C collation, so fall back on strxfrm or ICU
@@ -2301,7 +2276,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		if (sss->last_len1 == len && sss->cache_blob &&
 			memcmp(sss->buf1, authoritative_data, len) == 0)
 		{
-			memcpy(pres, sss->buf2, Min(sizeof(Datum), sss->last_len2));
+			memcpy(pres, sss->buf2, Min(max_prefix_bytes, sss->last_len2));
 			/* No change affecting cardinality, so no hashing required */
 			goto done;
 		}
@@ -2309,81 +2284,49 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		memcpy(sss->buf1, authoritative_data, len);
 
 		/*
-		 * Just like strcoll(), strxfrm() expects a NUL-terminated string. Not
-		 * necessary for ICU, but doesn't hurt.
+		 * pg_strxfrm() and pg_strxfrm_prefix() expect NUL-terminated
+		 * strings.
 		 */
 		sss->buf1[len] = '\0';
 		sss->last_len1 = len;
 
-#ifdef USE_ICU
-		/* When using ICU and not UTF8, convert string to UChar. */
-		if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU &&
-			GetDatabaseEncoding() != PG_UTF8)
-			ulen = icu_to_uchar(&uchar, sss->buf1, len);
-#endif
-
-		/*
-		 * Loop: Call strxfrm() or ucol_getSortKey(), possibly enlarge buffer,
-		 * and try again.  Both of these functions have the result buffer
-		 * content undefined if the result did not fit, so we need to retry
-		 * until everything fits, even though we only need the first few bytes
-		 * in the end.  When using ucol_nextSortKeyPart(), however, we only
-		 * ask for as many bytes as we actually need.
-		 */
-		for (;;)
+		if (pg_strxfrm_prefix_enabled(sss->locale))
 		{
-#ifdef USE_ICU
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+			if (sss->buflen2 < max_prefix_bytes)
 			{
-				/*
-				 * When using UTF8, use the iteration interface so we only
-				 * need to produce as many bytes as we actually need.
-				 */
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UCharIterator iter;
-					uint32_t	state[2];
-					UErrorCode	status;
-
-					uiter_setUTF8(&iter, sss->buf1, len);
-					state[0] = state[1] = 0;	/* won't need that again */
-					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
-				}
-				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+				sss->buflen2 = Max(max_prefix_bytes,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
 			}
-			else
-#endif
-#ifdef HAVE_LOCALE_T
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_LIBC)
-				bsize = strxfrm_l(sss->buf2, sss->buf1,
-								  sss->buflen2, sss->locale->info.lt);
-			else
-#endif
-				bsize = strxfrm(sss->buf2, sss->buf1, sss->buflen2);
-
-			sss->last_len2 = bsize;
-			if (bsize < sss->buflen2)
-				break;
 
+			bsize = pg_strxfrm_prefix(sss->buf2, sss->buf1, max_prefix_bytes, sss->locale);
+			sss->last_len2 = bsize;	/* the cache-hit path above reads last_len2 */
+		}
+		else
+		{
 			/*
-			 * Grow buffer and retry.
+			 * Loop: Call pg_strxfrm(), possibly enlarge buffer, and try
+			 * again.  The pg_strxfrm() function leaves the result buffer
+			 * content undefined if the result did not fit, so we need to
+			 * retry until everything fits, even though we only need the first
+			 * few bytes in the end.
 			 */
-			sss->buflen2 = Max(bsize + 1,
-							   Min(sss->buflen2 * 2, MaxAllocSize));
-			sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			for (;;)
+			{
+				bsize = pg_strxfrm(sss->buf2, sss->buf1, sss->buflen2,
+								   sss->locale);
+
+				sss->last_len2 = bsize;
+				if (bsize < sss->buflen2)
+					break;
+
+				/*
+				 * Grow buffer and retry.
+				 */
+				sss->buflen2 = Max(bsize + 1,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			}
 		}
 
 		/*
@@ -2395,12 +2338,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		 * (Actually, even if there were NUL bytes in the blob it would be
 		 * okay.  See remarks on bytea case above.)
 		 */
-		memcpy(pres, sss->buf2, Min(sizeof(Datum), bsize));
-
-#ifdef USE_ICU
-		if (uchar)
-			pfree(uchar);
-#endif
+		memcpy(pres, sss->buf2, Min(max_prefix_bytes, bsize));
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1bf14eec66..c4a6c5cb83 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1933,6 +1933,17 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"trust_strxfrm", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Allow use of strxfrm() for abbreviated keys optimization for libc provider."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&trust_strxfrm,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
 			gettext_noop("Whether to continue running after a failure to sync data files."),
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index bf70ae08ca..2bd96ab7e1 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -40,6 +40,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT bool trust_strxfrm;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -103,6 +104,16 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
 extern int pg_strncoll(const char *arg1, size_t len1,
 					   const char *arg2, size_t len2, pg_locale_t locale);
+extern bool pg_strxfrm_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
+						 pg_locale_t locale);
+extern size_t pg_strnxfrm(char *dest, size_t destsize, const char *src,
+						  size_t srclen, pg_locale_t locale);
+extern bool pg_strxfrm_prefix_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+								pg_locale_t locale);
+extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+								 size_t srclen, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
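
The hash functions in the patch above call pg_strnxfrm() twice: once with a
zero-sized destination to learn the required size, and once more to fill
the buffer. varstr_abbrev_convert() then keeps only the first
sizeof(Datum) bytes of the key. A minimal sketch of that pattern follows,
using plain strxfrm() from the C library as a stand-in. The names
xfrm_two_pass(), abbrev_from_key() and ABBREV_BYTES are hypothetical and
not part of the patch. Note one assumed difference: strxfrm() reports a
length that excludes the terminating nul, so the sketch allocates one
extra byte, whereas the patch allocates exactly the size returned by
pg_strnxfrm(). The program runs in the "C" locale unless setlocale() is
called first.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for sizeof(Datum); an assumption made for illustration only. */
#define ABBREV_BYTES sizeof(void *)

/*
 * Hypothetical helper: "measure, allocate, transform" with strxfrm().
 * Returns a malloc'd, nul-terminated sort key and stores its length
 * (excluding the nul) in *keylen.  Returns NULL on allocation failure.
 */
char *
xfrm_two_pass(const char *src, size_t *keylen)
{
	size_t		needed;
	size_t		written;
	char	   *buf;

	/* Pass 1: a zero-sized destination only reports the required length. */
	needed = strxfrm(NULL, src, 0);

	/* strxfrm() does not count the terminating nul, so add one byte. */
	buf = malloc(needed + 1);
	if (buf == NULL)
		return NULL;

	/* Pass 2: the buffer is large enough, so the result is complete. */
	written = strxfrm(buf, src, needed + 1);
	assert(written == needed);

	*keylen = written;
	return buf;
}

/*
 * Hypothetical helper: pack the leading bytes of a sort key into a
 * zero-padded, fixed-size abbreviated key that can be compared with
 * memcmp().
 */
void
abbrev_from_key(unsigned char *abbrev, const char *key, size_t keylen)
{
	size_t		n = keylen < ABBREV_BYTES ? keylen : ABBREV_BYTES;

	memset(abbrev, 0, ABBREV_BYTES);	/* unused bytes stay NUL */
	memcpy(abbrev, key, n);
}
```

Equal abbreviated keys only mean the full keys share a prefix, so a caller
still needs a full comparison to break ties.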

Attachment: v4-0003-Refactor-pg_locale_t-routines.patch (text/x-patch; charset=UTF-8)

From 41cd639112c41cce7eabd06569fd6cf85d04cd40 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 5 Dec 2022 10:43:52 -0800
Subject: [PATCH v4 3/6] Refactor pg_locale_t routines.

  * add pg_locale_internal.h to hide pg_locale_struct
  * move info.lt into info.libc.lt to match icu
  * introduce init_default_locale()
  * introduce pg_locale_deterministic() accessor
  * make default_locale a static global in pg_locale.c
  * refactor pg_newlocale_from_collation()
---
 src/backend/access/hash/hashfunc.c     |  82 +++---
 src/backend/commands/collationcmds.c   |   1 +
 src/backend/regex/regc_pg_locale.c     |  45 ++--
 src/backend/utils/adt/formatting.c     |  25 +-
 src/backend/utils/adt/like.c           |   3 +-
 src/backend/utils/adt/like_support.c   |   3 +-
 src/backend/utils/adt/pg_locale.c      | 342 +++++++++++++++----------
 src/backend/utils/adt/varchar.c        |  62 ++---
 src/backend/utils/adt/varlena.c        |   8 +-
 src/backend/utils/init/postinit.c      |  29 ++-
 src/include/utils/pg_locale.h          |  55 +---
 src/include/utils/pg_locale_internal.h |  68 +++++
 12 files changed, 400 insertions(+), 323 deletions(-)
 create mode 100644 src/include/utils/pg_locale_internal.h
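
Two of the items in the commit message above, hiding pg_locale_struct
behind pg_locale_internal.h and adding a pg_locale_deterministic()
accessor, amount to the usual opaque-handle pattern in C. The sketch below
illustrates the idea only. All demo_* names and the struct layout are
hypothetical, not the patch's definitions. The NULL handling is inferred
from the condition the accessor replaces in hashtext(),
"!mylocale || mylocale->deterministic".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a public header might expose: an opaque handle and an accessor. */
typedef struct demo_locale_struct *demo_locale_t;
bool		demo_locale_deterministic(demo_locale_t locale);

/* What an internal header might define; only provider code includes it. */
struct demo_locale_struct
{
	char		provider;		/* e.g. 'c' for libc, 'i' for ICU */
	bool		deterministic;
};

/* A NULL handle is treated as deterministic, as in the replaced test. */
bool
demo_locale_deterministic(demo_locale_t locale)
{
	return locale == NULL || locale->deterministic;
}

/* Usage sketch: callers ask the accessor instead of reading fields. */
void
demo_check(void)
{
	struct demo_locale_struct nondet = {'i', false};

	assert(demo_locale_deterministic(NULL));
	assert(!demo_locale_deterministic(&nondet));
}
```

With this split, a caller such as a hash function needs only the public
header, and the struct layout can change without touching it.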

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b8136e496f..6d9f014c5b 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -281,36 +281,28 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 						  VARSIZE_ANY_EXHDR(key));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -336,7 +328,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
 								   VARSIZE_ANY_EXHDR(key),
@@ -344,30 +336,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
+
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 81e54e0ce6..9e84da4891 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -36,6 +36,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 02d462a659..ac05efb558 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -17,6 +17,7 @@
 
 #include "catalog/pg_collation.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 /*
  * To provide as much functionality as possible on a variety of platforms,
@@ -306,13 +307,13 @@ pg_wc_isdigit(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswdigit_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isdigit_l((unsigned char) c, pg_regex_locale->info.lt));
+					isdigit_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -342,13 +343,13 @@ pg_wc_isalpha(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalpha_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalpha_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalpha_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalpha_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -378,13 +379,13 @@ pg_wc_isalnum(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalnum_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalnum_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalnum_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -423,13 +424,13 @@ pg_wc_isupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isupper_l((unsigned char) c, pg_regex_locale->info.lt));
+					isupper_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -459,13 +460,13 @@ pg_wc_islower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					islower_l((unsigned char) c, pg_regex_locale->info.lt));
+					islower_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -495,13 +496,13 @@ pg_wc_isgraph(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswgraph_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswgraph_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isgraph_l((unsigned char) c, pg_regex_locale->info.lt));
+					isgraph_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -531,13 +532,13 @@ pg_wc_isprint(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswprint_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswprint_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isprint_l((unsigned char) c, pg_regex_locale->info.lt));
+					isprint_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -567,13 +568,13 @@ pg_wc_ispunct(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswpunct_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					ispunct_l((unsigned char) c, pg_regex_locale->info.lt));
+					ispunct_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -603,13 +604,13 @@ pg_wc_isspace(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswspace_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswspace_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isspace_l((unsigned char) c, pg_regex_locale->info.lt));
+					isspace_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -647,13 +648,13 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return towupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
+				return toupper_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
@@ -691,13 +692,13 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return towlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
+				return tolower_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 65746c48d2..3067431cbe 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -88,6 +88,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 
 /* ----------
@@ -1576,7 +1577,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 	status = U_ZERO_ERROR;
 	len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-					mylocale->info.icu.locale, &status);
+					mylocale->ctype, &status);
 	if (status == U_BUFFER_OVERFLOW_ERROR)
 	{
 		/* try again with adjusted length */
@@ -1584,7 +1585,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 		status = U_ZERO_ERROR;
 		len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-						mylocale->info.icu.locale, &status);
+						mylocale->ctype, &status);
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
@@ -1697,7 +1698,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towlower(workspace[curr_char]);
@@ -1730,7 +1731,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1819,7 +1820,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towupper(workspace[curr_char]);
@@ -1852,7 +1853,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.lt);
+						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1944,10 +1945,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
@@ -1989,10 +1990,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.lt);
-						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.lt);
+							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
 #endif
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 8e671b9fab..98714a0492 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -24,6 +24,7 @@
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 
 #define LIKE_TRUE						1
@@ -96,7 +97,7 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 		return pg_ascii_tolower(c);
 #ifdef HAVE_LOCALE_T
 	else if (locale)
-		return tolower_l(c, locale->info.lt);
+		return tolower_l(c, locale->info.libc.lt);
 #endif
 	else
 		return pg_tolower(c);
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2d3aaaaf6b..28d23ac3ab 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -52,6 +52,7 @@
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/selfuncs.h"
 #include "utils/varlena.h"
 
@@ -1511,7 +1512,7 @@ pattern_char_isalpha(char c, bool is_multibyte,
 			(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
 #ifdef HAVE_LOCALE_T
 	else if (locale && locale->provider == COLLPROVIDER_LIBC)
-		return isalpha_l((unsigned char) c, locale->info.lt);
+		return isalpha_l((unsigned char) c, locale->info.libc.lt);
 #endif
 	else
 		return isalpha((unsigned char) c);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 663286163e..d8df2a3832 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -65,6 +65,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/syscache.h"
 
 #ifdef USE_ICU
@@ -131,6 +132,11 @@ static HTAB *collation_cache = NULL;
 static char *IsoLocaleName(const char *);
 #endif
 
+/*
+ * Database default locale.
+ */
+static pg_locale_t default_locale = NULL;
+
 #ifdef USE_ICU
 /*
  * Converter object for converting between ICU's UChar strings and C strings
@@ -1336,7 +1342,7 @@ lc_collate_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1389,7 +1395,7 @@ lc_ctype_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1420,38 +1426,6 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
-struct pg_locale_struct default_locale;
-
-void
-make_icu_collator(const char *iculocstr,
-				  struct pg_locale_struct *resultp)
-{
-#ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
-
-	/* We will leak this string if the caller errors later :-( */
-	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
-	resultp->info.icu.ucol = collator;
-#else							/* not USE_ICU */
-	/* could get here if a collation was created by a build with ICU */
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif							/* not USE_ICU */
-}
-
-
 /* simple subroutine for reporting errors from newlocale() */
 #ifdef HAVE_LOCALE_T
 static void
@@ -1485,6 +1459,163 @@ report_newlocale_failure(const char *localename)
 #endif							/* HAVE_LOCALE_T */
 
 
+/*
+ * Construct a new pg_locale_t object.
+ *
+ * Passing NULL for the version is allowed; and even if it is specified, the
+ * result may or may not have an exactly matching version. Other parameters
+ * are required.
+ *
+ * Ordinarily, collate and ctype should be the same. If the provider is ICU,
+ * this is a requirement, and the 'collate' and 'ctype' arguments should both
+ * come from colliculocale (or daticulocale). If the provider is libc, the
+ * arguments should come from collcollate and collctype (or datcollate and
+ * datctype), respectively.
+ *
+ * Structures are allocated in TopMemoryContext, and the libc locale_t or
+ * UCollator is not allocated in any memory context, so the result is
+ * effectively permanent.
+ */
+static pg_locale_t
+pg_newlocale(char provider, bool deterministic, const char *collate,
+			 const char *ctype, const char *version)
+{
+	pg_locale_t result = MemoryContextAlloc(TopMemoryContext,
+											sizeof(struct pg_locale_struct));
+
+	/*
+	 * If COLLPROVIDER_DEFAULT, caller should use default_locale or NULL
+	 * instead.
+	 */
+	Assert(provider != COLLPROVIDER_DEFAULT);
+
+	result->provider = provider;
+	result->deterministic = deterministic;
+	result->collate = MemoryContextStrdup(TopMemoryContext, collate);
+	result->ctype = MemoryContextStrdup(TopMemoryContext, ctype);
+
+	if (provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		locale_t        loc;
+
+		/* newlocale's result may be leaked if we encounter an error */
+
+		if (strcmp(collate, ctype) == 0)
+		{
+			/* Normal case where they're the same */
+			errno = 0;
+#ifndef WIN32
+			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+							NULL);
+#else
+			loc = _create_locale(LC_ALL, collate);
+#endif
+			if (!loc)
+				report_newlocale_failure(collate);
+		}
+		else
+		{
+#ifndef WIN32
+			/* We need two newlocale() steps */
+			locale_t	loc1;
+
+			errno = 0;
+			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			if (!loc1)
+				report_newlocale_failure(collate);
+			errno = 0;
+			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			if (!loc)
+				report_newlocale_failure(ctype);
+#else
+
+			/*
+			 * XXX The _create_locale() API doesn't appear to support
+			 * this. Could perhaps be worked around by changing
+			 * pg_locale_t to contain two separate fields.
+			 */
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("collations with different collate and ctype values are not supported on this platform")));
+#endif
+		}
+
+		result->info.libc.lt = loc;
+#else							/* not HAVE_LOCALE_T */
+		/* platform that doesn't support locale_t */
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("collation provider LIBC is not supported on this platform")));
+#endif							/* not HAVE_LOCALE_T */
+	}
+#ifdef USE_ICU
+	else if (provider == COLLPROVIDER_ICU)
+	{
+		UCollator  *collator;
+		UErrorCode	status;
+
+		/* collator may be leaked if we encounter an error */
+
+		status = U_ZERO_ERROR;
+		collator = ucol_open(collate, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							collate, u_errorName(status))));
+
+		if (U_ICU_VERSION_MAJOR_NUM < 54)
+			icu_set_collation_attributes(collator, collate);
+
+		result->info.icu.ucol = collator;
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", provider);
+
+	return result;
+}
+
+/*
+ * Accessor so that callers don't need to include pg_locale_internal.h.
+ */
+bool
+pg_locale_deterministic(pg_locale_t locale)
+{
+	if (locale == NULL)
+		return true;
+	else
+		return locale->deterministic;
+}
+
+/*
+ * Initialize default database locale.
+ */
+void
+init_default_locale(char provider, const char *collate, const char *ctype,
+					const char *iculocale, const char *version)
+{
+	/*
+	 * For the purposes of pg_locale_t, if the provider is ICU, we use
+	 * iculocale for both collate and ctype.
+	 */
+	if (provider == COLLPROVIDER_ICU)
+	{
+		collate = iculocale;
+		ctype = iculocale;
+	}
+	else
+		Assert(iculocale == NULL);
+
+	/*
+	 * Default locale is currently always deterministic.  Nondeterministic
+	 * locales currently don't support pattern matching, which would break a
+	 * lot of things if applied globally.
+	 */
+	default_locale = pg_newlocale(provider, true, collate, ctype, version);
+}
+
 /*
  * Create a locale_t from a collation OID.  Results are cached for the
  * lifetime of the backend.  Thus, do not free the result with freelocale().
@@ -1509,8 +1640,8 @@ pg_newlocale_from_collation(Oid collid)
 
 	if (collid == DEFAULT_COLLATION_OID)
 	{
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return &default_locale;
+		if (default_locale->provider == COLLPROVIDER_ICU)
+			return default_locale;
 		else
 			return (pg_locale_t) 0;
 	}
@@ -1522,107 +1653,64 @@ pg_newlocale_from_collation(Oid collid)
 		/* We haven't computed this yet in this session, so do it */
 		HeapTuple	tp;
 		Form_pg_collation collform;
-		struct pg_locale_struct result;
-		pg_locale_t resultp;
+		pg_locale_t locale;
 		Datum		datum;
 		bool		isnull;
+		char	   *collate;
+		char	   *ctype;
+		char	   *collversionstr;
 
 		tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collid));
 		if (!HeapTupleIsValid(tp))
 			elog(ERROR, "cache lookup failed for collation %u", collid);
 		collform = (Form_pg_collation) GETSTRUCT(tp);
 
-		/* We'll fill in the result struct locally before allocating memory */
-		memset(&result, 0, sizeof(result));
-		result.provider = collform->collprovider;
-		result.deterministic = collform->collisdeterministic;
+		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
+								&isnull);
+		if (!isnull)
+			collversionstr = TextDatumGetCString(datum);
+		else
+			collversionstr = NULL;
 
 		if (collform->collprovider == COLLPROVIDER_LIBC)
 		{
-#ifdef HAVE_LOCALE_T
-			const char *collcollate;
-			const char *collctype pg_attribute_unused();
-			locale_t	loc;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate,
+									&isnull);
 			Assert(!isnull);
-			collcollate = TextDatumGetCString(datum);
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype, &isnull);
+			collate = TextDatumGetCString(datum);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype,
+									&isnull);
 			Assert(!isnull);
-			collctype = TextDatumGetCString(datum);
-
-			if (strcmp(collcollate, collctype) == 0)
-			{
-				/* Normal case where they're the same */
-				errno = 0;
-#ifndef WIN32
-				loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collcollate,
-								NULL);
-#else
-				loc = _create_locale(LC_ALL, collcollate);
-#endif
-				if (!loc)
-					report_newlocale_failure(collcollate);
-			}
-			else
-			{
-#ifndef WIN32
-				/* We need two newlocale() steps */
-				locale_t	loc1;
-
-				errno = 0;
-				loc1 = newlocale(LC_COLLATE_MASK, collcollate, NULL);
-				if (!loc1)
-					report_newlocale_failure(collcollate);
-				errno = 0;
-				loc = newlocale(LC_CTYPE_MASK, collctype, loc1);
-				if (!loc)
-					report_newlocale_failure(collctype);
-#else
-
-				/*
-				 * XXX The _create_locale() API doesn't appear to support
-				 * this. Could perhaps be worked around by changing
-				 * pg_locale_t to contain two separate fields.
-				 */
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("collations with different collate and ctype values are not supported on this platform")));
-#endif
-			}
-
-			result.info.lt = loc;
-#else							/* not HAVE_LOCALE_T */
-			/* platform that doesn't support locale_t */
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("collation provider LIBC is not supported on this platform")));
-#endif							/* not HAVE_LOCALE_T */
+			ctype = TextDatumGetCString(datum);
 		}
+#ifdef USE_ICU
 		else if (collform->collprovider == COLLPROVIDER_ICU)
 		{
-			const char *iculocstr;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale,
+									&isnull);
 			Assert(!isnull);
-			iculocstr = TextDatumGetCString(datum);
-			make_icu_collator(iculocstr, &result);
+			collate = TextDatumGetCString(datum);
+
+			/* for ICU, collate and ctype are both set from iculocale */
+			ctype = collate;
 		}
+#endif
+		else
+			/* shouldn't happen */
+			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
-								&isnull);
-		if (!isnull)
+		locale = pg_newlocale(collform->collprovider,
+							  collform->collisdeterministic,
+							  collate, ctype, collversionstr);
+
+		ReleaseSysCache(tp);
+
+		if (collversionstr != NULL)
 		{
 			char	   *actual_versionstr;
-			char	   *collversionstr;
 
-			collversionstr = TextDatumGetCString(datum);
-
-			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
-			Assert(!isnull);
+			actual_versionstr = get_collation_actual_version(collform->collprovider, collate);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
 			if (!actual_versionstr)
 			{
 				/*
@@ -1649,13 +1737,7 @@ pg_newlocale_from_collation(Oid collid)
 															NameStr(collform->collname)))));
 		}
 
-		ReleaseSysCache(tp);
-
-		/* We'll keep the pg_locale_t structures in TopMemoryContext */
-		resultp = MemoryContextAlloc(TopMemoryContext, sizeof(*resultp));
-		*resultp = result;
-
-		cache_entry->locale = resultp;
+		cache_entry->locale = locale;
 	}
 
 	return cache_entry->locale;
@@ -1815,7 +1897,7 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
 	else
 #endif
 		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
@@ -1858,7 +1940,7 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.lt);
+		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
@@ -2111,7 +2193,7 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 #ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.lt);
+		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
 	else
 #endif
 		return strxfrm(dest, src, destsize);
@@ -2717,8 +2799,8 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
+	UCollator	*collator;
+	UErrorCode   status;
 
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
@@ -2792,10 +2874,10 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.lt);
+		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.lt);
+		locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 		result = wcstombs(to, from, tolen);
 
@@ -2869,10 +2951,10 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.lt);
+			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.lt);
+			locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 			result = mbstowcs(to, str, tolen);
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 1963d4dc4e..9782d02113 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -762,7 +762,7 @@ bpchareq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -807,7 +807,7 @@ bpcharne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -1015,33 +1015,25 @@ hashbpchar(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) keydata, keylen);
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any((uint8_t *) buf, bsize);
+		result = hash_any((uint8_t *) buf, bsize);
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -1072,35 +1064,27 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) keydata, keylen,
 								   PG_GETARG_INT64(1));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 8da545841f..0408995791 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1205,7 +1205,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -1603,7 +1603,7 @@ texteq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1662,7 +1662,7 @@ textne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1776,7 +1776,7 @@ text_starts_with(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a990c833c5..ead7922c4e 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -317,6 +317,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	char	   *collate;
 	char	   *ctype;
 	char	   *iculocale;
+	char	   *collversionstr;
 
 	/* Fetch our pg_database row normally, via syscache */
 	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
@@ -424,35 +425,33 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_daticulocale, &isnull);
 		Assert(!isnull);
 		iculocale = TextDatumGetCString(datum);
-		make_icu_collator(iculocale, &default_locale);
 	}
 	else
 		iculocale = NULL;
 
-	default_locale.provider = dbform->datlocprovider;
+	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
+							&isnull);
+	if (!isnull)
+		collversionstr = TextDatumGetCString(datum);
+	else
+		collversionstr = NULL;
 
-	/*
-	 * Default locale is currently always deterministic.  Nondeterministic
-	 * locales currently don't support pattern matching, which would break a
-	 * lot of things if applied globally.
-	 */
-	default_locale.deterministic = true;
+	init_default_locale(dbform->datlocprovider, collate, ctype, iculocale,
+						collversionstr);
 
 	/*
 	 * Check collation version.  See similar code in
 	 * pg_newlocale_from_collation().  Note that here we warn instead of error
 	 * in any case, so that we don't prevent connecting.
 	 */
-	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
-							&isnull);
-	if (!isnull)
+	if (collversionstr != NULL)
 	{
 		char	   *actual_versionstr;
-		char	   *collversionstr;
 
-		collversionstr = TextDatumGetCString(datum);
+		actual_versionstr = get_collation_actual_version(
+			dbform->datlocprovider,
+			dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
@@ -470,6 +469,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 							 "or build PostgreSQL with the right library version.",
 							 quote_identifier(name))));
 	}
+	else
+		collversionstr = NULL;
 
 	/* Make the locale settings visible as GUC variables, too */
 	SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 2bd96ab7e1..79cfacb302 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,22 +15,6 @@
 #if defined(LOCALE_T_IN_XLOCALE) || defined(WCSTOMBS_L_IN_XLOCALE)
 #include <xlocale.h>
 #endif
-#ifdef USE_ICU
-#include <unicode/ucol.h>
-#endif
-
-#ifdef USE_ICU
-/*
- * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
- * (see
- * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
- */
-#if U_ICU_VERSION_MAJOR_NUM >= 53
-#define HAVE_UCOL_STRCOLLUTF8 1
-#else
-#undef HAVE_UCOL_STRCOLLUTF8
-#endif
-#endif
 
 /* use for libc locale names */
 #define LOCALE_NAME_BUFLEN 128
@@ -65,39 +49,12 @@ extern struct lconv *PGLC_localeconv(void);
 extern void cache_locale_time(void);
 
 
-/*
- * We define our own wrapper around locale_t so we can keep the same
- * function signatures for all builds, while not having to create a
- * fake version of the standard type locale_t in the global namespace.
- * pg_locale_t is occasionally checked for truth, so make it a pointer.
- */
-struct pg_locale_struct
-{
-	char		provider;
-	bool		deterministic;
-	union
-	{
-#ifdef HAVE_LOCALE_T
-		locale_t	lt;
-#endif
-#ifdef USE_ICU
-		struct
-		{
-			const char *locale;
-			UCollator  *ucol;
-		}			icu;
-#endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
-	}			info;
-};
-
 typedef struct pg_locale_struct *pg_locale_t;
 
-extern PGDLLIMPORT struct pg_locale_struct default_locale;
-
-extern void make_icu_collator(const char *iculocstr,
-							  struct pg_locale_struct *resultp);
-
+extern void init_default_locale(char provider, const char *collate,
+								const char *ctype, const char *iculocale,
+								const char *version);
+extern bool pg_locale_deterministic(pg_locale_t locale);
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
@@ -115,10 +72,6 @@ extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
 extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
-#ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
 extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
new file mode 100644
index 0000000000..33465ad92d
--- /dev/null
+++ b/src/include/utils/pg_locale_internal.h
@@ -0,0 +1,68 @@
+/*-----------------------------------------------------------------------
+ *
+ * PostgreSQL locale utilities
+ *
+ * src/include/utils/pg_locale_internal.h
+ *
+ * Copyright (c) 2002-2022, PostgreSQL Global Development Group
+ *
+ *-----------------------------------------------------------------------
+ */
+
+
+#ifndef _PG_LOCALE_INTERNAL_
+#define _PG_LOCALE_INTERNAL_
+
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
+
+#ifdef USE_ICU
+/*
+ * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
+ * (see
+ * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
+ */
+#if U_ICU_VERSION_MAJOR_NUM >= 53
+#define HAVE_UCOL_STRCOLLUTF8 1
+#else
+#undef HAVE_UCOL_STRCOLLUTF8
+#endif
+#endif
+
+/*
+ * We define our own wrapper around locale_t so we can keep the same
+ * function signatures for all builds, while not having to create a
+ * fake version of the standard type locale_t in the global namespace.
+ * pg_locale_t is occasionally checked for truth, so make it a pointer.
+ */
+struct pg_locale_struct
+{
+	char		provider;
+	bool		deterministic;
+	char	   *collate;
+	char	   *ctype;
+	union
+	{
+#ifdef HAVE_LOCALE_T
+		struct
+		{
+			locale_t	lt;
+		}			libc;
+#endif
+#ifdef USE_ICU
+		struct
+		{
+			UCollator	*ucol;
+		}			icu;
+#endif
+		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
+	}			info;
+};
+
+#ifdef USE_ICU
+extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+#endif
+
+#endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1
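
The patch above replaces open-coded checks such as `!mylocale || mylocale->deterministic` with the accessor pg_locale_deterministic(), so that callers no longer need the struct definition from pg_locale_internal.h. Below is a minimal standalone sketch of that pattern. The demo_* names are hypothetical stand-ins for illustration, not PostgreSQL symbols, and the struct is reduced to the two fields the sketch needs.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for struct pg_locale_struct; in the patch the real
 * definition lives in pg_locale_internal.h and callers only see the pointer
 * typedef.
 */
struct demo_locale
{
	char		provider;
	bool		deterministic;
};

typedef struct demo_locale *demo_locale_t;

/*
 * Same logic as pg_locale_deterministic() in the patch: a NULL locale
 * is treated as deterministic.
 */
bool
demo_locale_deterministic(demo_locale_t locale)
{
	if (locale == NULL)
		return true;
	return locale->deterministic;
}

/*
 * Caller-side test in the style of texteq()/bpchareq(): byte-wise equality
 * is only safe for the C locale or a deterministic collation.
 */
bool
demo_can_compare_bytewise(bool locale_is_c, demo_locale_t locale)
{
	return locale_is_c || demo_locale_deterministic(locale);
}
```

Folding the NULL check into the accessor is what lets call sites shrink from `!mylocale || mylocale->deterministic` to a single call.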

v4-0004-Support-multiple-ICU-collation-provider-libraries.patch (text/x-patch; charset=UTF-8)

From 15243bd82e44d83e90751c51e30dcce4e9c3a39f Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 7 Dec 2022 11:07:31 -0800
Subject: [PATCH v4 4/6] Support multiple ICU collation provider libraries.

Introduce the structure pg_icu_library, which holds a pointer to each
required ICU function, and store it as part of pg_locale_t. Each call
to an ICU function now goes through this structure, so that the builtin
library can more easily be replaced by a non-builtin ICU library.

Offer a hook to allow an extension to control which ICU library is
loaded for a given locale and version.

Author: Thomas Munro, Jeff Davis
---
 src/backend/commands/collationcmds.c   |  17 +-
 src/backend/utils/adt/formatting.c     |  67 ++++--
 src/backend/utils/adt/pg_locale.c      | 299 ++++++++++++++++++-------
 src/include/utils/pg_locale_internal.h | 121 +++++++++-
 4 files changed, 398 insertions(+), 106 deletions(-)
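
Note (not part of the commit): the commit message describes a table of function pointers plus a hook that selects a library for a given locale and version. Below is a minimal standalone sketch of that idea. Everything named demo_* is hypothetical, the real pg_icu_library has many more members, and the hook signature and the fall-back-to-builtin behavior when the hook returns NULL are assumptions made for illustration, not taken from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-reduced analogue of pg_icu_library. */
typedef struct demo_icu_library
{
	int			major_version;
	const char *(*errorName) (int status);
	int			(*countAvailable) (void);
} demo_icu_library;

static const char *
demo_error_name(int status)
{
	return status == 0 ? "U_ZERO_ERROR" : "U_OTHER_ERROR";
}

static int
demo_builtin_count(void)
{
	return 3;
}

static int
demo_alt_count(void)
{
	return 5;
}

/* The library linked at build time, and one an extension might supply. */
static demo_icu_library demo_builtin_library = {1, demo_error_name, demo_builtin_count};
static demo_icu_library demo_alt_library = {2, demo_error_name, demo_alt_count};

/* Assumed hook shape: choose a library from locale name and version string. */
typedef demo_icu_library *(*demo_get_icu_library_hook_type) (const char *locale,
															  const char *version);
demo_get_icu_library_hook_type demo_get_icu_library_hook = NULL;

/* Ask the hook first; use the builtin library if it declines (assumption). */
demo_icu_library *
demo_get_icu_library(const char *locale, const char *version)
{
	if (demo_get_icu_library_hook != NULL)
	{
		demo_icu_library *lib = demo_get_icu_library_hook(locale, version);

		if (lib != NULL)
			return lib;
	}
	return &demo_builtin_library;
}

/* Example hook an extension could install: serve version "2" only. */
demo_icu_library *
demo_example_hook(const char *locale, const char *version)
{
	(void) locale;				/* unused in this sketch */
	if (version != NULL && strcmp(version, "2") == 0)
		return &demo_alt_library;
	return NULL;
}
```

Call sites then use the table rather than the linked symbol, for example `demo_get_icu_library(loc, ver)->errorName(status)`, which corresponds to the `u_errorName(status)` to `iculib->errorName(status)` substitutions in the diff below.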

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 9e84da4891..4e90bfa65e 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -560,13 +560,14 @@ get_icu_language_tag(const char *localename)
 {
 	char		buf[ULOC_FULLNAME_CAPACITY];
 	UErrorCode	status;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
+	iculib->toLanguageTag(localename, buf, sizeof(buf), true, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
+						localename, iculib->errorName(status))));
 
 	return pstrdup(buf);
 }
@@ -585,11 +586,12 @@ get_icu_locale_comment(const char *localename)
 	int32		len_uchar;
 	int32		i;
 	char	   *result;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	len_uchar = uloc_getDisplayName(localename, "en",
-									displayname, lengthof(displayname),
-									&status);
+	len_uchar = iculib->getDisplayName(localename, "en",
+									   displayname, lengthof(displayname),
+									   &status);
 	if (U_FAILURE(status))
 		return NULL;			/* no good reason to raise an error */
 
@@ -809,12 +811,13 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 	{
 		int			i;
+		pg_icu_library *iculib = get_default_icu_library();
 
 		/*
 		 * Start the loop at -1 to sneak in the root locale without too much
 		 * code duplication.
 		 */
-		for (i = -1; i < uloc_countAvailable(); i++)
+		for (i = -1; i < iculib->countAvailable(); i++)
 		{
 			const char *name;
 			char	   *langtag;
@@ -825,7 +828,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			if (i == -1)
 				name = "";		/* ICU root locale */
 			else
-				name = uloc_getAvailable(i);
+				name = iculib->getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 3067431cbe..7f318fe2c7 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1565,6 +1565,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1572,6 +1577,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 {
 	UErrorCode	status;
 	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
 	len_dest = len_source;		/* try first with same length */
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
@@ -1589,18 +1595,42 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->ctype, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->ctype, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1666,11 +1696,12 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			int32_t		len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToLower, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1788,11 +1819,12 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToUpper, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1911,11 +1943,12 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case_bi(iculib->strToTitle, mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index d8df2a3832..5521fe11a5 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -70,6 +70,8 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -110,6 +112,41 @@ char	   *localized_full_days[7 + 1];
 char	   *localized_abbrev_months[12 + 1];
 char	   *localized_full_months[12 + 1];
 
+/*
+ * The precise version of a collation provider library is important, because
+ * subtle changes in collation between library versions can corrupt
+ * indexes. This hook allows control over how collation provider libraries are
+ * loaded.
+ *
+ * If the hook is not set, or if it returns NULL, Postgres constructs the
+ * pg_icu_library structure from the symbols Postgres is linked with at build
+ * time.
+ *
+ * If the hook is set, it can instead construct the pg_icu_library structure
+ * using custom logic. Ordinarily, this custom logic would involve finding a
+ * specific known version of the collation provider library, and dynamically
+ * loading the necessary symbols. If the collation version obtained from the
+ * library does not match the collation version recorded in the catalog,
+ * Postgres will issue a WARNING.
+ *
+ * The returned structure must be allocated in TopMemoryContext, and the
+ * associated symbols must remain valid permanently. It's expected that the
+ * hook will maintain a small finite list of libraries and return a pointer to
+ * the appropriate one, so that resource use remains bounded.
+ *
+ * XXX: For now, the only information the hook has access to is the ICU locale
+ * name, and the collation version as recorded in the catalog. We should
+ * consider what other information can be provided to allow for greater
+ * control over which library is loaded.
+ */
+#ifdef USE_ICU
+get_icu_library_hook_type get_icu_library_hook = NULL;
+#endif
+
+#ifdef USE_ICU
+static pg_icu_library *builtin_icu_library = NULL;
+#endif
+
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
 static bool CurrentLCTimeValid = false;
@@ -136,8 +173,12 @@ static char *IsoLocaleName(const char *);
  * Database default locale.
  */
 static pg_locale_t default_locale = NULL;
+#ifdef USE_ICU
+static pg_icu_library *default_icu = NULL;
+#endif
 
 #ifdef USE_ICU
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -145,13 +186,17 @@ static pg_locale_t default_locale = NULL;
  */
 static UConverter *icu_converter = NULL;
 
-static void init_icu_converter(void);
-static size_t uchar_length(UConverter *converter,
+static void init_icu_converter(pg_icu_library *iculib);
+static size_t uchar_length(pg_icu_library *iculib,
+						   UConverter *converter,
 						   const char *str, size_t len);
-static int32_t uchar_convert(UConverter *converter,
+static int32_t uchar_convert(pg_icu_library *iculib,
+							 UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, size_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(pg_icu_library *iculib,
+										 UCollator *collator,
+										 const char *loc);
 #endif
 
 /*
@@ -1458,6 +1503,78 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+#ifdef USE_ICU
+static pg_icu_library *
+get_builtin_icu_library(void)
+{
+	pg_icu_library *lib;
+
+	if (builtin_icu_library != NULL)
+		return builtin_icu_library;
+
+	/*
+	 * These assignments will fail to compile if an incompatible API change is
+	 * made to some future version of ICU, at which point we might need to
+	 * consider special treatment for different major version ranges, with
+	 * intermediate trampoline functions.
+	 */
+	lib = MemoryContextAlloc(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = ucol_open;
+	lib->closeCollator = ucol_close;
+	lib->getCollatorVersion = ucol_getVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = ucol_strcoll;
+	lib->strcollUTF8 = ucol_strcollUTF8;
+	lib->getSortKey = ucol_getSortKey;
+	lib->nextSortKeyPart = ucol_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = u_strToUpper;
+	lib->strToLower = u_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = ucol_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	/*
+	 * Also assert the size of a couple of types used as output buffers, as a
+	 * canary to tell us to add extra padding in the (unlikely) event that a
+	 * later release makes these values smaller.
+	 */
+	StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+					 "u_versionToString output buffer size changed incompatibly");
+	StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+					 "ucol_getVersion output buffer size changed incompatibly");
+
+	builtin_icu_library = lib;
+	return lib;
+}
+
+static pg_icu_library *
+get_icu_library(const char *locale, const char *version)
+{
+	pg_icu_library *lib = NULL;
+
+	if (get_icu_library_hook != NULL)
+		lib = get_icu_library_hook(locale, version);
+
+	if (!lib)
+		lib = get_builtin_icu_library();
+
+	return lib;
+}
+
+#endif
 
 /*
  * Construct a new pg_locale_t object.
@@ -1554,20 +1671,24 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 	{
 		UCollator  *collator;
 		UErrorCode	status;
+		pg_icu_library *iculib = get_icu_library(collate, version);
+
+		Assert(strcmp(collate, ctype) == 0);
 
 		/* collator may be leaked if we encounter an error */
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collate, &status);
+		collator = iculib->openCollator(collate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collate, u_errorName(status))));
+							collate, iculib->errorName(status))));
 
 		if (U_ICU_VERSION_MAJOR_NUM < 54)
-			icu_set_collation_attributes(collator, collate);
+			icu_set_collation_attributes(iculib, collator, collate);
 
 		result->info.icu.ucol = collator;
+		result->info.icu.lib = iculib;
 	}
 #endif
 	else
@@ -1589,6 +1710,14 @@ pg_locale_deterministic(pg_locale_t locale)
 		return locale->deterministic;
 }
 
+#ifdef USE_ICU
+pg_icu_library *
+get_default_icu_library(void)
+{
+	return default_icu;
+}
+#endif
+
 /*
  * Initialize default database locale.
  */
@@ -1596,6 +1725,10 @@ void
 init_default_locale(char provider, const char *collate, const char *ctype,
 					const char *iculocale, const char *version)
 {
+#ifdef USE_ICU
+	default_icu = get_icu_library(iculocale, version);
+#endif
+
 	/*
 	 * For the purposes of pg_locale_t, if the provider is ICU, we use
 	 * iculocale for both collate and ctype.
@@ -1759,17 +1892,18 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
+		pg_icu_library	*iculib = get_icu_library(collcollate, NULL);
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = iculib->openCollator(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, iculib->errorName(status))));
+		iculib->getCollatorVersion(collator, versioninfo);
+		iculib->closeCollator(collator);
 
-		u_versionToString(versioninfo, buf);
+		iculib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -2021,16 +2155,17 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	UChar	*uchar1,
 			*uchar2;
 	int		 result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 #ifdef HAVE_UCOL_STRCOLLUTF8
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 #endif
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen1 = uchar_length(icu_converter, arg1, len1);
-	ulen2 = uchar_length(icu_converter, arg2, len2);
+	ulen1 = uchar_length(iculib, icu_converter, arg1, len1);
+	ulen2 = uchar_length(iculib, icu_converter, arg2, len2);
 
 	bufsize1 = (ulen1 + 1) * sizeof(UChar);
 	bufsize2 = (ulen2 + 1) * sizeof(UChar);
@@ -2041,12 +2176,12 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	uchar1 = (UChar *) buf;
 	uchar2 = (UChar *) (buf + bufsize1);
 
-	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
-	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+	ulen1 = uchar_convert(iculib, icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(iculib, icu_converter, uchar2, ulen2 + 1, arg2, len2);
 
-	result = ucol_strcoll(locale->info.icu.ucol,
-						  uchar1, ulen1,
-						  uchar2, ulen2);
+	result = iculib->strcoll(locale->info.icu.ucol,
+							 uchar1, ulen1,
+							 uchar2, ulen2);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2067,6 +2202,7 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 				pg_locale_t locale)
 {
 	int result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
@@ -2076,13 +2212,14 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 		UErrorCode	status;
 
 		status = U_ZERO_ERROR;
-		result = ucol_strcollUTF8(locale->info.icu.ucol,
-								  arg1, len1,
-								  arg2, len2,
-								  &status);
+		result = iculib->strcollUTF8(locale->info.icu.ucol,
+									 arg1, len1,
+									 arg2, len2,
+									 &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
-					(errmsg("collation failed: %s", u_errorName(status))));
+					(errmsg("collation failed: %s",
+							iculib->errorName(status))));
 	}
 	else
 #endif
@@ -2261,12 +2398,13 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 	int32_t	 ulen;
 	size_t   uchar_bsize;
 	Size	 result_bsize;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2275,11 +2413,11 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
-	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
-								   uchar, ulen,
-								   (uint8_t *) dest, destsize);
+	result_bsize = iculib->getSortKey(locale->info.icu.ucol,
+									  uchar, ulen,
+									  (uint8_t *) dest, destsize);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2308,13 +2446,14 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 	UChar			*uchar = NULL;
 	size_t			 uchar_bsize;
 	Size			 result_bsize;
+	pg_icu_library	*iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2323,21 +2462,19 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
 	uiter_setString(&iter, uchar, ulen);
 	state[0] = state[1] = 0;	/* won't need that again */
 	status = U_ZERO_ERROR;
-	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
-										&iter,
-										state,
-										(uint8_t *) dest,
-										destsize,
-										&status);
+	result_bsize = iculib->nextSortKeyPart(
+		locale->info.icu.ucol, &iter, state,
+		(uint8_t *) dest, destsize, &status);
+
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("sort key generation failed: %s",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return result_bsize;
 }
@@ -2346,6 +2483,7 @@ static size_t
 pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 					   size_t destsize, pg_locale_t locale)
 {
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 	size_t result;
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
@@ -2356,19 +2494,17 @@ pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 		uint32_t	state[2];
 		UErrorCode	status;
 
-		uiter_setUTF8(&iter, src, srclen);
+		iculib->setUTF8(&iter, src, srclen);
 		state[0] = state[1] = 0;	/* won't need that again */
 		status = U_ZERO_ERROR;
-		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
-									  &iter,
-									  state,
-									  (uint8_t *) dest,
-									  destsize,
-									  &status);
+		result = iculib->nextSortKeyPart(
+			locale->info.icu.ucol, &iter, state,
+			(uint8_t *) dest, destsize, &status);
+
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("sort key generation failed: %s",
-							u_errorName(status))));
+							iculib->errorName(status))));
 	}
 	else
 		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
@@ -2564,7 +2700,7 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 
 #ifdef USE_ICU
 static void
-init_icu_converter(void)
+init_icu_converter(pg_icu_library *iculib)
 {
 	const char *icu_encoding_name;
 	UErrorCode	status;
@@ -2581,11 +2717,11 @@ init_icu_converter(void)
 						pg_encoding_to_char(GetDatabaseEncoding()))));
 
 	status = U_ZERO_ERROR;
-	conv = ucnv_open(icu_encoding_name, &status);
+	conv = iculib->openConverter(icu_encoding_name, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open ICU converter for encoding \"%s\": %s",
-						icu_encoding_name, u_errorName(status))));
+						icu_encoding_name, iculib->errorName(status))));
 
 	icu_converter = conv;
 }
@@ -2594,14 +2730,15 @@ init_icu_converter(void)
  * Find length, in UChars, of given string if converted to UChar string.
  */
 static size_t
-uchar_length(UConverter *converter, const char *str, size_t len)
+uchar_length(pg_icu_library *iculib, UConverter *converter, const char *str, size_t len)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
-	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	ulen = iculib->toUChars(converter, NULL, 0, str, len, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2610,16 +2747,17 @@ uchar_length(UConverter *converter, const char *str, size_t len)
  * return the length (in UChars).
  */
 static int32_t
-uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
-			  const char *src, size_t srclen)
+uchar_convert(pg_icu_library *iculib, UConverter *converter, UChar *dest,
+			  int32_t destlen, const char *src, size_t srclen)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
 	status = U_ZERO_ERROR;
-	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	ulen = iculib->toUChars(converter, dest, destlen, src, srclen, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2636,16 +2774,17 @@ uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
  * result length instead.
  */
 int32_t
-icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
+icu_to_uchar(pg_icu_library *iculib, UChar **buff_uchar, const char *buff,
+			 size_t nbytes)
 {
 	int32_t len_uchar;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	len_uchar = uchar_length(icu_converter, buff, nbytes);
+	len_uchar = uchar_length(iculib, icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-	len_uchar = uchar_convert(icu_converter,
+	len_uchar = uchar_convert(iculib, icu_converter,
 							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
@@ -2663,30 +2802,32 @@ icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
  * The result string is nul-terminated.
  */
 int32_t
-icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
+icu_from_uchar(pg_icu_library *iculib, char **result, const UChar *buff_uchar,
+			   int32_t len_uchar)
 {
 	UErrorCode	status;
 	int32_t		len_result;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, NULL, 0,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, NULL, 0,
+									buff_uchar, len_uchar, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	*result = palloc(len_result + 1);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, *result,
+									len_result + 1, buff_uchar,
+									len_uchar, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return len_result;
 }
@@ -2702,7 +2843,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(pg_icu_library *iculib, UCollator *collator,
+							 const char *loc)
 {
 	char	   *str = asc_tolower(loc, strlen(loc));
 
@@ -2776,7 +2918,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 				status = U_ILLEGAL_ARGUMENT_ERROR;
 
 			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
+				iculib->setAttribute(collator, uattr, uvalue, &status);
 
 			/*
 			 * Pretend the error came from ucol_open(), for consistent error
@@ -2785,7 +2927,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			if (U_FAILURE(status))
 				ereport(ERROR,
 						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+								loc, iculib->errorName(status))));
 		}
 	}
 }
@@ -2801,17 +2943,18 @@ check_icu_locale(const char *icu_locale)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	pg_icu_library *iculib = get_icu_library(icu_locale, NULL);
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = iculib->openCollator(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, iculib->errorName(status))));
 
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+		icu_set_collation_attributes(iculib, collator, icu_locale);
+	iculib->closeCollator(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index 33465ad92d..afb6c24d64 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -14,6 +14,8 @@
 #define _PG_LOCALE_INTERNAL_
 
 #ifdef USE_ICU
+#include <unicode/ubrk.h>
+#include <unicode/ucnv.h>
 #include <unicode/ucol.h>
 #endif
 
@@ -30,6 +32,106 @@
 #endif
 #endif
 
+#ifdef USE_ICU
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	int			minor_version;
+	void		(*getICUVersion) (UVersionInfo info);
+	void		(*getUnicodeVersion) (UVersionInfo info);
+	void		(*getCLDRVersion) (UVersionInfo info, UErrorCode *status);
+	UCollator  *(*openCollator) (const char *loc, UErrorCode *status);
+	void		(*closeCollator) (UCollator *coll);
+	void		(*getCollatorVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*getUCAVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+	UCollationResult (*strcoll) (const UCollator *coll,
+								 const UChar *source,
+								 int32_t sourceLength,
+								 const UChar *target,
+								 int32_t targetLength);
+	UCollationResult (*strcollUTF8) (const UCollator *coll,
+									 const char *source,
+									 int32_t sourceLength,
+									 const char *target,
+									 int32_t targetLength,
+									 UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	void		(*setAttribute) (UCollator *coll,
+								 UColAttribute attr,
+								 UColAttributeValue value,
+								 UErrorCode *status);
+	UConverter *(*openConverter) (const char *converterName,
+								  UErrorCode *err);
+	void		(*closeConverter) (UConverter *converter);
+	int32_t		(*fromUChars) (UConverter *cnv,
+							   char *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UErrorCode *pErrorCode);
+	int32_t		(*toUChars) (UConverter *cnv,
+							 UChar *dest,
+							 int32_t destCapacity,
+							 const char *src,
+							 int32_t srcLength,
+							 UErrorCode *pErrorCode);
+	int32_t		(*toLanguageTag) (const char *localeID,
+								  char *langtag,
+								  int32_t langtagCapacity,
+								  UBool strict,
+								  UErrorCode *err);
+	int32_t		(*getDisplayName) (const char *localeID,
+								   const char *inLocaleID,
+								   UChar *result,
+								   int32_t maxResultSize,
+								   UErrorCode *err);
+	int32_t		(*countAvailable) (void);
+	const char *(*getAvailable) (int32_t n);
+} pg_icu_library;
+
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+
+#endif
+
 /*
  * We define our own wrapper around locale_t so we can keep the same
  * function signatures for all builds, while not having to create a
@@ -53,7 +155,8 @@ struct pg_locale_struct
 #ifdef USE_ICU
 		struct
 		{
-			UCollator	*ucol;
+			UCollator		*ucol;
+			pg_icu_library	*lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
@@ -61,8 +164,18 @@ struct pg_locale_struct
 };
 
 #ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
+
+typedef pg_icu_library *(*get_icu_library_hook_type)(
+	const char *locale, const char *version);
+
+extern PGDLLIMPORT get_icu_library_hook_type get_icu_library_hook;
+
+extern pg_icu_library *get_default_icu_library(void);
+extern int32_t icu_to_uchar(pg_icu_library *lib, UChar **buff_uchar,
+							const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(pg_icu_library *lib, char **result,
+							  const UChar *buff_uchar, int32_t len_uchar);
+
+#endif							/* USE_ICU */
 
 #endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1
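
For readers following the hook contract described in the pg_locale.c
comment in this patch, here is a minimal, self-contained sketch of the
lookup order used by get_icu_library(): ask the hook first, and fall back
to the built-in table when the hook is unset or returns NULL. Everything
prefixed demo_ is hypothetical and not part of the patch. The struct is
cut down to a single function pointer, the version string "153.14" and the
major version numbers 72 and 60 are made-up placeholders, and no real ICU
symbols are loaded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for pg_icu_library. */
typedef struct demo_icu_library
{
	int			major_version;
	const char *(*errorName) (int code);
} demo_icu_library;

/* Same shape as get_icu_library_hook_type in the patch. */
typedef demo_icu_library *(*demo_get_icu_library_hook_type) (
	const char *locale, const char *version);

demo_get_icu_library_hook_type demo_get_icu_library_hook = NULL;

/* Stand-in for u_errorName; the strings are for illustration only. */
static const char *
demo_error_name(int code)
{
	return code == 0 ? "U_ZERO_ERROR" : "U_DEMO_FAILURE";
}

/* "Built-in" library: the one the server was linked against. */
static demo_icu_library demo_builtin_lib = {72, demo_error_name};

/* "Pinned" library: what an extension might have loaded dynamically. */
static demo_icu_library demo_pinned_lib = {60, demo_error_name};

/*
 * Example hook: hand out the pinned library only when the catalog's
 * recorded collation version is one it recognizes; otherwise decline by
 * returning NULL so the caller falls back to the built-in library.
 */
demo_icu_library *
demo_hook(const char *locale, const char *version)
{
	(void) locale;				/* unused in this sketch */

	if (version != NULL && strcmp(version, "153.14") == 0)
		return &demo_pinned_lib;
	return NULL;
}

/* Mirrors the control flow of get_icu_library() in the patch. */
demo_icu_library *
demo_get_icu_library(const char *locale, const char *version)
{
	demo_icu_library *lib = NULL;

	if (demo_get_icu_library_hook != NULL)
		lib = demo_get_icu_library_hook(locale, version);

	if (lib == NULL)
		lib = &demo_builtin_lib;

	assert(lib->errorName != NULL);
	return lib;
}
```

Installing the hook is a single assignment (demo_get_icu_library_hook =
demo_hook), which in a real extension would presumably happen in
_PG_init(). A real hook would also have to load the library's symbols
itself and keep the returned structure valid permanently, as the comment
in the patch requires.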

v4-0005-Support-multiple-libc-collation-provider-librarie.patch (text/x-patch; charset=UTF-8)

From 375653b36bf9248caee0ec5fce476ef613e396c4 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 14 Dec 2022 15:12:35 -0800
Subject: [PATCH v4 5/6] Support multiple libc collation provider libraries.

Introduce structure pg_libc_library, which holds pointers to each
required libc method related to locale_t, encoding, collation, or
ctype. Store this as part of pg_locale_t. Each call to one of these
libc functions instead goes through this structure, so that it can
more easily be replaced by a non-builtin libc library.

Offer a hook to allow an extension to control which libc library
is loaded.

Adds a test module to test the hook functionality.

TODO: loose ends in regex code and possibly other places where libc
functions are called directly.

Author: Jeff Davis
---
 src/backend/utils/adt/formatting.c     |  31 +++--
 src/backend/utils/adt/pg_locale.c      | 185 +++++++++++++++++++------
 src/include/utils/pg_locale_internal.h |  73 +++++++++-
 3 files changed, 236 insertions(+), 53 deletions(-)
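
As a rough illustration of the call-site pattern this patch applies in
formatting.c, the sketch below routes a case conversion through a table of
libc function pointers: the locale's own table when a locale is given,
otherwise the default table. All demo_ names are hypothetical. The real
patch uses pg_libc_library, PG_LIBC_LIB() and get_default_libc_library(),
and its _l variants additionally take a locale_t, which is omitted here to
keep the example portable.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical, cut-down stand-in for pg_libc_library. */
typedef struct demo_libc_library
{
	wint_t		(*c_towlower) (wint_t wc);
	wint_t		(*c_towupper) (wint_t wc);
} demo_libc_library;

/* Hypothetical stand-in for pg_locale_t: it only carries the library. */
typedef struct demo_locale
{
	demo_libc_library *lib;
} demo_locale;

/* Table built from the libc this program is linked against. */
static demo_libc_library demo_builtin_libc = {towlower, towupper};

/* A do-nothing "replacement" library, to make the dispatch visible. */
static wint_t
demo_identity(wint_t wc)
{
	return wc;
}

demo_libc_library demo_fake_libc = {demo_identity, demo_identity};

demo_libc_library *
demo_get_default_libc_library(void)
{
	return &demo_builtin_libc;
}

/* Pick the locale's library if there is one, else the default. */
static demo_libc_library *
demo_pick_lib(const demo_locale *loc)
{
	demo_libc_library *lib;

	lib = (loc != NULL) ? loc->lib : demo_get_default_libc_library();
	assert(lib != NULL && lib->c_towlower != NULL && lib->c_towupper != NULL);
	return lib;
}

wint_t
demo_lower_char(const demo_locale *loc, wint_t wc)
{
	return demo_pick_lib(loc)->c_towlower(wc);
}

wint_t
demo_upper_char(const demo_locale *loc, wint_t wc)
{
	return demo_pick_lib(loc)->c_towupper(wc);
}
```

Because every call goes through the pointer table, swapping in a different
libc for one locale needs no change at the call sites, which is the
property the patch is after.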

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 7f318fe2c7..079ddebd50 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -69,7 +69,6 @@
 #include <math.h>
 #include <float.h>
 #include <limits.h>
-#include <wctype.h>
 
 #ifdef USE_ICU
 #include <unicode/ustring.h>
@@ -1729,10 +1728,16 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						workspace[curr_char] = libc->c_towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+					}
 					else
 #endif
-						workspace[curr_char] = towlower(workspace[curr_char]);
+					{
+						pg_libc_library *libc = get_default_libc_library();
+						workspace[curr_char] = libc->c_towlower(workspace[curr_char]);
+					}
 				}
 
 				/*
@@ -1852,10 +1857,16 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						workspace[curr_char] = libc->c_towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+					}
 					else
 #endif
-						workspace[curr_char] = towupper(workspace[curr_char]);
+					{
+						pg_libc_library *libc = get_default_libc_library();
+						workspace[curr_char] = libc->c_towupper(workspace[curr_char]);
+					}
 				}
 
 				/*
@@ -1977,19 +1988,21 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
 					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+							workspace[curr_char] = libc->c_towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+							workspace[curr_char] = libc->c_towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
 					{
+						pg_libc_library *libc = get_default_libc_library();
 						if (wasalnum)
-							workspace[curr_char] = towlower(workspace[curr_char]);
+							workspace[curr_char] = libc->c_towlower(workspace[curr_char]);
 						else
-							workspace[curr_char] = towupper(workspace[curr_char]);
+							workspace[curr_char] = libc->c_towupper(workspace[curr_char]);
 						wasalnum = iswalnum(workspace[curr_char]);
 					}
 				}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 5521fe11a5..751fa7931b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -115,19 +115,19 @@ char	   *localized_full_months[12 + 1];
 /*
  * The precise version of a collation provider library is important, because
  * subtle changes in collation between library versions can corrupt
- * indexes. This hook allows control over how collation provider libraries are
- * loaded.
+ * indexes. These hooks allow control over how collation provider libraries
+ * are loaded.
  *
  * If the hook is not set, or if it returns NULL, Postgres constructs the
- * pg_icu_library structure from the symbols Postgres is linked with at build
- * time.
+ * pg_libc_library or pg_icu_library structures from the symbols Postgres is
+ * linked with at build time.
  *
- * If the hook is set, it can instead construct the pg_icu_library structure
- * using custom logic. Ordinarily, this custom logic would involve finding a
- * specific known version of the collation provider library, and dynamically
- * loading the necessary symbols. If the collation version obtained from the
- * library does not match the collation version recorded in the catalog,
- * Postgres will issue a WARNING.
+ * If the hook is set, it can instead construct the pg_libc_library or
+ * pg_icu_library structures using custom logic. Ordinarily, this custom logic
+ * would involve finding a specific known version of the collation provider
+ * library, and dynamically loading the necessary symbols. If the collation
+ * version obtained from the library does not match the collation version
+ * recorded in the catalog, Postgres will issue a WARNING.
  *
  * The returned structure must be allocated in TopMemoryContext, and the
  * associated symbols must remain valid permanently. It's expected that the
@@ -139,10 +139,12 @@ char	   *localized_full_months[12 + 1];
  * consider what other information can be provided to allow for greater
  * control over which library is loaded.
  */
+get_libc_library_hook_type get_libc_library_hook = NULL;
 #ifdef USE_ICU
 get_icu_library_hook_type get_icu_library_hook = NULL;
 #endif
 
+static pg_libc_library *builtin_libc_library = NULL;
 #ifdef USE_ICU
 static pg_icu_library *builtin_icu_library = NULL;
 #endif
@@ -173,6 +175,7 @@ static char *IsoLocaleName(const char *);
  * Database default locale.
  */
 static pg_locale_t default_locale = NULL;
+static pg_libc_library *default_libc = NULL;
 #ifdef USE_ICU
 static pg_icu_library *default_icu = NULL;
 #endif
@@ -1392,7 +1395,7 @@ lc_collate_is_c(Oid collation)
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_COLLATE, NULL);
+		localeptr = default_libc->c_setlocale(LC_COLLATE, NULL);
 		if (!localeptr)
 			elog(ERROR, "invalid LC_COLLATE setting");
 
@@ -1445,7 +1448,7 @@ lc_ctype_is_c(Oid collation)
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_CTYPE, NULL);
+		localeptr = default_libc->c_setlocale(LC_CTYPE, NULL);
 		if (!localeptr)
 			elog(ERROR, "invalid LC_CTYPE setting");
 
@@ -1503,6 +1506,75 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+static pg_libc_library *
+get_builtin_libc_library(void)
+{
+	pg_libc_library *lib = NULL;
+
+	if (builtin_libc_library != NULL)
+		return builtin_libc_library;
+
+	lib = MemoryContextAlloc(TopMemoryContext, sizeof(pg_libc_library));
+#if defined(__GLIBC__)
+	lib->libc_version = gnu_get_libc_version;
+#else
+	lib->libc_version = NULL;
+#endif
+	lib->c_setlocale = setlocale;
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	lib->c_newlocale = newlocale;
+	lib->c_freelocale = freelocale;
+	lib->c_uselocale = uselocale;
+#else
+	lib->_create_locale = _create_locale;
+#endif
+#endif
+	lib->c_wcstombs = wcstombs;
+	lib->c_mbstowcs = mbstowcs;
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	lib->c_wcstombs_l = wcstombs_l;
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	lib->c_mbstowcs_l = mbstowcs_l;
+#endif
+#endif
+	lib->c_strcoll = strcoll;
+	lib->c_wcscoll = wcscoll;
+	lib->c_strxfrm = strxfrm;
+#ifdef HAVE_LOCALE_T
+	lib->c_strcoll_l = strcoll_l;
+	lib->c_wcscoll_l = wcscoll_l;
+	lib->c_strxfrm_l = strxfrm_l;
+#endif
+	lib->c_iswalnum = iswalnum;
+	lib->c_towlower = towlower;
+	lib->c_towupper = towupper;
+#ifdef HAVE_LOCALE_T
+	lib->c_iswalnum_l = iswalnum_l;
+	lib->c_towlower_l = towlower_l;
+	lib->c_towupper_l = towupper_l;
+#endif
+
+	builtin_libc_library = lib;
+	return lib;
+}
+
+static pg_libc_library *
+get_libc_library(const char *collate, const char *ctype, const char *version)
+{
+	pg_libc_library *lib = NULL;
+
+	if (get_libc_library_hook != NULL)
+		lib = get_libc_library_hook(collate, ctype, version);
+
+	if (!lib)
+		lib = get_builtin_libc_library();
+
+	return lib;
+}
+
 #ifdef USE_ICU
 static pg_icu_library *
 get_builtin_icu_library()
@@ -1614,6 +1686,7 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 	if (provider == COLLPROVIDER_LIBC)
 	{
 #ifdef HAVE_LOCALE_T
+		pg_libc_library *libc = get_libc_library(collate, ctype, version);
 		locale_t        loc;
 
 		/* newlocale's result may be leaked if we encounter an error */
@@ -1623,10 +1696,10 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 			/* Normal case where they're the same */
 			errno = 0;
 #ifndef WIN32
-			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+			loc = libc->c_newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
 							NULL);
 #else
-			loc = _create_locale(LC_ALL, collate);
+			loc = libc->_create_locale(LC_ALL, collate);
 #endif
 			if (!loc)
 				report_newlocale_failure(collate);
@@ -1638,11 +1711,11 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 			locale_t	loc1;
 
 			errno = 0;
-			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			loc1 = libc->c_newlocale(LC_COLLATE_MASK, collate, NULL);
 			if (!loc1)
 				report_newlocale_failure(collate);
 			errno = 0;
-			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			loc = libc->c_newlocale(LC_CTYPE_MASK, ctype, loc1);
 			if (!loc)
 				report_newlocale_failure(ctype);
 #else
@@ -1659,6 +1732,7 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 		}
 
 		result->info.libc.lt = loc;
+		result->info.libc.lib = libc;
 #else							/* not HAVE_LOCALE_T */
 		/* platform that doesn't support locale_t */
 		ereport(ERROR,
@@ -1710,6 +1784,12 @@ pg_locale_deterministic(pg_locale_t locale)
 		return locale->deterministic;
 }
 
+pg_libc_library *
+get_default_libc_library(void)
+{
+	return default_libc;
+}
+
 #ifdef USE_ICU
 pg_icu_library *
 get_default_icu_library()
@@ -1725,6 +1805,19 @@ void
 init_default_locale(char provider, const char *collate, const char *ctype,
 					const char *iculocale, const char *version)
 {
+	default_libc = get_libc_library(collate, ctype, version);
+
+	/*
+	 * If it's the builtin libc, pg_perm_setlocale() will handle
+	 * setlocale(). Otherwise, it must be done here so that the collate and
+	 * ctype take effect.
+	 */
+	if (default_libc != builtin_libc_library)
+	{
+		default_libc->c_setlocale(LC_COLLATE, collate);
+		default_libc->c_setlocale(LC_CTYPE, ctype);
+	}
+
 #ifdef USE_ICU
 	default_icu = get_icu_library(iculocale, version);
 #endif
@@ -1914,18 +2007,21 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 			pg_strcasecmp("POSIX", collcollate) != 0)
 	{
 #if defined(__GLIBC__)
+		pg_libc_library *libc = get_libc_library(collcollate, NULL, NULL);
 		/* Use the glibc version because we don't have anything better. */
-		collversion = pstrdup(gnu_get_libc_version());
+		if (libc->libc_version != NULL)
+			collversion = pstrdup(libc->libc_version());
 #elif defined(LC_VERSION_MASK)
+		pg_libc_library *libc = get_libc_library(collcollate, NULL, NULL);
 		locale_t	loc;
 
 		/* Look up FreeBSD collation version. */
-		loc = newlocale(LC_COLLATE, collcollate, NULL);
+		loc = libc->c_newlocale(LC_COLLATE, collcollate, NULL);
 		if (loc)
 		{
 			collversion =
 				pstrdup(querylocale(LC_COLLATE_MASK | LC_VERSION_MASK, loc));
-			freelocale(loc);
+			libc->c_freelocale(loc);
 		}
 		else
 			ereport(ERROR,
@@ -2031,10 +2127,14 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
+	{
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+		result = libc->c_wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
+	}
 	else
 #endif
-		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+		result = default_libc->c_wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+
 	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
 								 * headers */
 		ereport(ERROR,
@@ -2060,7 +2160,6 @@ static int
 pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 {
 	int result;
-
 	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
 #ifdef WIN32
 	if (GetDatabaseEncoding() == PG_UTF8)
@@ -2074,14 +2173,16 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+
+		result = libc->c_strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
 #endif
 	}
 	else
-		result = strcoll(arg1, arg2);
+		result = default_libc->c_strcoll(arg1, arg2);
 
 	return result;
 }
@@ -2327,17 +2428,15 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 {
 	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
 
-#ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
+	{
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+		return libc->c_strxfrm_l(dest, src, destsize, locale->info.libc.lt);
+	}
 	else
 #endif
-		return strxfrm(dest, src, destsize);
-#else
-	/* shouldn't happen */
-	elog(ERROR, "unsupported collprovider: %c", locale->provider);
-#endif
+		return default_libc->c_strxfrm(dest, src, destsize);
 }
 
 static size_t
@@ -3010,21 +3109,23 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 	if (locale == (pg_locale_t) 0)
 	{
 		/* Use wcstombs directly for the default locale */
-		result = wcstombs(to, from, tolen);
+		result = default_libc->c_wcstombs(to, from, tolen);
 	}
 	else
 	{
 #ifdef HAVE_LOCALE_T
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
+		result = libc->c_wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.libc.lt);
+		locale_t	save_locale = libc->c_uselocale(locale->info.libc.lt);
 
-		result = wcstombs(to, from, tolen);
+		result = libc->c_wcstombs(to, from, tolen);
 
-		uselocale(save_locale);
+		libc->c_uselocale(save_locale);
 #endif							/* HAVE_WCSTOMBS_L */
 #else							/* !HAVE_LOCALE_T */
 		/* Can't have locale != 0 without HAVE_LOCALE_T */
@@ -3087,21 +3188,23 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 		if (locale == (pg_locale_t) 0)
 		{
 			/* Use mbstowcs directly for the default locale */
-			result = mbstowcs(to, str, tolen);
+			result = default_libc->c_mbstowcs(to, str, tolen);
 		}
 		else
 		{
 #ifdef HAVE_LOCALE_T
+			pg_libc_library *libc = PG_LIBC_LIB(locale);
+
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
+			result = libc->c_mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.libc.lt);
+			locale_t	save_locale = libc->c_uselocale(locale->info.libc.lt);
 
-			result = mbstowcs(to, str, tolen);
+			result = libc->c_mbstowcs(to, str, tolen);
 
-			uselocale(save_locale);
+			libc->c_uselocale(save_locale);
 #endif							/* HAVE_MBSTOWCS_L */
 #else							/* !HAVE_LOCALE_T */
 			/* Can't have locale != 0 without HAVE_LOCALE_T */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index afb6c24d64..bc6e930580 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -13,6 +13,9 @@
 #ifndef _PG_LOCALE_INTERNAL_
 #define _PG_LOCALE_INTERNAL_
 
+#include <wchar.h>
+#include <wctype.h>
+
 #ifdef USE_ICU
 #include <unicode/ubrk.h>
 #include <unicode/ucnv.h>
@@ -32,6 +35,63 @@
 #endif
 #endif
 
+typedef struct pg_libc_library
+{
+	/* version */
+	const char *(*libc_version) (void);
+
+	/* locale */
+	char *(*c_setlocale) (int category, const char *locale);
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	locale_t (*c_newlocale) (int category_mask, const char *locale,
+						   locale_t base);
+	void (*c_freelocale) (locale_t locobj);
+	locale_t (*c_uselocale) (locale_t newloc);
+#else
+	_locale_t (*_create_locale) (int category, const char *locale);
+#endif
+#endif
+
+	/* encoding */
+	size_t (*c_wcstombs) (char *dest, const wchar_t *src, size_t n);
+	size_t (*c_mbstowcs) (wchar_t *dest, const char *src, size_t n);
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	size_t (*c_wcstombs_l) (char *dest, const wchar_t *src, size_t n,
+						  locale_t loc);
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	size_t (*c_mbstowcs_l) (wchar_t *dest, const char *src, size_t n,
+						  locale_t loc);
+#endif
+#endif
+
+	/* collation */
+	int (*c_strcoll) (const char *s1, const char *s2);
+	int (*c_wcscoll) (const wchar_t *ws1, const wchar_t *ws2);
+	size_t (*c_strxfrm) (char *s1, const char * s2, size_t n);
+#ifdef HAVE_LOCALE_T
+	int (*c_strcoll_l) (const char *s1, const char *s2, locale_t locale);
+	int (*c_wcscoll_l) (const wchar_t *ws1, const wchar_t *ws2,
+					  locale_t locale);
+	size_t (*c_strxfrm_l) (char *s1, const char * s2, size_t n,
+						 locale_t locale);
+#endif
+
+	/* ctype */
+	int (*c_iswalnum) (wint_t wc);
+	wint_t (*c_towlower) (wint_t wc);
+	wint_t (*c_towupper) (wint_t wc);
+#ifdef HAVE_LOCALE_T
+	int (*c_iswalnum_l) (wint_t wc, locale_t locale);
+	wint_t (*c_towlower_l) (wint_t wc, locale_t locale);
+	wint_t (*c_towupper_l) (wint_t wc, locale_t locale);
+#endif
+} pg_libc_library;
+
+#define PG_LIBC_LIB(x) ((x)->info.libc.lib)
+
 #ifdef USE_ICU
 /*
  * An ICU library version that we're either linked against or have loaded at
@@ -146,12 +206,13 @@ struct pg_locale_struct
 	char	   *ctype;
 	union
 	{
-#ifdef HAVE_LOCALE_T
 		struct
 		{
+#ifdef HAVE_LOCALE_T
 			locale_t	lt;
-		}			libc;
 #endif
+			pg_libc_library *lib;
+		}			libc;
 #ifdef USE_ICU
 		struct
 		{
@@ -159,10 +220,16 @@ struct pg_locale_struct
 			pg_icu_library	*lib;
 		}			icu;
 #endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+typedef pg_libc_library *(*get_libc_library_hook_type)(
+	const char *collate, const char *ctype, const char *version);
+
+extern PGDLLIMPORT get_libc_library_hook_type get_libc_library_hook;
+
+extern pg_libc_library *get_default_libc_library(void);
+
 #ifdef USE_ICU
 
 typedef pg_icu_library *(*get_icu_library_hook_type)(
-- 
2.34.1
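
For readers following along, here is a minimal standalone sketch of the hook-with-fallback pattern that get_libc_library() uses in the patch above. It does not build against PostgreSQL: the demo_* names and the three-member struct are hypothetical stand-ins for pg_libc_library and get_libc_library_hook, the hook takes only a collation name, and the structures live in static storage rather than TopMemoryContext.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical, much-reduced stand-in for pg_libc_library. */
typedef struct demo_libc_library
{
	int			(*c_strcoll) (const char *s1, const char *s2);
	wint_t		(*c_towlower) (wint_t wc);
	wint_t		(*c_towupper) (wint_t wc);
} demo_libc_library;

/* Hypothetical stand-in for get_libc_library_hook_type. */
typedef demo_libc_library *(*demo_get_library_hook_type) (const char *collate);

demo_get_library_hook_type demo_get_library_hook = NULL;

/* Build the table from the libc symbols this program is linked with. */
static demo_libc_library *
demo_builtin_library(void)
{
	static demo_libc_library lib;

	lib.c_strcoll = strcoll;
	lib.c_towlower = towlower;
	lib.c_towupper = towupper;
	return &lib;
}

/* A replacement comparison that inverts the sort order. */
static int
demo_reverse_strcoll(const char *s1, const char *s2)
{
	return strcoll(s2, s1);
}

/*
 * Example hook: supplies a custom library only for the collation named
 * "DESC"; returning NULL asks the caller to fall back to the builtin one.
 */
static demo_libc_library *
demo_reverse_hook(const char *collate)
{
	static demo_libc_library lib;

	if (strcmp(collate, "DESC") != 0)
		return NULL;

	lib.c_strcoll = demo_reverse_strcoll;
	lib.c_towlower = towlower;
	lib.c_towupper = towupper;
	return &lib;
}

/* Same shape as get_libc_library(): try the hook, then fall back. */
static demo_libc_library *
demo_get_library(const char *collate)
{
	demo_libc_library *lib = NULL;

	if (demo_get_library_hook != NULL)
		lib = demo_get_library_hook(collate);

	if (lib == NULL)
		lib = demo_builtin_library();

	assert(lib != NULL);
	return lib;
}
```

An extension-like caller would set demo_get_library_hook = demo_reverse_hook once at load time; after that, every caller goes through demo_get_library(name)->c_strcoll(...) and never calls strcoll() directly, which is the same indirection the patch introduces via PG_LIBC_LIB() and default_libc.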

v4-0006-Add-tests-for-collation-provider-hooks.patch (text/x-patch; charset=UTF-8)

From 6651a9c2a99199f584862f39648692236eb4e911 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Sat, 17 Dec 2022 12:35:05 -0800
Subject: [PATCH v4 6/6] Add tests for collation provider hooks.

---
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 .../modules/test_collation_lib_hooks/Makefile |  24 +
 .../test_collation_lib_hooks/meson.build      |  37 ++
 .../test_collation_lib_hooks/t/001_libc.pl    | 136 ++++++
 .../test_collation_lib_hooks/t/002_icu.pl     | 115 +++++
 .../test_collation_lib_hooks.c                |  49 +++
 .../test_collation_lib_hooks.control          |   4 +
 .../test_collation_lib_hooks.h                |  32 ++
 .../test_collation_lib_hooks/test_icu_hook.c  | 228 ++++++++++
 .../test_collation_lib_hooks/test_libc_hook.c | 413 ++++++++++++++++++
 11 files changed, 1040 insertions(+)
 create mode 100644 src/test/modules/test_collation_lib_hooks/Makefile
 create mode 100644 src/test/modules/test_collation_lib_hooks/meson.build
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/001_libc.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/002_icu.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_icu_hook.c
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_libc_hook.c

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..261bf5e729 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -15,6 +15,7 @@ SUBDIRS = \
 		  snapshot_too_old \
 		  spgist_name_ops \
 		  test_bloomfilter \
+		  test_collation_lib_hooks \
 		  test_copy_callbacks \
 		  test_custom_rmgrs \
 		  test_ddl_deparse \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 911a768a29..b26c5426f9 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -9,6 +9,7 @@ subdir('snapshot_too_old')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
+subdir('test_collation_lib_hooks')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
 subdir('test_ddl_deparse')
diff --git a/src/test/modules/test_collation_lib_hooks/Makefile b/src/test/modules/test_collation_lib_hooks/Makefile
new file mode 100644
index 0000000000..c36e1cb739
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_collation_lib_hooks/Makefile
+
+MODULE_big = test_collation_lib_hooks
+OBJS = \
+	$(WIN32RES) \
+	test_collation_lib_hooks.o test_icu_hook.o test_libc_hook.o
+PGFILEDESC = "test_collation_lib_hooks - test collation provider library hooks"
+
+EXTENSION = test_collation_lib_hooks
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_collation_lib_hooks
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+export with_icu
diff --git a/src/test/modules/test_collation_lib_hooks/meson.build b/src/test/modules/test_collation_lib_hooks/meson.build
new file mode 100644
index 0000000000..9c4b08a9f9
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/meson.build
@@ -0,0 +1,37 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_collation_lib_hooks_sources = files(
+  'test_collation_lib_hooks.c',
+  'test_libc_hook.c',
+  'test_icu_hook.c',
+)
+
+if host_system == 'windows'
+  test_collation_lib_hooks_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_collation_lib_hooks',
+    '--FILEDESC', 'test_collation_lib_hooks - test collation provider library hooks',])
+endif
+
+test_collation_lib_hooks = shared_module('test_collation_lib_hooks',
+  test_collation_lib_hooks_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_collation_lib_hooks
+
+install_data(
+  'test_collation_lib_hooks.control',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_collation_lib_hooks',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_libc.pl',
+      't/002_icu.pl',
+    ],
+    'env': {'with_icu': icu.found() ? 'yes' : 'no'},
+  },
+}
diff --git a/src/test/modules/test_collation_lib_hooks/t/001_libc.pl b/src/test/modules/test_collation_lib_hooks/t/001_libc.pl
new file mode 100644
index 0000000000..f88c44c085
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/001_libc.pl
@@ -0,0 +1,136 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+# setup
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_reversecase
+			  (PROVIDER=libc, LC_COLLATE='ASC', LC_CTYPE='DESC')]);
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_reversesort
+			  (PROVIDER=libc, LC_COLLATE='DESC', LC_CTYPE='ASC')]);
+
+$node->safe_psql('postgres', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('postgres',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+# check versions
+
+my $version_asc =
+  $node->safe_psql('postgres',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_reversecase']);
+is($version_asc, '3.14159',
+	'collation "test_reversecase" has correct version 3.14159'
+);
+
+my $version_desc =
+  $node->safe_psql('postgres',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_reversesort']);
+is($version_desc, '3.14159',
+	'collation "test_reversesort" has correct version 3.14159'
+);
+
+my $res_sort_expected = "aBcD
+fGhI
+wXyZ";
+
+my $res_reversesort_expected = "wXyZ
+fGhI
+aBcD";
+
+# test comparison
+
+my $comparison =
+  $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_reversecase < 'wXyZ' COLLATE test_reversecase]);
+is($comparison, 't',
+	'correct comparison'
+);
+
+# test reverse comparison
+
+my $comparison_reverse =
+  $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_reversesort < 'wXyZ' COLLATE test_reversesort]);
+is($comparison_reverse, 'f',
+	'correct reverse comparison'
+);
+
+# test asc sort with trust_strxfrm = false
+
+my $res_sort =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = false;
+         SELECT t FROM strings ORDER BY t COLLATE test_reversecase]);
+is($res_sort, $res_sort_expected,
+	'correct ascending sort (trust_strxfrm = false)'
+);
+
+# test desc sort with trust_strxfrm = false
+
+my $res_reversesort =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = false;
+         SELECT t FROM strings ORDER BY t COLLATE test_reversesort]);
+is($res_reversesort, $res_reversesort_expected,
+	'correct descending sort (trust_strxfrm = false)'
+);
+
+# test asc sort with trust_strxfrm = true
+
+my $res_sort_strxfrm =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = true;
+         SELECT t FROM strings ORDER BY t COLLATE test_reversecase]);
+is($res_sort_strxfrm, $res_sort_expected,
+	'correct ascending sort (trust_strxfrm = true)'
+);
+
+# test desc sort with trust_strxfrm = true
+
+my $res_reversesort_strxfrm =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = true;
+         SELECT t FROM strings ORDER BY t COLLATE test_reversesort]);
+is($res_reversesort_strxfrm, $res_reversesort_expected,
+	'correct descending sort (trust_strxfrm = true)'
+);
+
+# test lower/upper
+
+my $tcase =
+  $node->safe_psql('postgres',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_reversesort),
+                upper('aBcDfgHiwXyZ' collate test_reversesort)]);
+is($tcase, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+	'correct lowercase and uppercase'
+);
+
+# test reverse lower/upper
+
+my $tcase_reverse =
+  $node->safe_psql('postgres',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_reversecase),
+                upper('aBcDfgHiwXyZ' collate test_reversecase)]);
+is($tcase_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase'
+);
+
+
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/t/002_icu.pl b/src/test/modules/test_collation_lib_hooks/t/002_icu.pl
new file mode 100644
index 0000000000..cdd7b16d3e
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/002_icu.pl
@@ -0,0 +1,115 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{with_icu} ne 'yes')
+{
+	plan skip_all => 'ICU not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+# setup
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_asc (PROVIDER=icu, LOCALE='ASC')]);
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_desc (PROVIDER=icu, LOCALE='DESC')]);
+
+$node->safe_psql('postgres', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('postgres',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+# check versions
+
+my $version_asc =
+  $node->safe_psql('postgres',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_asc']);
+is($version_asc, '2.72',
+	'collation "test_asc" has correct version 2.72'
+);
+
+my $version_desc =
+  $node->safe_psql('postgres',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_desc']);
+is($version_desc, '2.72',
+	'collation "test_desc" has correct version 2.72'
+);
+
+my $res_sort_expected = "aBcD
+fGhI
+wXyZ";
+
+my $res_reversesort_expected = "wXyZ
+fGhI
+aBcD";
+
+# test comparison
+
+my $comparison =
+  $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_asc < 'wXyZ' COLLATE test_asc]);
+is($comparison, 't',
+	'correct comparison'
+);
+
+# test reverse comparison
+
+my $comparison_reverse =
+  $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_desc < 'wXyZ' COLLATE test_desc]);
+is($comparison_reverse, 'f',
+	'correct reverse comparison'
+);
+
+# test asc sort
+
+my $res_sort =
+  $node->safe_psql('postgres',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($res_sort, $res_sort_expected,
+	'correct ascending sort'
+);
+
+# test desc sort
+
+my $res_reversesort =
+  $node->safe_psql('postgres',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($res_reversesort, $res_reversesort_expected,
+	'correct descending sort'
+);
+
+# test lower/upper
+
+my $tcase =
+  $node->safe_psql('postgres',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_asc),
+                upper('aBcDfgHiwXyZ' collate test_asc)]);
+is($tcase, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+	'correct lowercase and uppercase'
+);
+
+# test reverse lower/upper
+
+my $tcase_reverse =
+  $node->safe_psql('postgres',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_desc),
+                upper('aBcDfgHiwXyZ' collate test_desc)]);
+is($tcase_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase'
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
new file mode 100644
index 0000000000..da532ceb1d
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
@@ -0,0 +1,49 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collation_lib_hooks.c
+ *		Code for testing collation provider library hooks
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
+ *
+ * Test implementations of libc-like and icu-like collation providers.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "miscadmin.h"
+
+#include "test_collation_lib_hooks.h"
+
+static get_libc_library_hook_type prev_get_libc_library_hook = NULL;
+#ifdef USE_ICU
+static get_icu_library_hook_type prev_get_icu_library_hook = NULL;
+#endif
+
+PG_MODULE_MAGIC;
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	if (!process_shared_preload_libraries_in_progress)
+		ereport(ERROR, (errmsg("test_collation_lib_hooks must be loaded via shared_preload_libraries")));
+
+	prev_get_libc_library_hook = get_libc_library_hook;
+	get_libc_library_hook = test_get_libc_library;
+
+#ifdef USE_ICU
+	prev_get_icu_library_hook = get_icu_library_hook;
+	get_icu_library_hook = test_get_icu_library;
+#endif
+
+	init_libc_hook();
+}
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
new file mode 100644
index 0000000000..a0b8e031a4
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
@@ -0,0 +1,4 @@
+comment = 'Test code for collation provider library hooks'
+default_version = '1.0'
+module_pathname = '$libdir/test_collation_lib_hooks'
+
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
new file mode 100644
index 0000000000..94ea943b97
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
@@ -0,0 +1,32 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collation_lib_hooks.h
+ *		Definitions for collation library hooks.
+ *
+ * Copyright (c) 2015-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#ifndef TEST_COLLATION_LIB_HOOKS_H
+#define TEST_COLLATION_LIB_HOOKS_H
+
+#include "postgres.h"
+
+#include "utils/memutils.h"
+#include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
+
+extern void init_libc_hook(void);
+extern pg_libc_library *test_get_libc_library(const char *collate,
+											  const char *ctype,
+											  const char *version);
+#ifdef USE_ICU
+extern pg_icu_library *test_get_icu_library(const char *locale,
+											const char *version);
+#endif
+
+#endif
diff --git a/src/test/modules/test_collation_lib_hooks/test_icu_hook.c b/src/test/modules/test_collation_lib_hooks/test_icu_hook.c
new file mode 100644
index 0000000000..ae257cc03b
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_icu_hook.c
@@ -0,0 +1,228 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_icu_hook.c
+ *		Code for testing collation provider icu hook.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_icu_hook.c
+ *
+ * Implements a custom icu-like collation provider library for testing the
+ * hooks. It accepts any collation name requested. All behave exactly like the
+ * "en_US" locale, except for the locale named "DESC", which reverses the sort
+ * order and reverses uppercase/lowercase behavior.
+ *
+ * The version is always reported as 2.72, so loading it will cause a version
+ * mismatch warning.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "test_collation_lib_hooks.h"
+
+#ifdef USE_ICU
+
+#include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
+
+#define TEST_LOCALE "en_US"
+
+typedef struct TestUCollator {
+	UCollator	*ucol;
+	bool		 reverse;
+} TestUCollator;
+
+static pg_icu_library *test_icu_library = NULL;
+static const UVersionInfo test_icu_version = { 2, 72 };
+
+static bool
+locale_is_reverse(const char *locale)
+{
+	if (strcmp(locale, "DESC") == 0)
+		return true;
+	else
+		return false;
+}
+
+static UCollator *
+test_openCollator(const char *loc, UErrorCode *status)
+{
+	TestUCollator *testcol = MemoryContextAlloc(TopMemoryContext, sizeof(TestUCollator));
+	UCollator *ucol = ucol_open(TEST_LOCALE, status);
+	testcol->ucol = ucol;
+	testcol->reverse = locale_is_reverse(loc);
+	return (UCollator *)testcol;
+}
+
+static void
+test_closeCollator(UCollator *coll)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_close(testcol->ucol);
+	pfree(testcol);
+}
+
+static void
+test_setAttribute(UCollator *coll, UColAttribute attr,
+				  UColAttributeValue value, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_setAttribute(testcol->ucol, attr, value, status);
+}
+
+static void
+test_getCollatorVersion(const UCollator *coll, UVersionInfo info)
+{
+	memcpy(info, test_icu_version, sizeof(UVersionInfo));
+}
+
+static UCollationResult
+test_strcoll(const UCollator *coll, const UChar *source, int32_t sourceLength,
+			 const UChar *target, int32_t targetLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcoll(testcol->ucol, source, sourceLength,
+										target, targetLength);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static UCollationResult
+test_strcollUTF8(const UCollator *coll, const char *source,
+				 int32_t sourceLength, const char *target,
+				 int32_t targetLength, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcollUTF8(testcol->ucol, source,
+											sourceLength, target,
+											targetLength, status);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static int32_t
+test_getSortKey(const UCollator *coll, const UChar *source,
+				int32_t sourceLength, uint8_t *result, int32_t resultLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_getSortKey(testcol->ucol, source, sourceLength,
+								  result, resultLength);
+	size_t result_size = ret + 1;
+
+	if (resultLength >= result_size)
+	{
+		result[ret] = '\0';
+
+		if (testcol->reverse)
+			for (int i = 0; i < result_size; i++)
+				*((unsigned char *) result + i) ^= (unsigned char) 0xff;
+	}
+
+	return result_size;
+}
+
+static int32_t
+test_nextSortKeyPart(const UCollator *coll, UCharIterator *iter,
+					 uint32_t state[2], uint8_t *dest, int32_t count,
+					 UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_nextSortKeyPart(testcol->ucol, iter, state, dest,
+									   count, status);
+
+	if (testcol->reverse)
+		for (int i = 0; i < ret; i++)
+			*((unsigned char *) dest + i) ^= (unsigned char) 0xff;
+
+	/*
+	 * The following is not correct for cases where we finish precisely on the
+	 * boundary (i.e. count is exactly enough). To fix this we'd need to track
+	 * additional state across calls, which doesn't seem worth it for a test
+	 * case.
+	 */
+	if (count > ret && ret > 0)
+	{
+		if (testcol->reverse)
+			dest[ret] = 0xff;
+		else
+			dest[ret] = '\0';
+		return ret + 1;
+	}
+
+	return ret;
+}
+
+static int32_t
+test_strToUpper(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+static int32_t
+test_strToLower(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+pg_icu_library *
+test_get_icu_library(const char *locale, const char *version)
+{
+	pg_icu_library *lib;
+
+	if (test_icu_library != NULL)
+		return test_icu_library;
+
+	ereport(LOG, (errmsg("loading custom ICU provider for test_collation_lib_hooks")));
+
+	lib = MemoryContextAlloc(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = test_openCollator;
+	lib->closeCollator = test_closeCollator;
+	lib->getCollatorVersion = test_getCollatorVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = test_strcoll;
+	lib->strcollUTF8 = test_strcollUTF8;
+	lib->getSortKey = test_getSortKey;
+	lib->nextSortKeyPart = test_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = test_strToUpper;
+	lib->strToLower = test_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = test_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	test_icu_library = lib;
+	return lib;
+}
+
+#endif				/* USE_ICU */
diff --git a/src/test/modules/test_collation_lib_hooks/test_libc_hook.c b/src/test/modules/test_collation_lib_hooks/test_libc_hook.c
new file mode 100644
index 0000000000..dede3ccf64
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_libc_hook.c
@@ -0,0 +1,413 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_libc_hook.c
+ *		Code for testing collation provider libc hook.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_libc_hook.c
+ *
+ * Implements a custom libc-like collation provider library for testing the
+ * hooks. It accepts any collation name requested. All behave exactly like the
+ * "C" locale, except for the locale named "DESC", which reverses the sort
+ * order and reverses uppercase/lowercase behavior.
+ *
+ * The version is always reported as 3.14159, so loading it will cause a
+ * version mismatch warning.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "test_collation_lib_hooks.h"
+
+#ifdef __GLIBC__
+#include <gnu/libc-version.h>
+#endif
+
+#ifdef WIN32
+#include <shlwapi.h>
+#endif
+
+#define TEST_LIBC_VERSION "3.14159"
+#define LOCALE_NAME_LEN 64
+
+typedef struct test_locale_t
+{
+	bool reverse_sort;
+	bool reverse_case;
+	char lc_collate[LOCALE_NAME_LEN];
+	char lc_ctype[LOCALE_NAME_LEN];
+} test_locale_t;
+
+static pg_libc_library *test_libc_library = NULL;
+static test_locale_t current_setlocale = { .lc_collate = "C", .lc_ctype = "C" };
+static test_locale_t *current_uselocale = (test_locale_t *)LC_GLOBAL_LOCALE;
+
+#ifdef HAVE_LOCALE_T
+static locale_t c_locale_t = NULL;
+#endif
+
+void
+init_libc_hook()
+{
+	c_locale_t = newlocale(LC_ALL_MASK, "C", NULL);
+}
+
+static bool
+locale_is_reverse(const char *locale)
+{
+	if (strcmp(locale, "DESC") == 0)
+		return true;
+	else
+		return false;
+}
+
+static const char *
+test_libc_version()
+{
+	return TEST_LIBC_VERSION;
+}
+
+static char *
+test_setlocale(int category, const char *locale)
+{
+	if (category == LC_ALL)
+	{
+		if (locale)
+		{
+			if (locale_is_reverse(locale))
+			{
+				current_setlocale.reverse_sort = true;
+				current_setlocale.reverse_case = true;
+			}
+			else
+			{
+				current_setlocale.reverse_sort = false;
+				current_setlocale.reverse_case = false;
+			}
+			strncpy(current_setlocale.lc_collate, locale, LOCALE_NAME_LEN);
+			strncpy(current_setlocale.lc_ctype, locale, LOCALE_NAME_LEN);
+		}
+		return current_setlocale.lc_collate;
+	}
+	else if (category == LC_COLLATE)
+	{
+		if (locale)
+			strncpy(current_setlocale.lc_collate, locale, LOCALE_NAME_LEN);
+
+		return current_setlocale.lc_collate;
+	}
+	else if (category == LC_CTYPE)
+	{
+		if (locale)
+			strncpy(current_setlocale.lc_ctype, locale, LOCALE_NAME_LEN);
+
+		return current_setlocale.lc_ctype;
+	}
+	else
+		Assert(false);
+}
+
+#ifdef HAVE_LOCALE_T
+
+static locale_t
+test_newlocale(int category, const char *locale, locale_t baselocale_t)
+{
+	test_locale_t *newloc;
+
+	if (baselocale_t == NULL)
+	{
+		newloc = MemoryContextAllocZero(TopMemoryContext,
+										sizeof(test_locale_t));
+		strncpy(newloc->lc_collate, "C", LOCALE_NAME_LEN);
+		strncpy(newloc->lc_ctype, "C", LOCALE_NAME_LEN);
+	}
+	else
+		newloc = (test_locale_t *) baselocale_t;
+
+	if (category == LC_ALL_MASK || category == LC_COLLATE_MASK)
+	{
+		if (locale_is_reverse(locale))
+			newloc->reverse_sort = true;
+		else
+			newloc->reverse_sort = false;
+		strncpy(newloc->lc_collate, locale, LOCALE_NAME_LEN);
+	}
+	if (category == LC_ALL_MASK || category == LC_CTYPE_MASK)
+	{
+		if (locale_is_reverse(locale))
+			newloc->reverse_case = true;
+		else
+			newloc->reverse_case = false;
+		strncpy(newloc->lc_ctype, locale, LOCALE_NAME_LEN);
+	}
+
+	return (locale_t) newloc;
+}
+
+#ifndef WIN32
+static void
+test_freelocale(locale_t loc)
+{
+	pfree(loc);
+}
+#endif
+
+#ifdef WIN32
+static locale_t
+_test_create_locale(int category, const char *locale)
+{
+	return test_newlocale(category, locale, NULL);
+}
+#endif
+
+static locale_t
+test_uselocale(locale_t loc)
+{
+	test_locale_t *result = current_uselocale;
+
+	if (loc != NULL)
+		current_uselocale = (test_locale_t *) loc;
+
+	return (locale_t) result;
+}
+#endif			/* HAVE_LOCALE_T */
+
+static int
+test_strcoll(const char *s1, const char *s2)
+{
+	char			*save		= pstrdup(setlocale(LC_COLLATE, NULL));
+	int				 ret;
+
+	setlocale(LC_COLLATE, "C");
+	ret = strcoll(s1, s2);
+	setlocale(LC_COLLATE, save);
+	pfree(save);
+
+	if (current_setlocale.reverse_sort)
+		return -ret;
+	else
+		return ret;
+}
+
+static int
+test_wcscoll(const wchar_t *ws1, const wchar_t *ws2)
+{
+	char			*save		= pstrdup(setlocale(LC_COLLATE, NULL));
+	int				 ret;
+
+	setlocale(LC_COLLATE, "C");
+	ret = wcscoll(ws1, ws2);
+	setlocale(LC_COLLATE, save);
+	pfree(save);
+
+	if (current_setlocale.reverse_sort)
+		return -ret;
+	else
+		return ret;
+}
+
+static size_t
+test_strxfrm(char *s1, const char * s2, size_t n)
+{
+	char			*save		 = pstrdup(setlocale(LC_COLLATE, NULL));
+	int				 ret;
+	size_t			 result_size;
+
+	setlocale(LC_COLLATE, "C");
+	ret = strxfrm(s1, s2, n);
+	setlocale(LC_COLLATE, save);
+	pfree(save);
+
+	result_size = ret + 1;
+
+	if (n >= result_size)
+	{
+		s1[ret] = '\0';
+
+		if (current_setlocale.reverse_sort)
+			for (int i = 0; i < result_size; i++)
+				*((unsigned char *) s1 + i) ^= (unsigned char) 0xff;
+	}
+
+	return result_size;
+}
+
+#ifdef HAVE_LOCALE_T
+static int
+test_strcoll_l(const char *s1, const char *s2, locale_t loc)
+{
+	test_locale_t *testlocale = (test_locale_t *)loc;
+	int ret = strcoll_l(s1, s2, c_locale_t);
+
+	if (testlocale->reverse_sort)
+		return -ret;
+	else
+		return ret;
+}
+
+static int
+test_wcscoll_l(const wchar_t *ws1, const wchar_t *ws2, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	int ret = wcscoll_l(ws1, ws2, c_locale_t);
+
+	if (testlocale->reverse_sort)
+		return -ret;
+	else
+		return ret;
+}
+
+static size_t
+test_strxfrm_l(char *s1, const char * s2, size_t n, locale_t loc)
+{
+	test_locale_t *testlocale = (test_locale_t *)loc;
+	size_t ret = strxfrm_l(s1, s2, n, c_locale_t);
+	size_t result_size = ret + 1;
+
+	if (n >= result_size)
+	{
+		s1[ret] = '\0';
+
+		if (testlocale->reverse_sort)
+			for (int i = 0; i < result_size; i++)
+				*((unsigned char *) s1 + i) ^= (unsigned char) 0xff;
+	}
+
+	return result_size;
+}
+#endif			 /* HAVE_LOCALE_T */
+
+static int
+test_iswalnum(wint_t wc)
+{
+	char			*save		= pstrdup(setlocale(LC_CTYPE, NULL));
+	int				 ret;
+
+	setlocale(LC_CTYPE, "C");
+	ret = iswalnum(wc);
+	setlocale(LC_CTYPE, save);
+	pfree(save);
+
+	return ret;
+}
+
+static wint_t
+test_towlower(wint_t wc)
+{
+	char			*save		= pstrdup(setlocale(LC_CTYPE, NULL));
+	wint_t			 ret;
+
+	setlocale(LC_CTYPE, "C");
+	if (current_setlocale.reverse_case)
+		ret = towupper(wc);
+	else
+		ret = towlower(wc);
+	setlocale(LC_CTYPE, save);
+	pfree(save);
+
+	return ret;
+}
+
+static wint_t
+test_towupper(wint_t wc)
+{
+	char			*save		= pstrdup(setlocale(LC_CTYPE, NULL));
+	wint_t			 ret;
+
+	setlocale(LC_CTYPE, "C");
+	if (current_setlocale.reverse_case)
+		ret = towlower(wc);
+	else
+		ret = towupper(wc);
+	setlocale(LC_CTYPE, save);
+	pfree(save);
+
+	return ret;
+}
+
+#ifdef HAVE_LOCALE_T
+static int
+test_iswalnum_l(wint_t wc, locale_t locale)
+{
+	return iswalnum_l(wc, c_locale_t);
+}
+
+static wint_t
+test_towlower_l(wint_t wc, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+
+	if (testlocale->reverse_case)
+		return towupper_l(wc, c_locale_t);
+	else
+		return towlower_l(wc, c_locale_t);
+}
+
+static wint_t
+test_towupper_l(wint_t wc, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+
+	if (testlocale->reverse_case)
+		return towlower_l(wc, c_locale_t);
+	else
+		return towupper_l(wc, c_locale_t);
+}
+#endif			 /* HAVE_LOCALE_T */
+
+pg_libc_library *
+test_get_libc_library(const char *collate, const char *ctype,
+					  const char *version)
+{
+	pg_libc_library *lib = NULL;
+
+	if (test_libc_library != NULL)
+		return test_libc_library;
+
+	ereport(LOG, (errmsg("loading custom libc provider for test_collation_lib_hooks")));
+
+	lib = MemoryContextAlloc(TopMemoryContext, sizeof(pg_libc_library));
+	lib->libc_version = test_libc_version;
+	lib->c_setlocale = test_setlocale;
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	lib->c_newlocale = test_newlocale;
+	lib->c_freelocale = test_freelocale;
+	lib->c_uselocale = test_uselocale;
+#else
+	lib->_create_locale = _test_create_locale;
+#endif
+#endif
+	lib->c_wcstombs = wcstombs;
+	lib->c_mbstowcs = mbstowcs;
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	lib->c_wcstombs_l = wcstombs_l;
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	lib->c_mbstowcs_l = mbstowcs_l;
+#endif
+#endif
+	lib->c_strcoll = test_strcoll;
+	lib->c_wcscoll = test_wcscoll;
+	lib->c_strxfrm = test_strxfrm;
+#ifdef HAVE_LOCALE_T
+	lib->c_strcoll_l = test_strcoll_l;
+	lib->c_wcscoll_l = test_wcscoll_l;
+	lib->c_strxfrm_l = test_strxfrm_l;
+#endif
+	lib->c_iswalnum = test_iswalnum;
+	lib->c_towlower = test_towlower;
+	lib->c_towupper = test_towupper;
+#ifdef HAVE_LOCALE_T
+	lib->c_iswalnum_l = test_iswalnum_l;
+	lib->c_towlower_l = test_towlower_l;
+	lib->c_towupper_l = test_towupper_l;
+#endif
+
+	test_libc_library = lib;
+	return lib;
+}
-- 
2.34.1

Ted Yu

yuzhihong@gmail.com

about 3 years ago

In reply to: Jeff Davis (#1)

Re: Rework of collation code, extensibility

On Sat, Dec 17, 2022 at 7:14 PM Jeff Davis <pgsql@j-davis.com> wrote:

> Attached is a new patch series. I think there are enough changes that
> this has become more of a "rework" of the collation code rather than
> just a refactoring. This is a continuation of some prior work[1][2] in
> a new thread given its new scope.
>
> Benefits:
>
> 1. Clearer division of responsibilities.
> 2. More consistent between libc and ICU providers.
> 3. Hooks that allow extensions to replace collation provider libraries.
> 4. New tests for the collation provider library hooks.
>
> There are a lot of changes, and still some loose ends, but I believe a
> few of these patches are close to ready.
>
> This set of changes does not express an opinion on how we might want to
> support multiple provider libraries in core; but whatever we choose, it
> should be easier to accomplish. Right now, the hooks have limited
> information on which to make the choice for a specific version of a
> collation provider library, but that's because there's limited
> information in the catalog. If the discussion here[3] concludes in
> adding collation provider library or library version information to the
> catalog, we can add additional parameters to the hooks.
>
> [1] /messages/by-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com
> [2] /messages/by-id/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com
> [3] /messages/by-id/CA+hUKGLEqMhnpZrgAcisoUeYFGz8W6EWdhtK2h-4QN0iOSFRqw@mail.gmail.com
>
> --
> Jeff Davis
> PostgreSQL Contributor Team - AWS

Hi,

For pg_strxfrm_libc in v4-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch:

+#ifdef HAVE_LOCALE_T
+       if (locale)
+               return strxfrm_l(dest, src, destsize, locale->info.lt);
+       else
+#endif
+               return strxfrm(dest, src, destsize);

It seems the `else` is not needed (since when the if branch is taken, we
return from the func).
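
To make the point concrete, here is a minimal standalone sketch of the two spellings. It is not taken from the patch: the macro HAVE_FAST_PATH and the helpers fast_len()/slow_len() are hypothetical stand-ins for HAVE_LOCALE_T and the strxfrm_l()/strxfrm() calls. Both functions return the same value whether or not the macro is defined, so the choice is purely stylistic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HAVE_FAST_PATH 1        /* hypothetical feature macro */

/* Hypothetical helpers standing in for the two library calls. */
static size_t
fast_len(const char *s)
{
    return strlen(s);
}

static size_t
slow_len(const char *s)
{
    size_t      n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}

/* Spelling used in the patch: the "else" sits inside the #ifdef. */
static size_t
len_with_else(const char *s, int use_fast)
{
    assert(s != NULL);
    (void) use_fast;            /* unused when HAVE_FAST_PATH is undefined */
#ifdef HAVE_FAST_PATH
    if (use_fast)
        return fast_len(s);
    else
#endif
        return slow_len(s);
}

/* Suggested spelling: no "else", fall through to the default call. */
static size_t
len_without_else(const char *s, int use_fast)
{
    assert(s != NULL);
    (void) use_fast;
#ifdef HAVE_FAST_PATH
    if (use_fast)
        return fast_len(s);
#endif
    return slow_len(s);
}
```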

+ /* nul-terminate arguments */

nul-terminate -> null-terminate

For pg_strnxfrm(), I think `result` can be removed: we can directly return
the result of pg_strnxfrm_libc or pg_strnxfrm_icu.

Cheers

John Naylor

john.naylor@enterprisedb.com

about 3 years ago

In reply to: Ted Yu (#2)

Re: Rework of collation code, extensibility

On Sun, Dec 18, 2022 at 10:28 AM Ted Yu <yuzhihong@gmail.com> wrote:

> It seems the `else` is not needed (since when the if branch is taken, we
> return from the func).

By that same logic, this review comment is not needed, since compiler
vendors don't charge license fees by the number of keywords. ;-)
Joking aside, we don't really have a project style preference for this case.

> nul-terminate -> null-terminate

NUL is a common abbreviation for the zero byte (but not for zero pointers).
See the ascii manpage.
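
A small illustration of the distinction, as a sketch only: copy_nul_terminated() is a hypothetical helper written for this example, not something in the tree. NULL is the pointer checked on entry; NUL is the byte stored at the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dest, truncating to destsize - 1 bytes if necessary, and
 * always NUL-terminate the result. Returns the number of bytes copied,
 * not counting the terminator.
 */
static size_t
copy_nul_terminated(char *dest, size_t destsize, const char *src)
{
    size_t      n;

    /* NULL: the null *pointer* */
    assert(dest != NULL && src != NULL && destsize > 0);

    n = strlen(src);
    if (n >= destsize)
        n = destsize - 1;
    memcpy(dest, src, n);

    /* NUL: the zero *byte* that ends a C string */
    dest[n] = '\0';
    return n;
}
```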

--
John Naylor
EDB: http://www.enterprisedb.com

Ted Yu

yuzhihong@gmail.com

about 3 years ago

In reply to: John Naylor (#3)

Re: Rework of collation code, extensibility

On Sat, Dec 17, 2022 at 8:54 PM John Naylor <john.naylor@enterprisedb.com>
wrote:

>> nul-terminate -> null-terminate
>
> NUL is a common abbreviation for the zero byte (but not for zero
> pointers). See the ascii manpage.
>
> --
> John Naylor
> EDB: http://www.enterprisedb.com

Ah.

`nul-terminated` does appear in the codebase.
Should have checked earlier.

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#1)

7 attachment(s)

Re: Rework of collation code, extensibility

On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:

> Attached is a new patch series. I think there are enough changes that
> this has become more of a "rework" of the collation code rather than
> just a refactoring. This is a continuation of some prior work[1][2] in
> a new thread given its new scope.

Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

The libc hook support is still experimental, but the parts that work are
passing in CI, even on Windows. The challenges with libc hook support
are:

* It obviously doesn't replace all of libc, so the separation is not
as clean and there are a number of callers throughout the code that
don't necessarily care about specific collations.

* libc relies on setlocale() / uselocale(), which is global state and
not as easy to track.

* More platform issues (obviously) and harder to test.
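
To illustrate the global-state point, here is a minimal standalone sketch of the save/switch/restore pattern that setlocale() forces on a caller wanting one comparison under a specific collation. It uses only standard C; strcoll_in_c_locale() is a name made up for this example, and the test module does the equivalent with pstrdup()/pfree().

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Compare two strings under the "C" collation, then restore whatever
 * LC_COLLATE setting was active before the call.
 */
static int
strcoll_in_c_locale(const char *a, const char *b)
{
    const char *current;
    char       *saved;
    int         ret;

    assert(a != NULL && b != NULL);

    /*
     * setlocale(category, NULL) only queries the setting. The returned
     * string may be overwritten by a later setlocale() call, so copy it
     * before switching.
     */
    current = setlocale(LC_COLLATE, NULL);
    if (current == NULL)
        abort();
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        abort();
    strcpy(saved, current);

    setlocale(LC_COLLATE, "C");     /* process-wide change */
    ret = strcoll(a, b);
    setlocale(LC_COLLATE, saved);   /* put the global state back */

    free(saved);
    return ret;
}
```

Every call flips process-wide state twice, which is why code that cannot be routed through the hook is hard to reason about. The *_l() variants taking a locale_t avoid this where the platform provides them.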

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v5-0007-Add-test-module-for-libc-collation-provider-hook.patch (text/x-patch; charset=UTF-8)

From 9cc37a73f7094a871662835074a774acc1205d1e Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 21 Dec 2022 19:52:23 -0800
Subject: [PATCH v5 7/7] Add test module for libc collation provider hook.

---
 .../modules/test_collation_lib_hooks/Makefile |  10 +-
 .../test_collation_lib_hooks/meson.build      |  10 +
 .../test_collation_lib_hooks/t/002_libc.pl    | 157 +++++
 .../t/003_libc_mixed_collation.pl             |  66 +++
 .../t/004_libc_version.pl                     |  66 +++
 .../test_collation_lib_hooks.c                |   8 +-
 .../test_collation_lib_hooks.h                |   4 +
 .../test_collation_lib_hooks/test_libc_hook.c | 555 ++++++++++++++++++
 8 files changed, 874 insertions(+), 2 deletions(-)
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/002_libc.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/003_libc_mixed_collation.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/004_libc_version.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_libc_hook.c

diff --git a/src/test/modules/test_collation_lib_hooks/Makefile b/src/test/modules/test_collation_lib_hooks/Makefile
index 05948e555a..a800c9ea74 100644
--- a/src/test/modules/test_collation_lib_hooks/Makefile
+++ b/src/test/modules/test_collation_lib_hooks/Makefile
@@ -3,7 +3,7 @@
 MODULE_big = test_collation_lib_hooks
 OBJS = \
 	$(WIN32RES) \
-	test_collation_lib_hooks.o test_icu_hook.o
+	test_collation_lib_hooks.o test_icu_hook.o test_libc_hook.o
 PGFILEDESC = "test_collation_lib_hooks - test collation provider library hooks"
 
 EXTENSION = test_collation_lib_hooks
@@ -22,3 +22,11 @@ include $(top_srcdir)/contrib/contrib-global.mk
 endif
 
 export with_icu
+
+# TODO set environment variables:
+#  determine if built with glibc
+#  determine if LC_VERSION_MASK is defined
+
+ifeq ($(PORTNAME), win32)
+	export win32 = yes
+endif
diff --git a/src/test/modules/test_collation_lib_hooks/meson.build b/src/test/modules/test_collation_lib_hooks/meson.build
index 56b32b6cd1..a588751cc1 100644
--- a/src/test/modules/test_collation_lib_hooks/meson.build
+++ b/src/test/modules/test_collation_lib_hooks/meson.build
@@ -2,6 +2,7 @@
 
 test_collation_lib_hooks_sources = files(
   'test_collation_lib_hooks.c',
+  'test_libc_hook.c',
   'test_icu_hook.c',
 )
 
@@ -22,6 +23,9 @@ install_data(
   kwargs: contrib_data_args,
 )
 
+have_lc_version_mask = cc.has_header_symbol('locale.h', 'LC_VERSION_MASK')
+glibc = cc.has_header_symbol('features.h', '__GLIBC__')
+
 tests += {
   'name': 'test_collation_lib_hooks',
   'sd': meson.current_source_dir(),
@@ -29,9 +33,15 @@ tests += {
   'tap': {
     'tests': [
       't/001_icu.pl',
+      't/002_libc.pl',
+      't/003_libc_mixed_collation.pl',
+      't/004_libc_version.pl',
     ],
     'env': {
       'with_icu': icu.found() ? 'yes' : 'no',
+      'win32': (host_system == 'windows') ? 'yes' : 'no',
+      'glibc': glibc ? 'yes' : 'no',
+      'have_lc_version_mask': have_lc_version_mask ? 'yes' : 'no',
     },
   },
 }
diff --git a/src/test/modules/test_collation_lib_hooks/t/002_libc.pl b/src/test/modules/test_collation_lib_hooks/t/002_libc.pl
new file mode 100644
index 0000000000..e43a0916d3
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/002_libc.pl
@@ -0,0 +1,157 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+# setup
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_asc (PROVIDER=libc, LOCALE='ASC')]);
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_desc (PROVIDER=libc, LOCALE='DESC')]);
+
+$node->safe_psql('postgres', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('postgres',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+my $sort_asc_expected = "aBcD
+fGhI
+wXyZ";
+
+my $sort_desc_expected = "wXyZ
+fGhI
+aBcD";
+
+# test comparison
+
+my $comparison_asc =
+  $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_asc < 'wXyZ' COLLATE test_asc]);
+is($comparison_asc, 't',
+	'correct comparison'
+);
+
+# test desc comparison
+
+my $comparison_desc =
+  $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_desc < 'wXyZ' COLLATE test_desc]);
+is($comparison_desc, 'f',
+	'correct desc comparison'
+);
+
+# test asc sort with trust_strxfrm = false
+
+my $sort_asc =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = false;
+         SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($sort_asc, $sort_asc_expected,
+	'correct ascending sort (trust_strxfrm = false)'
+);
+
+# test desc sort with trust_strxfrm = false
+
+my $sort_desc =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = false;
+         SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($sort_desc, $sort_desc_expected,
+	'correct descending sort (trust_strxfrm = false)'
+);
+
+# test asc sort with trust_strxfrm = true
+
+my $strxfrm_asc =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = true;
+         SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($strxfrm_asc, $sort_asc_expected,
+	'correct ascending sort (trust_strxfrm = true)'
+);
+
+# test desc sort with trust_strxfrm = true
+
+my $strxfrm_desc =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = true;
+         SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($strxfrm_desc, $sort_desc_expected,
+	'correct descending sort (trust_strxfrm = true)'
+);
+
+# test lower/upper
+
+my $tcase =
+  $node->safe_psql('postgres',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_asc),
+                upper('aBcDfgHiwXyZ' collate test_asc)]);
+is($tcase, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+	'correct lowercase and uppercase'
+);
+
+# test desc lower/upper
+
+my $tcase_desc =
+  $node->safe_psql('postgres',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_desc),
+                upper('aBcDfgHiwXyZ' collate test_desc)]);
+is($tcase_desc, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct desc lowercase and uppercase'
+);
+
+if ($ENV{win32} ne 'yes') {
+  $node->safe_psql('postgres',
+     qq[CREATE COLLATION test_mixed_asc_desc
+         (PROVIDER=libc, LC_COLLATE='ASC', LC_CTYPE='DESC')]);
+  $node->safe_psql('postgres',
+     qq[CREATE COLLATION test_mixed_desc_asc
+         (PROVIDER=libc, LC_COLLATE='DESC', LC_CTYPE='ASC')]);
+
+  my $mcomparison_asc =
+    $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_mixed_asc_desc <
+                'wXyZ' COLLATE test_mixed_asc_desc]);
+  is($mcomparison_asc, 't',
+	'correct mixed asc/desc comparison'
+  );
+
+  my $mcomparison_desc =
+    $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_mixed_desc_asc <
+                'wXyZ' COLLATE test_mixed_desc_asc]);
+  is($mcomparison_desc, 'f',
+	'correct mixed desc/asc comparison'
+  );
+
+  my $mcase_asc =
+    $node->safe_psql('postgres',
+	    qq[SELECT lower('aBcDfgHiwXyZ' collate test_mixed_asc_desc),
+                  upper('aBcDfgHiwXyZ' collate test_mixed_asc_desc)]);
+  is($mcase_asc, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+    'correct case mixed asc/desc'
+  );
+
+  my $mcase_desc =
+    $node->safe_psql('postgres',
+	    qq[SELECT lower('aBcDfgHiwXyZ' collate test_mixed_desc_asc),
+                  upper('aBcDfgHiwXyZ' collate test_mixed_desc_asc)]);
+  is($mcase_desc, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+    'correct case mixed desc/asc'
+  );
+}
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/t/003_libc_mixed_collation.pl b/src/test/modules/test_collation_lib_hooks/t/003_libc_mixed_collation.pl
new file mode 100644
index 0000000000..2ef18f797c
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/003_libc_mixed_collation.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test mixed collations with differing lc_collate/lc_ctype
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{win32} eq 'yes')
+{
+	plan skip_all => 'windows does not support mixed libc collations';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+$node->safe_psql('postgres',
+   qq[CREATE COLLATION test_mixed_asc_desc
+         (PROVIDER=libc, LC_COLLATE='ASC', LC_CTYPE='DESC')]);
+$node->safe_psql('postgres',
+   qq[CREATE COLLATION test_mixed_desc_asc
+         (PROVIDER=libc, LC_COLLATE='DESC', LC_CTYPE='ASC')]);
+
+my $mcomparison_asc =
+  $node->safe_psql('postgres',
+    qq[SELECT 'aBcD' COLLATE test_mixed_asc_desc <
+         'wXyZ' COLLATE test_mixed_asc_desc]);
+is($mcomparison_asc, 't',
+  'correct mixed asc/desc comparison'
+);
+
+my $mcomparison_desc =
+  $node->safe_psql('postgres',
+    qq[SELECT 'aBcD' COLLATE test_mixed_desc_asc <
+              'wXyZ' COLLATE test_mixed_desc_asc]);
+is($mcomparison_desc, 'f',
+'correct mixed desc/asc comparison'
+);
+
+my $mcase_asc =
+  $node->safe_psql('postgres',
+    qq[SELECT lower('aBcDfgHiwXyZ' collate test_mixed_asc_desc),
+              upper('aBcDfgHiwXyZ' collate test_mixed_asc_desc)]);
+is($mcase_asc, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+  'correct case mixed asc/desc'
+);
+
+my $mcase_desc =
+  $node->safe_psql('postgres',
+    qq[SELECT lower('aBcDfgHiwXyZ' collate test_mixed_desc_asc),
+              upper('aBcDfgHiwXyZ' collate test_mixed_desc_asc)]);
+is($mcase_desc, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+  'correct case mixed desc/asc'
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/t/004_libc_version.pl b/src/test/modules/test_collation_lib_hooks/t/004_libc_version.pl
new file mode 100644
index 0000000000..201fa98e6b
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/004_libc_version.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test collation versions (platform-specific)
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $version_asc_expected;
+my $version_desc_expected;
+
+if ($ENV{glibc} eq 'yes') {
+	$version_asc_expected = '3.14159';
+	$version_desc_expected = '3.14159';
+} elsif ($ENV{have_lc_version_mask} eq 'yes') {
+	$version_asc_expected = '3.14';
+	$version_desc_expected = '6.28';
+} elsif ($ENV{win32} eq 'yes') {
+	$version_asc_expected = '3.14,3.14';
+	$version_desc_expected = '6.28,6.28';
+} else {
+	plan skip_all => 'platform does not support libc collation versions';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+# setup
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_asc (PROVIDER=libc, LOCALE='ASC')]);
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_desc (PROVIDER=libc, LOCALE='DESC')]);
+
+$node->safe_psql('postgres', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('postgres',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+# check versions
+
+my $pg_version = $node->safe_psql('postgres', qq[SELECT version()]);
+
+my $version_asc =
+  $node->safe_psql('postgres',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_asc']);
+is($version_asc, $version_asc_expected,
+	"collation test_asc has correct version $version_asc_expected"
+);
+
+my $version_desc =
+  $node->safe_psql('postgres',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_desc']);
+is($version_desc, $version_desc_expected,
+	"collation test_desc has correct version $version_desc_expected"
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
index 599ec61239..5650840b4d 100644
--- a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
@@ -9,7 +9,7 @@
  * IDENTIFICATION
  *		src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
  *
- * Test implementation of icu-like collation provider.
+ * Test implementations of libc-like and icu-like collation providers.
  *
  * -------------------------------------------------------------------------
  */
@@ -21,6 +21,7 @@
 
 #include "test_collation_lib_hooks.h"
 
+static get_libc_library_hook_type prev_get_libc_library_hook = NULL;
 #ifdef USE_ICU
 static get_icu_library_hook_type prev_get_icu_library_hook = NULL;
 #endif
@@ -36,8 +37,13 @@ _PG_init(void)
 	if (!process_shared_preload_libraries_in_progress)
 		ereport(ERROR, (errmsg("test_collation_lib_hooks must be loaded via shared_preload_libraries")));
 
+	prev_get_libc_library_hook = get_libc_library_hook;
+	get_libc_library_hook = test_get_libc_library;
+
 #ifdef USE_ICU
 	prev_get_icu_library_hook = get_icu_library_hook;
 	get_icu_library_hook = test_get_icu_library;
 #endif
+
+	init_libc_hook();
 }
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
index e6ee457ab3..94ea943b97 100644
--- a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
@@ -20,6 +20,10 @@
 #include "utils/pg_locale.h"
 #include "utils/pg_locale_internal.h"
 
+extern void init_libc_hook(void);
+extern pg_libc_library *test_get_libc_library(const char *collate,
+											  const char *ctype,
+											  const char *version);
 #ifdef USE_ICU
 extern pg_icu_library *test_get_icu_library(const char *locale,
 											const char *version);
diff --git a/src/test/modules/test_collation_lib_hooks/test_libc_hook.c b/src/test/modules/test_collation_lib_hooks/test_libc_hook.c
new file mode 100644
index 0000000000..a8eb5d8dd2
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_libc_hook.c
@@ -0,0 +1,555 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_libc_hook.c
+ *		Code for testing collation provider libc hook.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_libc_hook.c
+ *
+ * Implements a custom libc-like collation provider library for testing the
+ * hooks. It accepts any collation name requested. All behave exactly like the
+ * "C" locale, except for the locale named "DESC", which reverses the sort
+ * order and reverses uppercase/lowercase behavior.
+ *
+ * The reported version is fake (3.14159 on glibc; based on 3.14 or 6.28
+ * elsewhere), so loading it will cause a version mismatch warning.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "test_collation_lib_hooks.h"
+
+#ifdef __GLIBC__
+#include <gnu/libc-version.h>
+#endif
+
+#ifdef WIN32
+#include <shlwapi.h>
+#endif
+
+#define LOCALE_NAME_LEN 64
+
+typedef struct test_locale_t
+{
+	bool reverse_sort;
+	bool reverse_case;
+	char lc_collate[LOCALE_NAME_LEN];
+	char lc_ctype[LOCALE_NAME_LEN];
+} test_locale_t;
+
+static pg_libc_library *test_libc_library = NULL;
+static test_locale_t current_setlocale = {
+	.lc_collate = "C",
+	.lc_ctype = "C",
+	.reverse_sort = false,
+	.reverse_case = false
+};
+
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+static test_locale_t *current_uselocale = &current_setlocale;
+#endif
+#endif
+
+#ifdef HAVE_LOCALE_T
+static locale_t c_locale_t = NULL;
+#endif
+
+void
+init_libc_hook()
+{
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	c_locale_t = newlocale(LC_ALL_MASK, "C", NULL);
+#else
+	c_locale_t = _create_locale(LC_ALL, "C");
+#endif
+#endif
+}
+
+#ifdef HAVE_LOCALE_T
+static test_locale_t *
+current_locale(void)
+{
+#ifndef WIN32
+	return current_uselocale;
+#else
+	return &current_setlocale;
+#endif
+}
+#endif
+
+static bool
+locale_is_reverse(const char *locale)
+{
+	if (strcmp(locale, "DESC") == 0)
+		return true;
+	else
+		return false;
+}
+
+static const char *
+test_libc_version(void)
+{
+	return "3.14159";
+}
+
+#ifdef WIN32
+bool
+test_GetNLSVersionEx(NLS_FUNCTION function, LPCWSTR lpLocaleName,
+					 LPNLSVERSIONINFOEX lpVersionInformation)
+{
+	Assert(function == COMPARE_STRING);
+	if (wcscmp(lpLocaleName, L"DESC") == 0)
+	{
+		lpVersionInformation->dwNLSVersion = (6 << 8) | 28;
+		lpVersionInformation->dwDefinedVersion = (6 << 8) | 28;
+	}
+	else
+	{
+		lpVersionInformation->dwNLSVersion = (3 << 8) | 14;
+		lpVersionInformation->dwDefinedVersion = (3 << 8) | 14;
+	}
+
+	return true;
+}
+#endif
+
+static char *
+test_setlocale(int category, const char *locale)
+{
+	Assert (category == LC_COLLATE || category == LC_CTYPE ||
+			category == LC_ALL);
+
+	if (category == LC_ALL)
+	{
+		if (locale)
+		{
+			current_setlocale.reverse_sort = locale_is_reverse(locale);
+			current_setlocale.reverse_case = locale_is_reverse(locale);
+			strncpy(current_setlocale.lc_collate, locale, LOCALE_NAME_LEN);
+			strncpy(current_setlocale.lc_ctype, locale, LOCALE_NAME_LEN);
+		}
+		return current_setlocale.lc_collate;
+	}
+	else if (category == LC_COLLATE)
+	{
+		if (locale)
+		{
+			current_setlocale.reverse_sort = locale_is_reverse(locale);
+			strncpy(current_setlocale.lc_collate, locale, LOCALE_NAME_LEN);
+		}
+
+		return current_setlocale.lc_collate;
+	}
+	else if (category == LC_CTYPE)
+	{
+		if (locale)
+		{
+			current_setlocale.reverse_case = locale_is_reverse(locale);
+			strncpy(current_setlocale.lc_ctype, locale, LOCALE_NAME_LEN);
+		}
+
+		return current_setlocale.lc_ctype;
+	}
+
+	return NULL;
+}
+
+#ifdef HAVE_LOCALE_T
+
+#ifndef WIN32
+
+static locale_t
+test_newlocale(int category, const char *locale, locale_t baselocale_t)
+{
+	test_locale_t *newloc;
+
+	Assert(baselocale_t != LC_GLOBAL_LOCALE);
+	Assert((test_locale_t *) baselocale_t != &current_setlocale);
+
+	if (baselocale_t == NULL)
+	{
+		newloc = MemoryContextAlloc(TopMemoryContext,
+									sizeof(test_locale_t));
+		strncpy(newloc->lc_collate, "C", LOCALE_NAME_LEN);
+		strncpy(newloc->lc_ctype, "C", LOCALE_NAME_LEN);
+		newloc->reverse_sort = false;
+		newloc->reverse_case = false;
+	}
+	else
+		newloc = (test_locale_t *) baselocale_t;
+
+	if ((category & LC_COLLATE_MASK) != 0)
+	{
+		newloc->reverse_sort = locale_is_reverse(locale);
+		strncpy(newloc->lc_collate, locale, LOCALE_NAME_LEN);
+	}
+	if ((category & LC_CTYPE_MASK) != 0)
+	{
+		newloc->reverse_case = locale_is_reverse(locale);
+		strncpy(newloc->lc_ctype, locale, LOCALE_NAME_LEN);
+	}
+
+	return (locale_t) newloc;
+}
+
+static void
+test_freelocale(locale_t loc)
+{
+	Assert(loc != LC_GLOBAL_LOCALE);
+	Assert((test_locale_t *)loc != &current_setlocale);
+	pfree(loc);
+}
+
+static locale_t
+test_uselocale(locale_t loc)
+{
+	test_locale_t *result = current_uselocale;
+
+	if (loc != NULL)
+	{
+		if (loc == LC_GLOBAL_LOCALE)
+			current_uselocale = &current_setlocale;
+		else
+			current_uselocale = (test_locale_t *) loc;
+	}
+
+	if (result == &current_setlocale)
+		return LC_GLOBAL_LOCALE;
+	else
+		return (locale_t) result;
+}
+
+#ifdef LC_VERSION_MASK
+static const char *
+test_querylocale(int mask, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *)locale;
+	Assert((mask & LC_VERSION_MASK) != 0);
+	if (testlocale->reverse_sort)
+		return "6.28";
+	else
+		return "3.14";
+}
+#endif			/* LC_VERSION_MASK */
+
+#else			/* WIN32 */
+static locale_t
+_test_create_locale(int category, const char *locale)
+{
+	test_locale_t *newloc;
+
+	newloc = MemoryContextAlloc(TopMemoryContext,
+								sizeof(test_locale_t));
+	strncpy(newloc->lc_collate, "C", LOCALE_NAME_LEN);
+	strncpy(newloc->lc_ctype, "C", LOCALE_NAME_LEN);
+	newloc->reverse_sort = false;
+	newloc->reverse_case = false;
+
+	if (category == LC_ALL || category == LC_COLLATE)
+	{
+		if (locale_is_reverse(locale))
+			newloc->reverse_sort = true;
+		else
+			newloc->reverse_sort = false;
+		strncpy(newloc->lc_collate, locale, LOCALE_NAME_LEN);
+	}
+	if (category == LC_ALL || category == LC_CTYPE)
+	{
+		if (locale_is_reverse(locale))
+			newloc->reverse_case = true;
+		else
+			newloc->reverse_case = false;
+		strncpy(newloc->lc_ctype, locale, LOCALE_NAME_LEN);
+	}
+
+	return (locale_t) newloc;
+}
+#endif			/* WIN32 */
+
+#endif			/* HAVE_LOCALE_T */
+
+static size_t
+test_wcstombs(char *dest, const wchar_t *src, size_t n)
+{
+	return wcstombs(dest, src, n);
+}
+
+static size_t
+test_mbstowcs(wchar_t *dest, const char *src, size_t n)
+{
+	return mbstowcs(dest, src, n);
+}
+
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+static size_t
+test_wcstombs_l(char *dest, const wchar_t *src, size_t n, locale_t loc)
+{
+	return wcstombs(dest, src, n);
+}
+#endif
+#ifdef HAVE_MBSTOWCS_L
+static size_t
+test_mbstowcs_l(wchar_t *dest, const char *src, size_t n, locale_t loc)
+{
+	return mbstowcs(dest, src, n);
+}
+#endif
+#endif
+
+static int
+test_strcoll_internal(const char *s1, const char *s2, bool reverse)
+{
+	int ret = strcmp(s1, s2);
+	return reverse ? -ret : ret;
+}
+
+static int
+test_strcoll(const char *s1, const char *s2)
+{
+	bool reverse = current_locale()->reverse_sort;
+	return test_strcoll_internal(s1, s2, reverse);
+}
+
+static int
+test_wcscoll_internal(const wchar_t *ws1, const wchar_t *ws2, bool reverse)
+{
+	int ret = wcscmp(ws1, ws2);
+	return reverse ? -ret : ret;
+}
+static int
+test_wcscoll(const wchar_t *ws1, const wchar_t *ws2)
+{
+	bool reverse = current_locale()->reverse_sort;
+	return test_wcscoll_internal(ws1, ws2, reverse);
+}
+
+static size_t
+test_strxfrm_internal(char *s1, const char *s2, size_t n, bool reverse)
+{
+	size_t			 result_size = strlen(s2) + 1;
+
+	if (n > result_size)
+	{
+		strncpy(s1, s2, n);
+		s1[result_size] = '\0';
+
+		if (reverse)
+		{
+			unsigned char *dest = (unsigned char *)s1;
+			for (int i = 0; i < result_size; i++)
+				dest[i] ^= (unsigned char) 0xFF;
+		}
+	}
+
+	return result_size;
+}
+
+static size_t
+test_strxfrm(char *s1, const char * s2, size_t n)
+{
+	bool reverse = current_locale()->reverse_sort;
+	return test_strxfrm_internal(s1, s2, n, reverse);
+}
+
+#ifdef HAVE_LOCALE_T
+static int
+test_strcoll_l(const char *s1, const char *s2, locale_t loc)
+{
+	test_locale_t *testlocale = (test_locale_t *)loc;
+	bool reverse = testlocale->reverse_sort;
+	return test_strcoll_internal(s1, s2, reverse);
+}
+
+static int
+test_wcscoll_l(const wchar_t *ws1, const wchar_t *ws2, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_sort;
+	return test_wcscoll_internal(ws1, ws2, reverse);
+}
+
+static size_t
+test_strxfrm_l(char *s1, const char * s2, size_t n, locale_t loc)
+{
+	test_locale_t *testlocale = (test_locale_t *) loc;
+	bool reverse = testlocale->reverse_sort;
+	return test_strxfrm_internal(s1, s2, n, reverse);
+}
+#endif			 /* HAVE_LOCALE_T */
+
+static int
+test_case_internal(int c, bool toupper)
+{
+	if (toupper && ('a' <= c && c <= 'z'))
+		return c - ('a' - 'A');
+	else if (!toupper && ('A' <= c && c <= 'Z'))
+		return c + ('a' - 'A');
+	else
+		return c;
+}
+
+static int
+test_tolower(int c)
+{
+	bool reverse = current_locale()->reverse_case;
+	return test_case_internal(c, reverse ? true : false);
+}
+
+static int
+test_toupper(int c)
+{
+	bool reverse = current_locale()->reverse_case;
+	return test_case_internal(c, reverse ? false : true);
+}
+
+static int
+test_iswalnum_internal(wint_t wc)
+{
+	if (('A' <= wc && wc <= 'Z') ||
+		('a' <= wc && wc <= 'z') ||
+		('0' <= wc && wc <= '9'))
+		return 1;
+	return 0;
+}
+
+static int
+test_iswalnum(wint_t wc)
+{
+	return test_iswalnum_internal(wc);
+}
+
+static wint_t
+test_wcase_internal(wint_t wc, bool toupper)
+{
+	if (toupper && ('a' <= wc && wc <= 'z'))
+		return wc - ('a' - 'A');
+	else if (!toupper && ('A' <= wc && wc <= 'Z'))
+		return wc + ('a' - 'A');
+	else
+		return wc;
+}
+
+static wint_t
+test_towlower(wint_t wc)
+{
+	bool reverse = current_locale()->reverse_case;
+	return test_wcase_internal(wc, reverse ? true : false);
+}
+
+static wint_t
+test_towupper(wint_t wc)
+{
+	bool reverse = current_locale()->reverse_case;
+	return test_wcase_internal(wc, reverse ? false : true);
+}
+
+#ifdef HAVE_LOCALE_T
+static int
+test_tolower_l(int c, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_case;
+	return test_case_internal(c, reverse ? true : false);
+}
+
+static int
+test_toupper_l(int c, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_case;
+	return test_case_internal(c, reverse ? false : true);
+}
+
+static int
+test_iswalnum_l(wint_t wc, locale_t locale)
+{
+	return test_iswalnum_internal(wc);
+}
+
+static wint_t
+test_towlower_l(wint_t wc, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_case;
+	return test_wcase_internal(wc, reverse ? true : false);
+}
+
+static wint_t
+test_towupper_l(wint_t wc, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_case;
+	return test_wcase_internal(wc, reverse ? false : true);
+}
+#endif			 /* HAVE_LOCALE_T */
+
+pg_libc_library *
+test_get_libc_library(const char *collate, const char *ctype,
+					  const char *version)
+{
+	pg_libc_library *lib = NULL;
+
+	if (test_libc_library != NULL)
+		return test_libc_library;
+
+	ereport(LOG, (errmsg("loading custom libc provider for test_collation_lib_hooks")));
+
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_libc_library));
+#if defined(__GLIBC__)
+	lib->libc_version = test_libc_version;
+#elif defined(WIN32)
+	lib->GetNLSVersionEx = test_GetNLSVersionEx;
+#endif
+	lib->c_setlocale = test_setlocale;
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	lib->c_newlocale = test_newlocale;
+	lib->c_freelocale = test_freelocale;
+	lib->c_uselocale = test_uselocale;
+#ifdef LC_VERSION_MASK
+	lib->c_querylocale = test_querylocale;
+#endif
+#else
+	lib->_create_locale = _test_create_locale;
+#endif
+#endif
+	lib->c_wcstombs = test_wcstombs;
+	lib->c_mbstowcs = test_mbstowcs;
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	lib->c_wcstombs_l = test_wcstombs_l;
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	lib->c_mbstowcs_l = test_mbstowcs_l;
+#endif
+#endif
+	lib->c_strcoll = test_strcoll;
+	lib->c_wcscoll = test_wcscoll;
+	lib->c_strxfrm = test_strxfrm;
+#ifdef HAVE_LOCALE_T
+	lib->c_strcoll_l = test_strcoll_l;
+	lib->c_wcscoll_l = test_wcscoll_l;
+	lib->c_strxfrm_l = test_strxfrm_l;
+#endif
+	lib->c_tolower = test_tolower;
+	lib->c_toupper = test_toupper;
+	lib->c_iswalnum = test_iswalnum;
+	lib->c_towlower = test_towlower;
+	lib->c_towupper = test_towupper;
+#ifdef HAVE_LOCALE_T
+	lib->c_tolower_l = test_tolower_l;
+	lib->c_toupper_l = test_toupper_l;
+	lib->c_iswalnum_l = test_iswalnum_l;
+	lib->c_towlower_l = test_towlower_l;
+	lib->c_towupper_l = test_towupper_l;
+#endif
+
+	test_libc_library = lib;
+	return lib;
+}
-- 
2.34.1
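
For readers skimming the test provider above: the "DESC" locale works by
negating the comparison result and by complementing the bytes of the
strxfrm key, terminator included, so that a plain strcmp() on two keys
agrees with the reversed comparison. The following is a minimal standalone
sketch of that idea, not the code from the patch. The names sketch_strcoll
and sketch_strxfrm are made up for illustration, and unlike the patch the
sketch returns the key length without the trailing NUL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compare as in the "C" locale; flip the sign when reverse is set. */
static int
sketch_strcoll(const char *s1, const char *s2, bool reverse)
{
	int			ret = strcmp(s1, s2);

	/* normalize to -1/0/1 so that negation is always safe */
	ret = (ret > 0) - (ret < 0);
	return reverse ? -ret : ret;
}

/*
 * strxfrm-like key builder (hypothetical helper).  For the reversed case,
 * every byte of the source, including its terminator, is XORed with 0xFF
 * and a fresh terminator is appended.  Complementing the terminator is
 * what makes a prefix sort after the longer string.
 *
 * Returns the key length excluding the final NUL.  The key is written
 * only if it fits in n bytes, as with strxfrm().
 */
static size_t
sketch_strxfrm(char *dest, const char *src, size_t n, bool reverse)
{
	size_t		len = strlen(src);
	size_t		keylen = reverse ? len + 1 : len;

	assert(dest != NULL || n == 0);

	if (n > keylen)
	{
		memcpy(dest, src, len + 1);		/* includes the terminator */
		if (reverse)
		{
			for (size_t i = 0; i < keylen; i++)
				dest[i] = (char) ((unsigned char) dest[i] ^ 0xFF);
			dest[keylen] = '\0';
		}
	}

	return keylen;
}
```

With these helpers, sketch_strcoll("a", "ab", true) is positive, and the
keys built for "a" and "ab" with reverse set compare the same way under
strcmp(). One limitation of the byte-complement trick: a 0xFF byte in the
input becomes 0x00 and would cut the key short.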

Attachment: v5-0006-Support-multiple-libc-collation-provider-librarie.patch (text/x-patch; charset=UTF-8)

From 83a4f4d1a4a885075488f400cc2b7249495c560a Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 14 Dec 2022 15:12:35 -0800
Subject: [PATCH v5 6/7] Support multiple libc collation provider libraries.

Introduce structure pg_libc_library, which holds pointers to each
required libc method related to locale_t, encoding, collation, or
ctype. Store this as part of pg_locale_t. Each call to one of these
libc functions instead goes through this structure, so that it can
more easily be replaced by a non-builtin libc library.

Offer a hook to allow an extension to control which libc library
is loaded.

Adds a test module to test the hook functionality.

TODO: loose ends in regex code and possibly other places where libc
functions are called directly.

Author: Jeff Davis
---
 src/backend/utils/adt/formatting.c     |  50 +++++--
 src/backend/utils/adt/pg_locale.c      | 199 +++++++++++++++++++------
 src/include/utils/pg_locale_internal.h |  85 ++++++++++-
 3 files changed, 274 insertions(+), 60 deletions(-)
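
The dispatch pattern described in the commit message can be sketched
outside of Postgres roughly as follows. This is an illustration under
stated assumptions, not the patch's code: the sketch_* and example_* names
are hypothetical, the struct carries only two of the many function
pointers, and the example hook returns NULL for unknown names to show the
fallback path (the test module's hook accepts every name instead).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for pg_libc_library: a table of libc entry points. */
typedef struct sketch_libc_library
{
	int			(*c_strcoll) (const char *, const char *);
	int			(*c_toupper) (int);
} sketch_libc_library;

typedef sketch_libc_library *(*sketch_get_library_hook_type) (
	const char *collate, const char *ctype, const char *version);

/* An extension would set this at load time; NULL means "no hook". */
static sketch_get_library_hook_type sketch_get_library_hook = NULL;

/* Build the table once from the functions the program is linked with. */
static sketch_libc_library *
sketch_get_builtin_library(void)
{
	static sketch_libc_library builtin;
	static int	initialized = 0;

	if (!initialized)
	{
		builtin.c_strcoll = strcoll;
		builtin.c_toupper = toupper;
		initialized = 1;
	}
	return &builtin;
}

/* Ask the hook first; fall back to the builtin table if it declines. */
static sketch_libc_library *
sketch_get_library(const char *collate, const char *ctype,
				   const char *version)
{
	sketch_libc_library *lib = NULL;

	if (sketch_get_library_hook != NULL)
		lib = sketch_get_library_hook(collate, ctype, version);
	if (lib == NULL)
		lib = sketch_get_builtin_library();

	assert(lib != NULL && lib->c_strcoll != NULL);
	return lib;
}

/* Example replacement: byte-wise comparison with the order reversed. */
static int
example_reversed_strcoll(const char *a, const char *b)
{
	return strcmp(b, a);
}

/* Example hook: handles only the "DESC" collation, declines the rest. */
static sketch_libc_library *
example_hook(const char *collate, const char *ctype, const char *version)
{
	static sketch_libc_library custom = {example_reversed_strcoll, toupper};

	(void) ctype;
	(void) version;

	if (collate != NULL && strcmp(collate, "DESC") == 0)
		return &custom;
	return NULL;
}
```

Callers never name strcoll() directly; they fetch the table once and call
lib->c_strcoll(), which is what lets a hook swap the implementation per
collation.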

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 7f318fe2c7..e9020fef29 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -69,7 +69,6 @@
 #include <math.h>
 #include <float.h>
 #include <limits.h>
-#include <wctype.h>
 
 #ifdef USE_ICU
 #include <unicode/ustring.h>
@@ -1729,10 +1728,16 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						workspace[curr_char] = libc->c_towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+					}
 					else
 #endif
-						workspace[curr_char] = towlower(workspace[curr_char]);
+					{
+						pg_libc_library *libc = get_default_libc_library();
+						workspace[curr_char] = libc->c_towlower(workspace[curr_char]);
+					}
 				}
 
 				/*
@@ -1762,7 +1767,11 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						*p = libc->c_tolower_l((unsigned char) *p,
+											   mylocale->info.libc.lt);
+					}
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1852,10 +1861,16 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						workspace[curr_char] = libc->c_towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+					}
 					else
 #endif
-						workspace[curr_char] = towupper(workspace[curr_char]);
+					{
+						pg_libc_library *libc = get_default_libc_library();
+						workspace[curr_char] = libc->c_towupper(workspace[curr_char]);
+					}
 				}
 
 				/*
@@ -1885,7 +1900,11 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						*p = libc->c_toupper_l((unsigned char) *p,
+											   mylocale->info.libc.lt);
+					}
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1977,19 +1996,21 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
 					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+							workspace[curr_char] = libc->c_towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+							workspace[curr_char] = libc->c_towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
 					{
+						pg_libc_library *libc = get_default_libc_library();
 						if (wasalnum)
-							workspace[curr_char] = towlower(workspace[curr_char]);
+							workspace[curr_char] = libc->c_towlower(workspace[curr_char]);
 						else
-							workspace[curr_char] = towupper(workspace[curr_char]);
+							workspace[curr_char] = libc->c_towupper(workspace[curr_char]);
 						wasalnum = iswalnum(workspace[curr_char]);
 					}
 				}
@@ -2022,10 +2043,13 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
 					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
+							*p = libc->c_tolower_l((unsigned char) *p,
+												   mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+							*p = libc->c_toupper_l((unsigned char) *p,
+												   mylocale->info.libc.lt);
 						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index f65cd22fbd..f30c463881 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -115,19 +115,19 @@ char	   *localized_full_months[12 + 1];
 /*
  * The precise version of a collation provider library is important, because
  * subtle changes in collation between library versions can corrupt
- * indexes. This hook allows control over how collation provider libraries are
- * loaded.
+ * indexes. These hooks allow control over how collation provider libraries
+ * are loaded.
  *
  * If the hook is not set, or if it returns NULL, Postgres constructs the
- * pg_icu_library structure from the symbols Postgres is linked with at build
- * time.
+ * pg_libc_library or pg_icu_library structures from the symbols Postgres is
+ * linked with at build time.
  *
- * If the hook is set, it can instead construct the pg_icu_library structure
- * using custom logic. Ordinarily, this custom logic would involve finding a
- * specific known version of the collation provider library, and dynamically
- * loading the necessary symbols. If the collation version obtained from the
- * library does not match the collation version recorded in the catalog,
- * Postgres will issue a WARNING.
+ * If the hook is set, it can instead construct the pg_libc_library or
+ * pg_icu_library structures using custom logic. Ordinarily, this custom logic
+ * would involve finding a specific known version of the collation provider
+ * library, and dynamically loading the necessary symbols. If the collation
+ * version obtained from the library does not match the collation version
+ * recorded in the catalog, Postgres will issue a WARNING.
  *
  * The returned structure must be allocated in TopMemoryContext, and the
  * associated symbols must remain valid permanently. It's expected that the
@@ -139,10 +139,12 @@ char	   *localized_full_months[12 + 1];
  * consider what other information can be provided to allow for greater
  * control over which library is loaded.
  */
+get_libc_library_hook_type get_libc_library_hook = NULL;
 #ifdef USE_ICU
 get_icu_library_hook_type get_icu_library_hook = NULL;
 #endif
 
+static pg_libc_library *builtin_libc_library = NULL;
 #ifdef USE_ICU
 static pg_icu_library *builtin_icu_library = NULL;
 #endif
@@ -173,6 +175,7 @@ static char *IsoLocaleName(const char *);
  * Database default locale.
  */
 static pg_locale_t default_locale = NULL;
+static pg_libc_library *default_libc = NULL;
 #ifdef USE_ICU
 static pg_icu_library *default_icu = NULL;
 #endif
@@ -1392,7 +1395,7 @@ lc_collate_is_c(Oid collation)
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_COLLATE, NULL);
+		localeptr = default_libc->c_setlocale(LC_COLLATE, NULL);
 		if (!localeptr)
 			elog(ERROR, "invalid LC_COLLATE setting");
 
@@ -1445,7 +1448,7 @@ lc_ctype_is_c(Oid collation)
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_CTYPE, NULL);
+		localeptr = default_libc->c_setlocale(LC_CTYPE, NULL);
 		if (!localeptr)
 			elog(ERROR, "invalid LC_CTYPE setting");
 
@@ -1503,6 +1506,82 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+static pg_libc_library *
+get_builtin_libc_library()
+{
+	pg_libc_library *lib = NULL;
+
+	if (builtin_libc_library != NULL)
+		return builtin_libc_library;
+
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_libc_library));
+#if defined(__GLIBC__)
+	lib->libc_version = gnu_get_libc_version;
+#elif defined(WIN32)
+	lib->GetNLSVersionEx = GetNLSVersionEx;
+#endif
+	lib->c_setlocale = setlocale;
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	lib->c_newlocale = newlocale;
+	lib->c_freelocale = freelocale;
+	lib->c_uselocale = uselocale;
+#ifdef LC_VERSION_MASK
+	lib->c_querylocale = querylocale;
+#endif		/* LC_VERSION_MASK */
+#else
+	lib->_create_locale = _create_locale;
+#endif		/* WIN32 */
+#endif		/* HAVE_LOCALE_T */
+	lib->c_wcstombs = wcstombs;
+	lib->c_mbstowcs = mbstowcs;
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	lib->c_wcstombs_l = wcstombs_l;
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	lib->c_mbstowcs_l = mbstowcs_l;
+#endif
+#endif
+	lib->c_strcoll = strcoll;
+	lib->c_wcscoll = wcscoll;
+	lib->c_strxfrm = strxfrm;
+#ifdef HAVE_LOCALE_T
+	lib->c_strcoll_l = strcoll_l;
+	lib->c_wcscoll_l = wcscoll_l;
+	lib->c_strxfrm_l = strxfrm_l;
+#endif
+	lib->c_tolower = tolower;
+	lib->c_toupper = toupper;
+	lib->c_iswalnum = iswalnum;
+	lib->c_towlower = towlower;
+	lib->c_towupper = towupper;
+#ifdef HAVE_LOCALE_T
+	lib->c_tolower_l = tolower_l;
+	lib->c_toupper_l = toupper_l;
+	lib->c_iswalnum_l = iswalnum_l;
+	lib->c_towlower_l = towlower_l;
+	lib->c_towupper_l = towupper_l;
+#endif
+
+	builtin_libc_library = lib;
+	return lib;
+}
+
+static pg_libc_library *
+get_libc_library(const char *collate, const char *ctype, const char *version)
+{
+	pg_libc_library *lib = NULL;
+
+	if (get_libc_library_hook != NULL)
+		lib = get_libc_library_hook(collate, ctype, version);
+
+	if (!lib)
+		lib = get_builtin_libc_library();
+
+	return lib;
+}
+
 #ifdef USE_ICU
 static pg_icu_library *
 get_builtin_icu_library()
@@ -1614,6 +1693,7 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 	if (provider == COLLPROVIDER_LIBC)
 	{
 #ifdef HAVE_LOCALE_T
+		pg_libc_library *libc = get_libc_library(collate, ctype, version);
 		locale_t        loc;
 
 		/* newlocale's result may be leaked if we encounter an error */
@@ -1623,10 +1703,10 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 			/* Normal case where they're the same */
 			errno = 0;
 #ifndef WIN32
-			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+			loc = libc->c_newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
 							NULL);
 #else
-			loc = _create_locale(LC_ALL, collate);
+			loc = libc->_create_locale(LC_ALL, collate);
 #endif
 			if (!loc)
 				report_newlocale_failure(collate);
@@ -1638,11 +1718,11 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 			locale_t	loc1;
 
 			errno = 0;
-			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			loc1 = libc->c_newlocale(LC_COLLATE_MASK, collate, NULL);
 			if (!loc1)
 				report_newlocale_failure(collate);
 			errno = 0;
-			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			loc = libc->c_newlocale(LC_CTYPE_MASK, ctype, loc1);
 			if (!loc)
 				report_newlocale_failure(ctype);
 #else
@@ -1659,6 +1739,7 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 		}
 
 		result->info.libc.lt = loc;
+		result->info.libc.lib = libc;
 #else							/* not HAVE_LOCALE_T */
 		/* platform that doesn't support locale_t */
 		ereport(ERROR,
@@ -1710,6 +1791,12 @@ pg_locale_deterministic(pg_locale_t locale)
 		return locale->deterministic;
 }
 
+pg_libc_library *
+get_default_libc_library()
+{
+	return default_libc;
+}
+
 #ifdef USE_ICU
 pg_icu_library *
 get_default_icu_library()
@@ -1725,6 +1812,19 @@ void
 init_default_locale(char provider, const char *collate, const char *ctype,
 					const char *iculocale, const char *version)
 {
+	default_libc = get_libc_library(collate, ctype, version);
+
+	/*
+	 * If it's the builtin libc, pg_perm_setlocale() will handle
+	 * setlocale(). Otherwise, it must be done here so that the collate and
+	 * ctype take effect.
+	 */
+	if (default_libc != builtin_libc_library)
+	{
+		default_libc->c_setlocale(LC_COLLATE, collate);
+		default_libc->c_setlocale(LC_CTYPE, ctype);
+	}
+
 #ifdef USE_ICU
 	default_icu = get_icu_library(iculocale, version);
 #endif
@@ -1913,19 +2013,23 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 			pg_strncasecmp("C.", collcollate, 2) != 0 &&
 			pg_strcasecmp("POSIX", collcollate) != 0)
 	{
+		pg_libc_library *libc = get_libc_library(collcollate, NULL, NULL);
+
 #if defined(__GLIBC__)
 		/* Use the glibc version because we don't have anything better. */
-		collversion = pstrdup(gnu_get_libc_version());
+		if (libc->libc_version != NULL)
+			collversion = pstrdup(libc->libc_version());
 #elif defined(LC_VERSION_MASK)
 		locale_t	loc;
 
 		/* Look up FreeBSD collation version. */
-		loc = newlocale(LC_COLLATE, collcollate, NULL);
+		loc = libc->c_newlocale(LC_COLLATE, collcollate, NULL);
 		if (loc)
 		{
-			collversion =
-				pstrdup(querylocale(LC_COLLATE_MASK | LC_VERSION_MASK, loc));
-			freelocale(loc);
+			const char *query = libc->c_querylocale(
+				LC_COLLATE_MASK | LC_VERSION_MASK, loc);
+			collversion = pstrdup(query);
+			libc->c_freelocale(loc);
 		}
 		else
 			ereport(ERROR,
@@ -1941,7 +2045,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 		MultiByteToWideChar(CP_ACP, 0, collcollate, -1, wide_collcollate,
 							LOCALE_NAME_MAX_LENGTH);
-		if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
+		if (!libc->GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
 		{
 			/*
 			 * GetNLSVersionEx() wants a language tag such as "en-US", not a
@@ -2031,10 +2135,14 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
+	{
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+		result = libc->c_wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
+	}
 	else
 #endif
-		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+		result = default_libc->c_wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+
 	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
 								 * headers */
 		ereport(ERROR,
@@ -2060,7 +2168,6 @@ static int
 pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 {
 	int result;
-
 	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
 #ifdef WIN32
 	if (GetDatabaseEncoding() == PG_UTF8)
@@ -2074,14 +2181,16 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+
+		result = libc->c_strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
 #endif
 	}
 	else
-		result = strcoll(arg1, arg2);
+		result = default_libc->c_strcoll(arg1, arg2);
 
 	return result;
 }
@@ -2327,17 +2436,15 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 {
 	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
 
-#ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
+	{
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+		return libc->c_strxfrm_l(dest, src, destsize, locale->info.libc.lt);
+	}
 	else
 #endif
-		return strxfrm(dest, src, destsize);
-#else
-	/* shouldn't happen */
-	elog(ERROR, "unsupported collprovider: %c", locale->provider);
-#endif
+		return default_libc->c_strxfrm(dest, src, destsize);
 }
 
 static size_t
@@ -3010,21 +3117,23 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 	if (locale == (pg_locale_t) 0)
 	{
 		/* Use wcstombs directly for the default locale */
-		result = wcstombs(to, from, tolen);
+		result = default_libc->c_wcstombs(to, from, tolen);
 	}
 	else
 	{
 #ifdef HAVE_LOCALE_T
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
+		result = libc->c_wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.libc.lt);
+		locale_t	save_locale = libc->c_uselocale(locale->info.libc.lt);
 
-		result = wcstombs(to, from, tolen);
+		result = libc->c_wcstombs(to, from, tolen);
 
-		uselocale(save_locale);
+		libc->c_uselocale(save_locale);
 #endif							/* HAVE_WCSTOMBS_L */
 #else							/* !HAVE_LOCALE_T */
 		/* Can't have locale != 0 without HAVE_LOCALE_T */
@@ -3087,21 +3196,23 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 		if (locale == (pg_locale_t) 0)
 		{
 			/* Use mbstowcs directly for the default locale */
-			result = mbstowcs(to, str, tolen);
+			result = default_libc->c_mbstowcs(to, str, tolen);
 		}
 		else
 		{
 #ifdef HAVE_LOCALE_T
+			pg_libc_library *libc = PG_LIBC_LIB(locale);
+
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
+			result = libc->c_mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.libc.lt);
+			locale_t	save_locale = libc->c_uselocale(locale->info.libc.lt);
 
-			result = mbstowcs(to, str, tolen);
+			result = libc->c_mbstowcs(to, str, tolen);
 
-			uselocale(save_locale);
+			libc->c_uselocale(save_locale);
 #endif							/* HAVE_MBSTOWCS_L */
 #else							/* !HAVE_LOCALE_T */
 			/* Can't have locale != 0 without HAVE_LOCALE_T */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index afb6c24d64..66a0e92a76 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -13,6 +13,9 @@
 #ifndef _PG_LOCALE_INTERNAL_
 #define _PG_LOCALE_INTERNAL_
 
+#include <wchar.h>
+#include <wctype.h>
+
 #ifdef USE_ICU
 #include <unicode/ubrk.h>
 #include <unicode/ucnv.h>
@@ -32,6 +35,75 @@
 #endif
 #endif
 
+typedef struct pg_libc_library
+{
+	/* version */
+#if defined(__GLIBC__)
+	const char *(*libc_version) (void);
+#elif defined(WIN32)
+	BOOL (*GetNLSVersionEx) (NLS_FUNCTION function, LPCWSTR lpLocaleName,
+							 LPNLSVERSIONINFOEX lpVersionInformation);
+#endif
+
+	/* locale */
+	char *(*c_setlocale) (int category, const char *locale);
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	locale_t (*c_newlocale) (int category_mask, const char *locale,
+						   locale_t base);
+	void (*c_freelocale) (locale_t locobj);
+	locale_t (*c_uselocale) (locale_t newloc);
+#ifdef LC_VERSION_MASK
+	const char *(*c_querylocale) (int mask, locale_t locale);
+#endif
+#else
+	locale_t (*_create_locale) (int category, const char *locale);
+#endif
+#endif
+
+	/* encoding */
+	size_t (*c_wcstombs) (char *dest, const wchar_t *src, size_t n);
+	size_t (*c_mbstowcs) (wchar_t *dest, const char *src, size_t n);
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	size_t (*c_wcstombs_l) (char *dest, const wchar_t *src, size_t n,
+						  locale_t loc);
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	size_t (*c_mbstowcs_l) (wchar_t *dest, const char *src, size_t n,
+						  locale_t loc);
+#endif
+#endif
+
+	/* collation */
+	int (*c_strcoll) (const char *s1, const char *s2);
+	int (*c_wcscoll) (const wchar_t *ws1, const wchar_t *ws2);
+	size_t (*c_strxfrm) (char *s1, const char * s2, size_t n);
+#ifdef HAVE_LOCALE_T
+	int (*c_strcoll_l) (const char *s1, const char *s2, locale_t locale);
+	int (*c_wcscoll_l) (const wchar_t *ws1, const wchar_t *ws2,
+					  locale_t locale);
+	size_t (*c_strxfrm_l) (char *s1, const char * s2, size_t n,
+						 locale_t locale);
+#endif
+
+	/* ctype */
+	int (*c_tolower) (int c);
+	int (*c_toupper) (int c);
+	int (*c_iswalnum) (wint_t wc);
+	wint_t (*c_towlower) (wint_t wc);
+	wint_t (*c_towupper) (wint_t wc);
+#ifdef HAVE_LOCALE_T
+	int (*c_tolower_l) (int c, locale_t locale);
+	int (*c_toupper_l) (int c, locale_t locale);
+	int (*c_iswalnum_l) (wint_t wc, locale_t locale);
+	wint_t (*c_towlower_l) (wint_t wc, locale_t locale);
+	wint_t (*c_towupper_l) (wint_t wc, locale_t locale);
+#endif
+} pg_libc_library;
+
+#define PG_LIBC_LIB(x) ((x)->info.libc.lib)
+
 #ifdef USE_ICU
 /*
  * An ICU library version that we're either linked against or have loaded at
@@ -146,12 +218,13 @@ struct pg_locale_struct
 	char	   *ctype;
 	union
 	{
-#ifdef HAVE_LOCALE_T
 		struct
 		{
+#ifdef HAVE_LOCALE_T
 			locale_t	lt;
-		}			libc;
 #endif
+			pg_libc_library *lib;
+		}			libc;
 #ifdef USE_ICU
 		struct
 		{
@@ -159,10 +232,16 @@ struct pg_locale_struct
 			pg_icu_library	*lib;
 		}			icu;
 #endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+typedef pg_libc_library *(*get_libc_library_hook_type)(
+	const char *collate, const char *ctype, const char *version);
+
+extern PGDLLIMPORT get_libc_library_hook_type get_libc_library_hook;
+
+extern pg_libc_library *get_default_libc_library(void);
+
 #ifdef USE_ICU
 
 typedef pg_icu_library *(*get_icu_library_hook_type)(
-- 
2.34.1
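
For readers skimming the patch above, here is a minimal, self-contained sketch of the pattern it applies to libc: callers stop naming strcoll() and friends directly and instead go through a table of function pointers, and a hook gets the first chance to supply that table. This is not code from the patch. Every name prefixed with demo_ is hypothetical, the struct is cut down to two slots, and the "DESC" collation name is borrowed from the ICU test module purely for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down analogue of pg_libc_library: one slot per libc call. */
typedef struct demo_libc_library
{
	int			(*c_strcoll) (const char *s1, const char *s2);
	int			(*c_tolower) (int c);
} demo_libc_library;

/* Hook type: return a replacement table, or NULL to accept the default. */
typedef demo_libc_library *(*demo_get_library_hook_type) (const char *collate);

static demo_get_library_hook_type demo_get_library_hook = NULL;

/* Default table, filled from the libc this program is linked against. */
static demo_libc_library demo_default_library = {strcoll, tolower};

/* Stand-in for an alternative library: same comparison, opposite order. */
int
demo_reverse_strcoll(const char *s1, const char *s2)
{
	return strcoll(s2, s1);
}

static demo_libc_library demo_reverse_library = {demo_reverse_strcoll, tolower};

/* Example hook: only the made-up collation name "DESC" is redirected. */
demo_libc_library *
demo_reverse_hook(const char *collate)
{
	if (strcmp(collate, "DESC") == 0)
		return &demo_reverse_library;
	return NULL;
}

/* Resolve the table for a collation: ask the hook first, then fall back. */
demo_libc_library *
demo_get_library(const char *collate)
{
	demo_libc_library *lib = NULL;

	assert(collate != NULL);
	if (demo_get_library_hook != NULL)
		lib = demo_get_library_hook(collate);
	return lib != NULL ? lib : &demo_default_library;
}

/* Callers never name strcoll() directly; they go through the table. */
int
demo_compare(const char *collate, const char *a, const char *b)
{
	return demo_get_library(collate)->c_strcoll(a, b);
}
```

With the hook unset, demo_compare() behaves like plain strcoll(). Once demo_get_library_hook is pointed at demo_reverse_hook, only the "DESC" collation changes behavior, because the hook returns NULL for everything else and the default table is used.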

v5-0005-Add-test-module-for-icu-collation-provider-hook.patch (text/x-patch; charset=UTF-8)

From 90dab86014d610cc7740874ed8faaee37582ed4f Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 21 Dec 2022 19:50:10 -0800
Subject: [PATCH v5 5/7] Add test module for icu collation provider hook.

---
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 .../modules/test_collation_lib_hooks/Makefile |  24 ++
 .../test_collation_lib_hooks/meson.build      |  37 +++
 .../test_collation_lib_hooks/t/001_icu.pl     | 153 ++++++++++++
 .../test_collation_lib_hooks.c                |  43 ++++
 .../test_collation_lib_hooks.control          |   4 +
 .../test_collation_lib_hooks.h                |  28 +++
 .../test_collation_lib_hooks/test_icu_hook.c  | 228 ++++++++++++++++++
 9 files changed, 519 insertions(+)
 create mode 100644 src/test/modules/test_collation_lib_hooks/Makefile
 create mode 100644 src/test/modules/test_collation_lib_hooks/meson.build
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/001_icu.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_icu_hook.c

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..261bf5e729 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -15,6 +15,7 @@ SUBDIRS = \
 		  snapshot_too_old \
 		  spgist_name_ops \
 		  test_bloomfilter \
+		  test_collation_lib_hooks \
 		  test_copy_callbacks \
 		  test_custom_rmgrs \
 		  test_ddl_deparse \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index eefc0b2063..1530ee4970 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -11,6 +11,7 @@ subdir('snapshot_too_old')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
+subdir('test_collation_lib_hooks')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
 subdir('test_ddl_deparse')
diff --git a/src/test/modules/test_collation_lib_hooks/Makefile b/src/test/modules/test_collation_lib_hooks/Makefile
new file mode 100644
index 0000000000..05948e555a
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_collation_lib_hooks/Makefile
+
+MODULE_big = test_collation_lib_hooks
+OBJS = \
+	$(WIN32RES) \
+	test_collation_lib_hooks.o test_icu_hook.o
+PGFILEDESC = "test_collation_lib_hooks - test collation provider library hooks"
+
+EXTENSION = test_collation_lib_hooks
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_collation_lib_hooks
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+export with_icu
diff --git a/src/test/modules/test_collation_lib_hooks/meson.build b/src/test/modules/test_collation_lib_hooks/meson.build
new file mode 100644
index 0000000000..56b32b6cd1
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/meson.build
@@ -0,0 +1,37 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_collation_lib_hooks_sources = files(
+  'test_collation_lib_hooks.c',
+  'test_icu_hook.c',
+)
+
+if host_system == 'windows'
+  test_collation_lib_hooks_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_collation_lib_hooks',
+    '--FILEDESC', 'test_collation_lib_hooks - test collation provider library hooks',])
+endif
+
+test_collation_lib_hooks = shared_module('test_collation_lib_hooks',
+  test_collation_lib_hooks_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_collation_lib_hooks
+
+install_data(
+  'test_collation_lib_hooks.control',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_collation_lib_hooks',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_icu.pl',
+    ],
+    'env': {
+      'with_icu': icu.found() ? 'yes' : 'no',
+    },
+  },
+}
diff --git a/src/test/modules/test_collation_lib_hooks/t/001_icu.pl b/src/test/modules/test_collation_lib_hooks/t/001_icu.pl
new file mode 100644
index 0000000000..e6f5372445
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/001_icu.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{with_icu} ne 'yes')
+{
+	plan skip_all => 'ICU not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+$node->safe_psql('postgres',
+	q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' ICU_LOCALE 'DESC' ENCODING 'UTF8' TEMPLATE template0}
+);
+
+# setup
+$node->safe_psql('dbicu',
+	qq[CREATE COLLATION test_asc (PROVIDER=icu, LOCALE='ASC')]);
+$node->safe_psql('dbicu',
+	qq[CREATE COLLATION test_desc (PROVIDER=icu, LOCALE='DESC')]);
+
+$node->safe_psql('dbicu', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('dbicu',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+# check versions
+
+my $version_db =
+  $node->safe_psql('dbicu',
+	  qq[SELECT datcollversion FROM pg_database WHERE datname='dbicu']);
+is($version_db, '2.72',
+	'database "dbicu" has correct version 2.72'
+);
+
+my $version_asc =
+  $node->safe_psql('dbicu',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_asc']);
+is($version_asc, '2.72',
+	'collation "test_asc" has correct version 2.72'
+);
+
+my $version_desc =
+  $node->safe_psql('dbicu',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_desc']);
+is($version_desc, '2.72',
+	'collation "test_desc" has correct version 2.72'
+);
+
+my $res_sort_expected = "aBcD
+fGhI
+wXyZ";
+
+my $res_reversesort_expected = "wXyZ
+fGhI
+aBcD";
+
+# test comparison
+
+my $comparison =
+  $node->safe_psql('dbicu',
+	  qq[SELECT 'aBcD' COLLATE test_asc < 'wXyZ' COLLATE test_asc]);
+is($comparison, 't',
+	'correct comparison'
+);
+
+# test reverse comparison (database)
+
+my $dbcomparison_reverse =
+  $node->safe_psql('dbicu', qq[SELECT 'aBcD' < 'wXyZ']);
+is($dbcomparison_reverse, 'f',
+	'correct reverse comparison (database)'
+);
+
+# test reverse comparison
+
+my $comparison_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT 'aBcD' COLLATE test_desc < 'wXyZ' COLLATE test_desc]);
+is($comparison_reverse, 'f',
+	'correct reverse comparison'
+);
+
+# test asc sort
+
+my $res_sort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($res_sort, $res_sort_expected,
+	'correct ascending sort'
+);
+
+# test desc sort (database)
+
+my $res_db_reversesort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t]);
+is($res_db_reversesort, $res_reversesort_expected,
+	'correct descending sort (database)'
+);
+
+# test desc sort
+
+my $res_reversesort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($res_reversesort, $res_reversesort_expected,
+	'correct descending sort'
+);
+
+# test lower/upper
+
+my $tcase =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_asc),
+                upper('aBcDfgHiwXyZ' collate test_asc)]);
+is($tcase, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+	'correct lowercase and uppercase'
+);
+
+# test reverse lower/upper (database)
+
+my $tcase_db_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ'),
+                upper('aBcDfgHiwXyZ')]);
+is($tcase_db_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase (database)'
+);
+
+# test reverse lower/upper
+
+my $tcase_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_desc),
+                upper('aBcDfgHiwXyZ' collate test_desc)]);
+is($tcase_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase'
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
new file mode 100644
index 0000000000..599ec61239
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
@@ -0,0 +1,43 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collation_lib_hooks.c
+ *		Code for testing collation provider library hooks
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
+ *
+ * Test implementation of icu-like collation provider.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "miscadmin.h"
+
+#include "test_collation_lib_hooks.h"
+
+#ifdef USE_ICU
+static get_icu_library_hook_type prev_get_icu_library_hook = NULL;
+#endif
+
+PG_MODULE_MAGIC;
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	if (!process_shared_preload_libraries_in_progress)
+		ereport(ERROR, (errmsg("test_collation_lib_hooks must be loaded via shared_preload_libraries")));
+
+#ifdef USE_ICU
+	prev_get_icu_library_hook = get_icu_library_hook;
+	get_icu_library_hook = test_get_icu_library;
+#endif
+}
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
new file mode 100644
index 0000000000..a0b8e031a4
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
@@ -0,0 +1,4 @@
+comment = 'Test code for collation provider library hooks'
+default_version = '1.0'
+module_pathname = '$libdir/test_collation_lib_hooks'
+
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
new file mode 100644
index 0000000000..e6ee457ab3
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
@@ -0,0 +1,28 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collation_lib_hooks.h
+ *		Definitions for collation library hooks.
+ *
+ * Copyright (c) 2015-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#ifndef TEST_COLLATION_LIB_HOOKS_H
+#define TEST_COLLATION_LIB_HOOKS_H
+
+#include "postgres.h"
+
+#include "utils/memutils.h"
+#include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
+
+#ifdef USE_ICU
+extern pg_icu_library *test_get_icu_library(const char *locale,
+											const char *version);
+#endif
+
+#endif
diff --git a/src/test/modules/test_collation_lib_hooks/test_icu_hook.c b/src/test/modules/test_collation_lib_hooks/test_icu_hook.c
new file mode 100644
index 0000000000..ae257cc03b
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_icu_hook.c
@@ -0,0 +1,228 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_icu_hook.c
+ *		Code for testing collation provider icu hook.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_icu_hook.c
+ *
+ * Implements a custom icu-like collation provider library for testing the
+ * hooks. It accepts any collation name requested. All behave exactly like the
+ * "en_US" locale, except for the locale named "DESC", which reverses the sort
+ * order and reverses uppercase/lowercase behavior.
+ *
+ * The version is always reported as 2.72, so loading it will cause a version
+ * mismatch warning.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "test_collation_lib_hooks.h"
+
+#ifdef USE_ICU
+
+#include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
+
+#define TEST_LOCALE "en_US"
+
+typedef struct TestUCollator {
+	UCollator	*ucol;
+	bool		 reverse;
+} TestUCollator;
+
+static pg_icu_library *test_icu_library = NULL;
+static const UVersionInfo test_icu_version = { 2, 72 };
+
+static bool
+locale_is_reverse(const char *locale)
+{
+	if (strcmp(locale, "DESC") == 0)
+		return true;
+	else
+		return false;
+}
+
+static UCollator *
+test_openCollator(const char *loc, UErrorCode *status)
+{
+	TestUCollator *testcol = MemoryContextAlloc(TopMemoryContext, sizeof(TestUCollator));
+	UCollator *ucol = ucol_open(TEST_LOCALE, status);
+	testcol->ucol = ucol;
+	testcol->reverse = locale_is_reverse(loc);
+	return (UCollator *)testcol;
+}
+
+static void
+test_closeCollator(UCollator *coll)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_close(testcol->ucol);
+	pfree(testcol);
+}
+
+static void
+test_setAttribute(UCollator *coll, UColAttribute attr,
+				  UColAttributeValue value, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_setAttribute(testcol->ucol, attr, value, status);
+}
+
+static void
+test_getCollatorVersion(const UCollator *coll, UVersionInfo info)
+{
+	memcpy(info, test_icu_version, sizeof(UVersionInfo));
+}
+
+static UCollationResult
+test_strcoll(const UCollator *coll, const UChar *source, int32_t sourceLength,
+			 const UChar *target, int32_t targetLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcoll(testcol->ucol, source, sourceLength,
+										target, targetLength);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static UCollationResult
+test_strcollUTF8(const UCollator *coll, const char *source,
+				 int32_t sourceLength, const char *target,
+				 int32_t targetLength, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcollUTF8(testcol->ucol, source,
+											sourceLength, target,
+											targetLength, status);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static int32_t
+test_getSortKey(const UCollator *coll, const UChar *source,
+				int32_t sourceLength, uint8_t *result, int32_t resultLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_getSortKey(testcol->ucol, source, sourceLength,
+								  result, resultLength);
+	size_t result_size = ret + 1;
+
+	if (resultLength >= result_size)
+	{
+		result[ret] = '\0';
+
+		if (testcol->reverse)
+			for (int i = 0; i < result_size; i++)
+				*((unsigned char *) result + i) ^= (unsigned char) 0xff;
+	}
+
+	return result_size;
+}
+
+static int32_t
+test_nextSortKeyPart(const UCollator *coll, UCharIterator *iter,
+					 uint32_t state[2], uint8_t *dest, int32_t count,
+					 UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_nextSortKeyPart(testcol->ucol, iter, state, dest,
+									   count, status);
+
+	if (testcol->reverse)
+		for (int i = 0; i < ret; i++)
+			*((unsigned char *) dest + i) ^= (unsigned char) 0xff;
+
+	/*
+	 * The following is not correct for cases where we finish precisely on the
+	 * boundary (i.e. count is exactly enough). To fix this we'd need to track
+	 * additional state across calls, which doesn't seem worth it for a test
+	 * case.
+	 */
+	if (count >= ret && ret > 0)
+	{
+		if (testcol->reverse)
+			dest[ret] = 0xff;
+		else
+			dest[ret] = '\0';
+		return ret + 1;
+	}
+
+	return ret;
+}
+
+static int32_t
+test_strToUpper(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+static int32_t
+test_strToLower(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+pg_icu_library *
+test_get_icu_library(const char *locale, const char *version)
+{
+	pg_icu_library *lib;
+
+	if (test_icu_library != NULL)
+		return test_icu_library;
+
+	ereport(LOG, (errmsg("loading custom ICU provider for test_collation_lib_hooks")));
+
+	lib = MemoryContextAlloc(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = test_openCollator;
+	lib->closeCollator = test_closeCollator;
+	lib->getCollatorVersion = test_getCollatorVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = test_strcoll;
+	lib->strcollUTF8 = test_strcollUTF8;
+	lib->getSortKey = test_getSortKey;
+	lib->nextSortKeyPart = test_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = test_strToUpper;
+	lib->strToLower = test_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = test_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	test_icu_library = lib;
+	return lib;
+}
+
+#endif				/* USE_ICU */
-- 
2.34.1
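
The "DESC" locale in the test module above reverses sort-key order by XORing each key byte with 0xff. A small standalone sketch of why that works, including for keys where one is a prefix of the other, is below. It is not taken from the patch: the demo_ function names are hypothetical, and the keys are plain byte arrays rather than real ICU sort keys.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Flip every byte of a sort key in place, terminator included. A byte x
 * becomes 255 - x, so any bytewise "less than" turns into "greater than".
 */
void
demo_invert_sort_key(unsigned char *key, size_t len)
{
	assert(key != NULL || len == 0);
	for (size_t i = 0; i < len; i++)
		key[i] ^= 0xff;
}

/* Compare two keys bytewise over their common length, as memcmp() does. */
int
demo_compare_sort_keys(const unsigned char *a, size_t alen,
					   const unsigned char *b, size_t blen)
{
	return memcmp(a, b, alen < blen ? alen : blen);
}
```

The terminator matters for the prefix case. For keys "ab\0" and "abc\0", the first difference is 0x00 against 'c', so the shorter key sorts first. After inversion that position holds 0xff against 0x9c, so the shorter key sorts last. That appears to be the reason test_nextSortKeyPart writes 0xff rather than '\0' as the trailing byte for the reversed collator.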

v5-0004-Support-multiple-ICU-collation-provider-libraries.patch (text/x-patch; charset=UTF-8)

From 3e9cfe81d8743ad5a19b3d703134800fad69b3bf Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 7 Dec 2022 11:07:31 -0800
Subject: [PATCH v5 4/7] Support multiple ICU collation provider libraries.

Introduce structure pg_icu_library, which holds pointers to each
required ICU method, and store this as part of pg_locale_t. Each call
to an ICU function instead goes through this structure, so that it can
more easily be replaced by a non-builtin ICU library.

Offer a hook to allow an extension to control which ICU library is
loaded for a given locale and version.

Author: Thomas Munro, Jeff Davis
---
 src/backend/commands/collationcmds.c   |  17 +-
 src/backend/utils/adt/formatting.c     |  67 ++++--
 src/backend/utils/adt/pg_locale.c      | 299 ++++++++++++++++++-------
 src/include/utils/pg_locale_internal.h | 121 +++++++++-
 4 files changed, 398 insertions(+), 106 deletions(-)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 9e84da4891..4e90bfa65e 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -560,13 +560,14 @@ get_icu_language_tag(const char *localename)
 {
 	char		buf[ULOC_FULLNAME_CAPACITY];
 	UErrorCode	status;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
+	iculib->toLanguageTag(localename, buf, sizeof(buf), true, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
+						localename, iculib->errorName(status))));
 
 	return pstrdup(buf);
 }
@@ -585,11 +586,12 @@ get_icu_locale_comment(const char *localename)
 	int32		len_uchar;
 	int32		i;
 	char	   *result;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	len_uchar = uloc_getDisplayName(localename, "en",
-									displayname, lengthof(displayname),
-									&status);
+	len_uchar = iculib->getDisplayName(localename, "en",
+									   displayname, lengthof(displayname),
+									   &status);
 	if (U_FAILURE(status))
 		return NULL;			/* no good reason to raise an error */
 
@@ -809,12 +811,13 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 	{
 		int			i;
+		pg_icu_library *iculib = get_default_icu_library();
 
 		/*
 		 * Start the loop at -1 to sneak in the root locale without too much
 		 * code duplication.
 		 */
-		for (i = -1; i < uloc_countAvailable(); i++)
+		for (i = -1; i < iculib->countAvailable(); i++)
 		{
 			const char *name;
 			char	   *langtag;
@@ -825,7 +828,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			if (i == -1)
 				name = "";		/* ICU root locale */
 			else
-				name = uloc_getAvailable(i);
+				name = iculib->getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 3067431cbe..7f318fe2c7 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1565,6 +1565,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1572,6 +1577,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 {
 	UErrorCode	status;
 	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
 	len_dest = len_source;		/* try first with same length */
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
@@ -1589,18 +1595,42 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->ctype, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->ctype, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1666,11 +1696,12 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			int32_t		len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToLower, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1788,11 +1819,12 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToUpper, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1911,11 +1943,12 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case_bi(iculib->strToTitle, mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index d8df2a3832..f65cd22fbd 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -70,6 +70,8 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -110,6 +112,41 @@ char	   *localized_full_days[7 + 1];
 char	   *localized_abbrev_months[12 + 1];
 char	   *localized_full_months[12 + 1];
 
+/*
+ * The precise version of a collation provider library is important, because
+ * subtle changes in collation between library versions can corrupt
+ * indexes. This hook allows control over how collation provider libraries are
+ * loaded.
+ *
+ * If the hook is not set, or if it returns NULL, Postgres constructs the
+ * pg_icu_library structure from the symbols Postgres is linked with at build
+ * time.
+ *
+ * If the hook is set, it can instead construct the pg_icu_library structure
+ * using custom logic. Ordinarily, this custom logic would involve finding a
+ * specific known version of the collation provider library, and dynamically
+ * loading the necessary symbols. If the collation version obtained from the
+ * library does not match the collation version recorded in the catalog,
+ * Postgres will issue a WARNING.
+ *
+ * The returned structure must be allocated in TopMemoryContext, and the
+ * associated symbols must remain valid permanently. It's expected that the
+ * hook will maintain a small finite list of libraries and return a pointer to
+ * the appropriate one, so that resource use remains bounded.
+ *
+ * XXX: For now, the only information the hook has access to is the ICU locale
+ * name, and the collation version as recorded in the catalog. We should
+ * consider what other information can be provided to allow for greater
+ * control over which library is loaded.
+ */
+#ifdef USE_ICU
+get_icu_library_hook_type get_icu_library_hook = NULL;
+#endif
+
+#ifdef USE_ICU
+static pg_icu_library *builtin_icu_library = NULL;
+#endif
+
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
 static bool CurrentLCTimeValid = false;
@@ -136,8 +173,12 @@ static char *IsoLocaleName(const char *);
  * Database default locale.
  */
 static pg_locale_t default_locale = NULL;
+#ifdef USE_ICU
+static pg_icu_library *default_icu = NULL;
+#endif
 
 #ifdef USE_ICU
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -145,13 +186,17 @@ static pg_locale_t default_locale = NULL;
  */
 static UConverter *icu_converter = NULL;
 
-static void init_icu_converter(void);
-static size_t uchar_length(UConverter *converter,
+static void init_icu_converter(pg_icu_library *iculib);
+static size_t uchar_length(pg_icu_library *iculib,
+						   UConverter *converter,
 						   const char *str, size_t len);
-static int32_t uchar_convert(UConverter *converter,
+static int32_t uchar_convert(pg_icu_library *iculib,
+							 UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, size_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(pg_icu_library *iculib,
+										 UCollator *collator,
+										 const char *loc);
 #endif
 
 /*
@@ -1458,6 +1503,78 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+#ifdef USE_ICU
+static pg_icu_library *
+get_builtin_icu_library(void)
+{
+	pg_icu_library *lib;
+
+	if (builtin_icu_library != NULL)
+		return builtin_icu_library;
+
+	/*
+	 * These assignments will fail to compile if an incompatible API change is
+	 * made to some future version of ICU, at which point we might need to
+	 * consider special treatment for different major version ranges, with
+	 * intermediate trampoline functions.
+	 */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = ucol_open;
+	lib->closeCollator = ucol_close;
+	lib->getCollatorVersion = ucol_getVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = ucol_strcoll;
+	lib->strcollUTF8 = ucol_strcollUTF8;
+	lib->getSortKey = ucol_getSortKey;
+	lib->nextSortKeyPart = ucol_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = u_strToUpper;
+	lib->strToLower = u_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = ucol_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	/*
+	 * Also assert the size of a couple of types used as output buffers, as a
+	 * canary to tell us to add extra padding in the (unlikely) event that a
+	 * later release makes these values smaller.
+	 */
+	StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+					 "u_versionToString output buffer size changed incompatibly");
+	StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+					 "ucol_getVersion output buffer size changed incompatibly");
+
+	builtin_icu_library = lib;
+	return lib;
+}
+
+static pg_icu_library *
+get_icu_library(const char *locale, const char *version)
+{
+	pg_icu_library *lib = NULL;
+
+	if (get_icu_library_hook != NULL)
+		lib = get_icu_library_hook(locale, version);
+
+	if (!lib)
+		lib = get_builtin_icu_library();
+
+	return lib;
+}
+
+#endif
 
 /*
  * Construct a new pg_locale_t object.
@@ -1554,20 +1671,24 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 	{
 		UCollator  *collator;
 		UErrorCode	status;
+		pg_icu_library *iculib = get_icu_library(collate, version);
+
+		Assert(strcmp(collate, ctype) == 0);
 
 		/* collator may be leaked if we encounter an error */
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collate, &status);
+		collator = iculib->openCollator(collate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collate, u_errorName(status))));
+							collate, iculib->errorName(status))));
 
 		if (U_ICU_VERSION_MAJOR_NUM < 54)
-			icu_set_collation_attributes(collator, collate);
+			icu_set_collation_attributes(iculib, collator, collate);
 
 		result->info.icu.ucol = collator;
+		result->info.icu.lib = iculib;
 	}
 #endif
 	else
@@ -1589,6 +1710,14 @@ pg_locale_deterministic(pg_locale_t locale)
 		return locale->deterministic;
 }
 
+#ifdef USE_ICU
+pg_icu_library *
+get_default_icu_library(void)
+{
+	return default_icu;
+}
+#endif
+
 /*
  * Initialize default database locale.
  */
@@ -1596,6 +1725,10 @@ void
 init_default_locale(char provider, const char *collate, const char *ctype,
 					const char *iculocale, const char *version)
 {
+#ifdef USE_ICU
+	default_icu = get_icu_library(iculocale, version);
+#endif
+
 	/*
 	 * For the purposes of pg_locale_t, if the provider is ICU, we use
 	 * iculocale for both collate and ctype.
@@ -1759,17 +1892,18 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
+		pg_icu_library	*iculib = get_icu_library(collcollate, NULL);
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = iculib->openCollator(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, iculib->errorName(status))));
+		iculib->getCollatorVersion(collator, versioninfo);
+		iculib->closeCollator(collator);
 
-		u_versionToString(versioninfo, buf);
+		iculib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -2021,16 +2155,17 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	UChar	*uchar1,
 			*uchar2;
 	int		 result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 #ifdef HAVE_UCOL_STRCOLLUTF8
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 #endif
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen1 = uchar_length(icu_converter, arg1, len1);
-	ulen2 = uchar_length(icu_converter, arg2, len2);
+	ulen1 = uchar_length(iculib, icu_converter, arg1, len1);
+	ulen2 = uchar_length(iculib, icu_converter, arg2, len2);
 
 	bufsize1 = (ulen1 + 1) * sizeof(UChar);
 	bufsize2 = (ulen2 + 1) * sizeof(UChar);
@@ -2041,12 +2176,12 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	uchar1 = (UChar *) buf;
 	uchar2 = (UChar *) (buf + bufsize1);
 
-	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
-	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+	ulen1 = uchar_convert(iculib, icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(iculib, icu_converter, uchar2, ulen2 + 1, arg2, len2);
 
-	result = ucol_strcoll(locale->info.icu.ucol,
-						  uchar1, ulen1,
-						  uchar2, ulen2);
+	result = iculib->strcoll(locale->info.icu.ucol,
+							 uchar1, ulen1,
+							 uchar2, ulen2);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2067,6 +2202,7 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 				pg_locale_t locale)
 {
 	int result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
@@ -2076,13 +2212,14 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 		UErrorCode	status;
 
 		status = U_ZERO_ERROR;
-		result = ucol_strcollUTF8(locale->info.icu.ucol,
-								  arg1, len1,
-								  arg2, len2,
-								  &status);
+		result = iculib->strcollUTF8(locale->info.icu.ucol,
+									 arg1, len1,
+									 arg2, len2,
+									 &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
-					(errmsg("collation failed: %s", u_errorName(status))));
+					(errmsg("collation failed: %s",
+							iculib->errorName(status))));
 	}
 	else
 #endif
@@ -2261,12 +2398,13 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 	int32_t	 ulen;
 	size_t   uchar_bsize;
 	Size	 result_bsize;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2275,11 +2413,11 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
-	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
-								   uchar, ulen,
-								   (uint8_t *) dest, destsize);
+	result_bsize = iculib->getSortKey(locale->info.icu.ucol,
+									  uchar, ulen,
+									  (uint8_t *) dest, destsize);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2308,13 +2446,14 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 	UChar			*uchar = NULL;
 	size_t			 uchar_bsize;
 	Size			 result_bsize;
+	pg_icu_library	*iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2323,21 +2462,19 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
 	uiter_setString(&iter, uchar, ulen);
 	state[0] = state[1] = 0;	/* won't need that again */
 	status = U_ZERO_ERROR;
-	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
-										&iter,
-										state,
-										(uint8_t *) dest,
-										destsize,
-										&status);
+	result_bsize = iculib->nextSortKeyPart(
+		locale->info.icu.ucol, &iter, state,
+		(uint8_t *) dest, destsize, &status);
+
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("sort key generation failed: %s",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return result_bsize;
 }
@@ -2346,6 +2483,7 @@ static size_t
 pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 					   size_t destsize, pg_locale_t locale)
 {
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 	size_t result;
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
@@ -2356,19 +2494,17 @@ pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 		uint32_t	state[2];
 		UErrorCode	status;
 
-		uiter_setUTF8(&iter, src, srclen);
+		iculib->setUTF8(&iter, src, srclen);
 		state[0] = state[1] = 0;	/* won't need that again */
 		status = U_ZERO_ERROR;
-		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
-									  &iter,
-									  state,
-									  (uint8_t *) dest,
-									  destsize,
-									  &status);
+		result = iculib->nextSortKeyPart(
+			locale->info.icu.ucol, &iter, state,
+			(uint8_t *) dest, destsize, &status);
+
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("sort key generation failed: %s",
-							u_errorName(status))));
+							iculib->errorName(status))));
 	}
 	else
 		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
@@ -2564,7 +2700,7 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 
 #ifdef USE_ICU
 static void
-init_icu_converter(void)
+init_icu_converter(pg_icu_library *iculib)
 {
 	const char *icu_encoding_name;
 	UErrorCode	status;
@@ -2581,11 +2717,11 @@ init_icu_converter(void)
 						pg_encoding_to_char(GetDatabaseEncoding()))));
 
 	status = U_ZERO_ERROR;
-	conv = ucnv_open(icu_encoding_name, &status);
+	conv = iculib->openConverter(icu_encoding_name, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open ICU converter for encoding \"%s\": %s",
-						icu_encoding_name, u_errorName(status))));
+						icu_encoding_name, iculib->errorName(status))));
 
 	icu_converter = conv;
 }
@@ -2594,14 +2730,15 @@ init_icu_converter(void)
  * Find length, in UChars, of given string if converted to UChar string.
  */
 static size_t
-uchar_length(UConverter *converter, const char *str, size_t len)
+uchar_length(pg_icu_library *iculib, UConverter *converter, const char *str, size_t len)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
-	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	ulen = iculib->toUChars(converter, NULL, 0, str, len, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2610,16 +2747,17 @@ uchar_length(UConverter *converter, const char *str, size_t len)
  * return the length (in UChars).
  */
 static int32_t
-uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
-			  const char *src, size_t srclen)
+uchar_convert(pg_icu_library *iculib, UConverter *converter, UChar *dest,
+			  int32_t destlen, const char *src, size_t srclen)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
 	status = U_ZERO_ERROR;
-	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	ulen = iculib->toUChars(converter, dest, destlen, src, srclen, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2636,16 +2774,17 @@ uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
  * result length instead.
  */
 int32_t
-icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
+icu_to_uchar(pg_icu_library *iculib, UChar **buff_uchar, const char *buff,
+			 size_t nbytes)
 {
 	int32_t len_uchar;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	len_uchar = uchar_length(icu_converter, buff, nbytes);
+	len_uchar = uchar_length(iculib, icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-	len_uchar = uchar_convert(icu_converter,
+	len_uchar = uchar_convert(iculib, icu_converter,
 							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
@@ -2663,30 +2802,32 @@ icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
  * The result string is nul-terminated.
  */
 int32_t
-icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
+icu_from_uchar(pg_icu_library *iculib, char **result, const UChar *buff_uchar,
+			   int32_t len_uchar)
 {
 	UErrorCode	status;
 	int32_t		len_result;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, NULL, 0,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, NULL, 0,
+									buff_uchar, len_uchar, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	*result = palloc(len_result + 1);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, *result,
+									len_result + 1, buff_uchar,
+									len_uchar, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return len_result;
 }
@@ -2702,7 +2843,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(pg_icu_library *iculib, UCollator *collator,
+							 const char *loc)
 {
 	char	   *str = asc_tolower(loc, strlen(loc));
 
@@ -2776,7 +2918,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 				status = U_ILLEGAL_ARGUMENT_ERROR;
 
 			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
+				iculib->setAttribute(collator, uattr, uvalue, &status);
 
 			/*
 			 * Pretend the error came from ucol_open(), for consistent error
@@ -2785,7 +2927,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			if (U_FAILURE(status))
 				ereport(ERROR,
 						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+								loc, iculib->errorName(status))));
 		}
 	}
 }
@@ -2801,17 +2943,18 @@ check_icu_locale(const char *icu_locale)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	pg_icu_library *iculib = get_icu_library(icu_locale, NULL);
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = iculib->openCollator(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, iculib->errorName(status))));
 
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+		icu_set_collation_attributes(iculib, collator, icu_locale);
+	iculib->closeCollator(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index 33465ad92d..afb6c24d64 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -14,6 +14,8 @@
 #define _PG_LOCALE_INTERNAL_
 
 #ifdef USE_ICU
+#include <unicode/ubrk.h>
+#include <unicode/ucnv.h>
 #include <unicode/ucol.h>
 #endif
 
@@ -30,6 +32,106 @@
 #endif
 #endif
 
+#ifdef USE_ICU
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	int			minor_version;
+	void		(*getICUVersion) (UVersionInfo info);
+	void		(*getUnicodeVersion) (UVersionInfo info);
+	void		(*getCLDRVersion) (UVersionInfo info, UErrorCode *status);
+	UCollator  *(*openCollator) (const char *loc, UErrorCode *status);
+	void		(*closeCollator) (UCollator *coll);
+	void		(*getCollatorVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*getUCAVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+	UCollationResult (*strcoll) (const UCollator *coll,
+								 const UChar *source,
+								 int32_t sourceLength,
+								 const UChar *target,
+								 int32_t targetLength);
+	UCollationResult (*strcollUTF8) (const UCollator *coll,
+									 const char *source,
+									 int32_t sourceLength,
+									 const char *target,
+									 int32_t targetLength,
+									 UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	void		(*setAttribute) (UCollator *coll,
+								 UColAttribute attr,
+								 UColAttributeValue value,
+								 UErrorCode *status);
+	UConverter *(*openConverter) (const char *converterName,
+								  UErrorCode *err);
+	void		(*closeConverter) (UConverter *converter);
+	int32_t		(*fromUChars) (UConverter *cnv,
+							   char *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UErrorCode *pErrorCode);
+	int32_t		(*toUChars) (UConverter *cnv,
+							 UChar *dest,
+							 int32_t destCapacity,
+							 const char *src,
+							 int32_t srcLength,
+							 UErrorCode *pErrorCode);
+	int32_t		(*toLanguageTag) (const char *localeID,
+								  char *langtag,
+								  int32_t langtagCapacity,
+								  UBool strict,
+								  UErrorCode *err);
+	int32_t		(*getDisplayName) (const char *localeID,
+								   const char *inLocaleID,
+								   UChar *result,
+								   int32_t maxResultSize,
+								   UErrorCode *err);
+	int32_t		(*countAvailable) (void);
+	const char *(*getAvailable) (int32_t n);
+} pg_icu_library;
+
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+
+#endif
+
 /*
  * We define our own wrapper around locale_t so we can keep the same
  * function signatures for all builds, while not having to create a
@@ -53,7 +155,8 @@ struct pg_locale_struct
 #ifdef USE_ICU
 		struct
 		{
-			UCollator	*ucol;
+			UCollator		*ucol;
+			pg_icu_library	*lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
@@ -61,8 +164,18 @@ struct pg_locale_struct
 };
 
 #ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
+
+typedef pg_icu_library *(*get_icu_library_hook_type)(
+	const char *locale, const char *version);
+
+extern PGDLLIMPORT get_icu_library_hook_type get_icu_library_hook;
+
+extern pg_icu_library *get_default_icu_library(void);
+extern int32_t icu_to_uchar(pg_icu_library *lib, UChar **buff_uchar,
+							const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(pg_icu_library *lib, char **result,
+							  const UChar *buff_uchar, int32_t len_uchar);
+
+#endif							/* USE_ICU */
 
 #endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1

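The hook-with-fallback lookup that get_icu_library() performs in the patch above can be sketched outside of Postgres as follows. This is only an illustration under stated assumptions: demo_icu_library is a cut-down, hypothetical stand-in for pg_icu_library, every demo_* name is invented for this sketch, and the version string "demo-1" and the major_version values 1 and 2 are arbitrary placeholders rather than real ICU or collation versions. A real hook would presumably locate and dynamically load a specific ICU build instead of returning a static structure.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for pg_icu_library (illustration only). */
typedef struct demo_icu_library
{
	int			major_version;
	const char *(*errorName) (int code);
} demo_icu_library;

/* Same shape as get_icu_library_hook_type: locale name plus recorded version. */
typedef demo_icu_library *(*demo_get_icu_library_hook_type) (const char *locale,
															 const char *version);

/* Unset by default, like get_icu_library_hook. */
demo_get_icu_library_hook_type demo_get_icu_library_hook = NULL;

static const char *
demo_error_name(int code)
{
	return code == 0 ? "no error" : "error";
}

/* Stands in for the library the server was linked with at build time. */
static demo_icu_library demo_builtin_library = {1, demo_error_name};

/* Stands in for a specific library version that an extension has loaded. */
static demo_icu_library demo_pinned_library = {2, demo_error_name};

/*
 * Lookup with fallback: ask the hook first, and use the built-in library
 * if no hook is set or the hook declines by returning NULL.
 */
demo_icu_library *
demo_get_icu_library(const char *locale, const char *version)
{
	demo_icu_library *lib = NULL;

	if (demo_get_icu_library_hook != NULL)
		lib = demo_get_icu_library_hook(locale, version);

	if (lib == NULL)
		lib = &demo_builtin_library;

	return lib;
}

/*
 * Example hook: claims only collations whose recorded version is the
 * placeholder "demo-1". A NULL version (as passed when checking a locale
 * or fetching the actual version) is declined.
 */
demo_icu_library *
demo_pinned_hook(const char *locale, const char *version)
{
	(void) locale;				/* unused in this sketch */

	if (version != NULL && strcmp(version, "demo-1") == 0)
		return &demo_pinned_library;

	return NULL;
}
```

An extension would assign demo_pinned_hook to demo_get_icu_library_hook once at load time. Returning a pointer to long-lived storage mirrors the requirement in the patch comment that the returned structure and its symbols remain valid permanently.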
v5-0003-Refactor-pg_locale_t-routines.patch (text/x-patch; charset=UTF-8)

From 83246a735a1be08af4176fda94b40c72a55bbf61 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 5 Dec 2022 10:43:52 -0800
Subject: [PATCH v5 3/7] Refactor pg_locale_t routines.

  * add pg_locale_internal.h to hide pg_locale_struct
  * move info.lt into info.libc.lt to match icu
  * introduce init_default_locale()
  * introduce pg_locale_deterministic() accessor
  * make default_locale a static global in pg_locale.c
  * refactor pg_newlocale_from_collation()
---
 src/backend/access/hash/hashfunc.c     |  82 +++---
 src/backend/commands/collationcmds.c   |   1 +
 src/backend/regex/regc_pg_locale.c     |  45 ++--
 src/backend/utils/adt/formatting.c     |  25 +-
 src/backend/utils/adt/like.c           |   3 +-
 src/backend/utils/adt/like_support.c   |   3 +-
 src/backend/utils/adt/pg_locale.c      | 342 +++++++++++++++----------
 src/backend/utils/adt/varchar.c        |  62 ++---
 src/backend/utils/adt/varlena.c        |   8 +-
 src/backend/utils/init/postinit.c      |  29 ++-
 src/include/utils/pg_locale.h          |  55 +---
 src/include/utils/pg_locale_internal.h |  68 +++++
 12 files changed, 400 insertions(+), 323 deletions(-)
 create mode 100644 src/include/utils/pg_locale_internal.h

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index b8136e496f..6d9f014c5b 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -281,36 +281,28 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 						  VARSIZE_ANY_EXHDR(key));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -336,7 +328,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
 								   VARSIZE_ANY_EXHDR(key),
@@ -344,30 +336,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
+
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 81e54e0ce6..9e84da4891 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -36,6 +36,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 02d462a659..ac05efb558 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -17,6 +17,7 @@
 
 #include "catalog/pg_collation.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 /*
  * To provide as much functionality as possible on a variety of platforms,
@@ -306,13 +307,13 @@ pg_wc_isdigit(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswdigit_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isdigit_l((unsigned char) c, pg_regex_locale->info.lt));
+					isdigit_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -342,13 +343,13 @@ pg_wc_isalpha(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalpha_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalpha_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalpha_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalpha_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -378,13 +379,13 @@ pg_wc_isalnum(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalnum_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalnum_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalnum_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -423,13 +424,13 @@ pg_wc_isupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isupper_l((unsigned char) c, pg_regex_locale->info.lt));
+					isupper_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -459,13 +460,13 @@ pg_wc_islower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					islower_l((unsigned char) c, pg_regex_locale->info.lt));
+					islower_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -495,13 +496,13 @@ pg_wc_isgraph(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswgraph_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswgraph_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isgraph_l((unsigned char) c, pg_regex_locale->info.lt));
+					isgraph_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -531,13 +532,13 @@ pg_wc_isprint(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswprint_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswprint_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isprint_l((unsigned char) c, pg_regex_locale->info.lt));
+					isprint_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -567,13 +568,13 @@ pg_wc_ispunct(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswpunct_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					ispunct_l((unsigned char) c, pg_regex_locale->info.lt));
+					ispunct_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -603,13 +604,13 @@ pg_wc_isspace(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswspace_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswspace_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isspace_l((unsigned char) c, pg_regex_locale->info.lt));
+					isspace_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -647,13 +648,13 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return towupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
+				return toupper_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
@@ -691,13 +692,13 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return towlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
+				return tolower_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 65746c48d2..3067431cbe 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -88,6 +88,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 
 /* ----------
@@ -1576,7 +1577,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 	status = U_ZERO_ERROR;
 	len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-					mylocale->info.icu.locale, &status);
+					mylocale->ctype, &status);
 	if (status == U_BUFFER_OVERFLOW_ERROR)
 	{
 		/* try again with adjusted length */
@@ -1584,7 +1585,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 		status = U_ZERO_ERROR;
 		len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-						mylocale->info.icu.locale, &status);
+						mylocale->ctype, &status);
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
@@ -1697,7 +1698,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towlower(workspace[curr_char]);
@@ -1730,7 +1731,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1819,7 +1820,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towupper(workspace[curr_char]);
@@ -1852,7 +1853,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.lt);
+						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1944,10 +1945,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
@@ -1989,10 +1990,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.lt);
-						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.lt);
+							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
 #endif
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 8e671b9fab..98714a0492 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -24,6 +24,7 @@
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 
 #define LIKE_TRUE						1
@@ -96,7 +97,7 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 		return pg_ascii_tolower(c);
 #ifdef HAVE_LOCALE_T
 	else if (locale)
-		return tolower_l(c, locale->info.lt);
+		return tolower_l(c, locale->info.libc.lt);
 #endif
 	else
 		return pg_tolower(c);
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 2d3aaaaf6b..28d23ac3ab 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -52,6 +52,7 @@
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/selfuncs.h"
 #include "utils/varlena.h"
 
@@ -1511,7 +1512,7 @@ pattern_char_isalpha(char c, bool is_multibyte,
 			(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
 #ifdef HAVE_LOCALE_T
 	else if (locale && locale->provider == COLLPROVIDER_LIBC)
-		return isalpha_l((unsigned char) c, locale->info.lt);
+		return isalpha_l((unsigned char) c, locale->info.libc.lt);
 #endif
 	else
 		return isalpha((unsigned char) c);
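
Note on the libc branch of pg_newlocale() in the pg_locale.c changes: when
collate and ctype name different locales, the locale_t is built in two
newlocale() steps, with the second call using the first result as its base.
A minimal standalone sketch of that composition follows. It assumes a
POSIX.1-2008 platform whose <locale.h> declares newlocale() and freelocale()
(some platforms need <xlocale.h> instead); make_combined_locale() is a
hypothetical name, not something the patch adds, and unlike the patch it
frees the intermediate locale when the second step fails.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): build one locale_t whose
 * LC_COLLATE and LC_CTYPE categories may come from different locale names.
 * Returns (locale_t) 0 on failure, with errno set by newlocale().
 */
static locale_t
make_combined_locale(const char *collate, const char *ctype)
{
	locale_t	base;
	locale_t	combined;

	assert(collate != NULL && ctype != NULL);

	/* Common case: one call covers both categories. */
	if (strcmp(collate, ctype) == 0)
		return newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
						 (locale_t) 0);

	base = newlocale(LC_COLLATE_MASK, collate, (locale_t) 0);
	if (base == (locale_t) 0)
		return (locale_t) 0;

	/* On success newlocale() takes over 'base'; on failure it is untouched. */
	combined = newlocale(LC_CTYPE_MASK, ctype, base);
	if (combined == (locale_t) 0)
		freelocale(base);
	return combined;
}
```

A caller would pass the result to the *_l() functions, for example
strcoll_l(a, b, loc), and release it with freelocale() when done.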
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 663286163e..d8df2a3832 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -65,6 +65,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/syscache.h"
 
 #ifdef USE_ICU
@@ -131,6 +132,11 @@ static HTAB *collation_cache = NULL;
 static char *IsoLocaleName(const char *);
 #endif
 
+/*
+ * Database default locale.
+ */
+static pg_locale_t default_locale = NULL;
+
 #ifdef USE_ICU
 /*
  * Converter object for converting between ICU's UChar strings and C strings
@@ -1336,7 +1342,7 @@ lc_collate_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1389,7 +1395,7 @@ lc_ctype_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1420,38 +1426,6 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
-struct pg_locale_struct default_locale;
-
-void
-make_icu_collator(const char *iculocstr,
-				  struct pg_locale_struct *resultp)
-{
-#ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
-
-	/* We will leak this string if the caller errors later :-( */
-	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
-	resultp->info.icu.ucol = collator;
-#else							/* not USE_ICU */
-	/* could get here if a collation was created by a build with ICU */
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif							/* not USE_ICU */
-}
-
-
 /* simple subroutine for reporting errors from newlocale() */
 #ifdef HAVE_LOCALE_T
 static void
@@ -1485,6 +1459,163 @@ report_newlocale_failure(const char *localename)
 #endif							/* HAVE_LOCALE_T */
 
 
+/*
+ * Construct a new pg_locale_t object.
+ *
+ * The version may be NULL. Even when a version is given, the resulting
+ * locale is not guaranteed to match it exactly. All other parameters are
+ * required.
+ *
+ * Ordinarily, collate and ctype should be the same. If the provider is ICU,
+ * this is a requirement, and the 'collate' and 'ctype' arguments should both
+ * come from colliculocale (or daticulocale). If the provider is libc, the
+ * arguments should come from collcollate and collctype (or datcollate and
+ * datctype), respectively.
+ *
+ * Structures are allocated in TopMemoryContext, and the libc locale_t or
+ * UCollator is not allocated in any memory context, so the result is
+ * effectively permanent.
+ */
+static pg_locale_t
+pg_newlocale(char provider, bool deterministic, const char *collate,
+			 const char *ctype, const char *version)
+{
+	pg_locale_t result = MemoryContextAlloc(TopMemoryContext,
+											sizeof(struct pg_locale_struct));
+
+	/*
+	 * If COLLPROVIDER_DEFAULT, caller should use default_locale or NULL
+	 * instead.
+	 */
+	Assert(provider != COLLPROVIDER_DEFAULT);
+
+	result->provider = provider;
+	result->deterministic = deterministic;
+	result->collate = MemoryContextStrdup(TopMemoryContext, collate);
+	result->ctype = MemoryContextStrdup(TopMemoryContext, ctype);
+
+	if (provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		locale_t        loc;
+
+		/* newlocale's result may be leaked if we encounter an error */
+
+		if (strcmp(collate, ctype) == 0)
+		{
+			/* Normal case where they're the same */
+			errno = 0;
+#ifndef WIN32
+			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+							NULL);
+#else
+			loc = _create_locale(LC_ALL, collate);
+#endif
+			if (!loc)
+				report_newlocale_failure(collate);
+		}
+		else
+		{
+#ifndef WIN32
+			/* We need two newlocale() steps */
+			locale_t	loc1;
+
+			errno = 0;
+			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			if (!loc1)
+				report_newlocale_failure(collate);
+			errno = 0;
+			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			if (!loc)
+				report_newlocale_failure(ctype);
+#else
+
+			/*
+			 * XXX The _create_locale() API doesn't appear to support
+			 * this. Could perhaps be worked around by changing
+			 * pg_locale_t to contain two separate fields.
+			 */
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("collations with different collate and ctype values are not supported on this platform")));
+#endif
+		}
+
+		result->info.libc.lt = loc;
+#else							/* not HAVE_LOCALE_T */
+		/* platform that doesn't support locale_t */
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("collation provider LIBC is not supported on this platform")));
+#endif							/* not HAVE_LOCALE_T */
+	}
+#ifdef USE_ICU
+	else if (provider == COLLPROVIDER_ICU)
+	{
+		UCollator  *collator;
+		UErrorCode	status;
+
+		/* collator may be leaked if we encounter an error */
+
+		status = U_ZERO_ERROR;
+		collator = ucol_open(collate, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							collate, u_errorName(status))));
+
+		if (U_ICU_VERSION_MAJOR_NUM < 54)
+			icu_set_collation_attributes(collator, collate);
+
+		result->info.icu.ucol = collator;
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", provider);
+
+	return result;
+}
+
+/*
+ * Accessor so that callers don't need to include pg_locale_internal.h.
+ */
+bool
+pg_locale_deterministic(pg_locale_t locale)
+{
+	if (locale == NULL)
+		return true;
+	else
+		return locale->deterministic;
+}
+
+/*
+ * Initialize default database locale.
+ */
+void
+init_default_locale(char provider, const char *collate, const char *ctype,
+					const char *iculocale, const char *version)
+{
+	/*
+	 * For the purposes of pg_locale_t, if the provider is ICU, we use
+	 * iculocale for both collate and ctype.
+	 */
+	if (provider == COLLPROVIDER_ICU)
+	{
+		collate = iculocale;
+		ctype = iculocale;
+	}
+	else
+		Assert(iculocale == NULL);
+
+	/*
+	 * Default locale is currently always deterministic.  Nondeterministic
+	 * locales currently don't support pattern matching, which would break a
+	 * lot of things if applied globally.
+	 */
+	default_locale = pg_newlocale(provider, true, collate, ctype, version);
+}
+
 /*
  * Create a locale_t from a collation OID.  Results are cached for the
  * lifetime of the backend.  Thus, do not free the result with freelocale().
@@ -1509,8 +1640,8 @@ pg_newlocale_from_collation(Oid collid)
 
 	if (collid == DEFAULT_COLLATION_OID)
 	{
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return &default_locale;
+		if (default_locale->provider == COLLPROVIDER_ICU)
+			return default_locale;
 		else
 			return (pg_locale_t) 0;
 	}
@@ -1522,107 +1653,62 @@ pg_newlocale_from_collation(Oid collid)
 		/* We haven't computed this yet in this session, so do it */
 		HeapTuple	tp;
 		Form_pg_collation collform;
-		struct pg_locale_struct result;
-		pg_locale_t resultp;
+		pg_locale_t locale;
 		Datum		datum;
 		bool		isnull;
+		char	   *collate;
+		char	   *ctype;
+		char	   *collversionstr;
 
 		tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collid));
 		if (!HeapTupleIsValid(tp))
 			elog(ERROR, "cache lookup failed for collation %u", collid);
 		collform = (Form_pg_collation) GETSTRUCT(tp);
 
-		/* We'll fill in the result struct locally before allocating memory */
-		memset(&result, 0, sizeof(result));
-		result.provider = collform->collprovider;
-		result.deterministic = collform->collisdeterministic;
+		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
+								&isnull);
+		if (!isnull)
+			collversionstr = TextDatumGetCString(datum);
+		else
+			collversionstr = NULL;
 
 		if (collform->collprovider == COLLPROVIDER_LIBC)
 		{
-#ifdef HAVE_LOCALE_T
-			const char *collcollate;
-			const char *collctype pg_attribute_unused();
-			locale_t	loc;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate,
+									&isnull);
 			Assert(!isnull);
-			collcollate = TextDatumGetCString(datum);
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype, &isnull);
+			collate = TextDatumGetCString(datum);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype,
+									&isnull);
 			Assert(!isnull);
-			collctype = TextDatumGetCString(datum);
-
-			if (strcmp(collcollate, collctype) == 0)
-			{
-				/* Normal case where they're the same */
-				errno = 0;
-#ifndef WIN32
-				loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collcollate,
-								NULL);
-#else
-				loc = _create_locale(LC_ALL, collcollate);
-#endif
-				if (!loc)
-					report_newlocale_failure(collcollate);
-			}
-			else
-			{
-#ifndef WIN32
-				/* We need two newlocale() steps */
-				locale_t	loc1;
-
-				errno = 0;
-				loc1 = newlocale(LC_COLLATE_MASK, collcollate, NULL);
-				if (!loc1)
-					report_newlocale_failure(collcollate);
-				errno = 0;
-				loc = newlocale(LC_CTYPE_MASK, collctype, loc1);
-				if (!loc)
-					report_newlocale_failure(collctype);
-#else
-
-				/*
-				 * XXX The _create_locale() API doesn't appear to support
-				 * this. Could perhaps be worked around by changing
-				 * pg_locale_t to contain two separate fields.
-				 */
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("collations with different collate and ctype values are not supported on this platform")));
-#endif
-			}
-
-			result.info.lt = loc;
-#else							/* not HAVE_LOCALE_T */
-			/* platform that doesn't support locale_t */
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("collation provider LIBC is not supported on this platform")));
-#endif							/* not HAVE_LOCALE_T */
+			ctype = TextDatumGetCString(datum);
 		}
+#ifdef USE_ICU
 		else if (collform->collprovider == COLLPROVIDER_ICU)
 		{
-			const char *iculocstr;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale,
+									&isnull);
 			Assert(!isnull);
-			iculocstr = TextDatumGetCString(datum);
-			make_icu_collator(iculocstr, &result);
+			collate = TextDatumGetCString(datum);
+
+			/* for ICU, collate and ctype are both set from iculocale */
+			ctype = collate;
 		}
+#endif
+		else
+			/* shouldn't happen */
+			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
-								&isnull);
-		if (!isnull)
+		locale = pg_newlocale(collform->collprovider,
+							  collform->collisdeterministic,
+							  collate, ctype, collversionstr);
+
+		if (collversionstr != NULL)
 		{
 			char	   *actual_versionstr;
-			char	   *collversionstr;
 
-			collversionstr = TextDatumGetCString(datum);
-
-			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
-			Assert(!isnull);
+			actual_versionstr = get_collation_actual_version(collform->collprovider, collate);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
 			if (!actual_versionstr)
 			{
 				/*
@@ -1649,13 +1735,9 @@ pg_newlocale_from_collation(Oid collid)
 															NameStr(collform->collname)))));
 		}
 
 		ReleaseSysCache(tp);
 
-		/* We'll keep the pg_locale_t structures in TopMemoryContext */
-		resultp = MemoryContextAlloc(TopMemoryContext, sizeof(*resultp));
-		*resultp = result;
-
-		cache_entry->locale = resultp;
+		cache_entry->locale = locale;
 	}
 
 	return cache_entry->locale;
@@ -1815,7 +1897,7 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
 	else
 #endif
 		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
@@ -1858,7 +1940,7 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.lt);
+		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
@@ -2111,7 +2193,7 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 #ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.lt);
+		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
 	else
 #endif
 		return strxfrm(dest, src, destsize);
@@ -2792,10 +2874,10 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.lt);
+		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.lt);
+		locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 		result = wcstombs(to, from, tolen);
 
@@ -2869,10 +2951,10 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.lt);
+			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.lt);
+			locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 			result = mbstowcs(to, str, tolen);
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 1963d4dc4e..9782d02113 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -762,7 +762,7 @@ bpchareq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -807,7 +807,7 @@ bpcharne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -1015,33 +1015,25 @@ hashbpchar(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) keydata, keylen);
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any((uint8_t *) buf, bsize);
+		result = hash_any((uint8_t *) buf, bsize);
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -1072,35 +1064,27 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) keydata, keylen,
 								   PG_GETARG_INT64(1));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
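
Note on the hashing change above: for a nondeterministic collation the hash
is taken over the transformed sort key rather than the raw bytes, using a
size query followed by a second call that fills the buffer. The sketch below
shows the same two-pass shape with plain ISO C strxfrm(), which works on the
process-wide locale and NUL-terminated strings, so it is only an analogy for
pg_strnxfrm(). hash_sort_key() and fnv1a() are hypothetical names, and FNV-1a
merely stands in for hash_any().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a, used here only as a stand-in for PostgreSQL's hash_any(). */
static uint32_t
fnv1a(const unsigned char *data, size_t len)
{
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Hypothetical helper: hash the sort key of 'src' instead of its raw bytes.
 * Returns 0 on success and -1 on failure.
 */
static int
hash_sort_key(const char *src, uint32_t *hash)
{
	size_t		need;
	size_t		got;
	char	   *buf;

	assert(src != NULL && hash != NULL);

	/* First pass: ask how many bytes the key needs, excluding the NUL. */
	need = strxfrm(NULL, src, 0);
	buf = malloc(need + 1);
	if (buf == NULL)
		return -1;

	/* Second pass: produce the key; its size must agree with the first. */
	got = strxfrm(buf, src, need + 1);
	if (got != need)
	{
		free(buf);
		return -1;
	}

	*hash = fnv1a((const unsigned char *) buf, need);
	free(buf);
	return 0;
}
```

Strings that the collation treats as equal produce the same sort key, so
they land in the same hash bucket even when their bytes differ.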
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 8da545841f..0408995791 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1205,7 +1205,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -1603,7 +1603,7 @@ texteq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1662,7 +1662,7 @@ textne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1776,7 +1776,7 @@ text_starts_with(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
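
Note on the pg_locale_deterministic() conversions above: each call site
previously spelled out "!mylocale || mylocale->deterministic" (or its
negation), and the accessor folds the NULL case into one place. A toy
version of that rule, assuming a stand-in struct with only the two fields
needed here (toy_locale and both function names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct pg_locale_struct; the field set is an assumption. */
struct toy_locale
{
	char		provider;
	bool		deterministic;
};

/* NULL stands for the default locale, which is treated as deterministic. */
static bool
toy_locale_deterministic(const struct toy_locale *locale)
{
	return locale == NULL || locale->deterministic;
}

/* Bytewise equality is a valid shortcut only for deterministic collations. */
static bool
can_compare_bytewise(bool locale_is_c, const struct toy_locale *locale)
{
	return locale_is_c || toy_locale_deterministic(locale);
}
```

With this shape, callers outside pg_locale.c no longer need the struct
definition at all, which is what lets it move to pg_locale_internal.h.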
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a990c833c5..ead7922c4e 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -317,6 +317,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	char	   *collate;
 	char	   *ctype;
 	char	   *iculocale;
+	char	   *collversionstr;
 
 	/* Fetch our pg_database row normally, via syscache */
 	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
@@ -424,35 +425,33 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_daticulocale, &isnull);
 		Assert(!isnull);
 		iculocale = TextDatumGetCString(datum);
-		make_icu_collator(iculocale, &default_locale);
 	}
 	else
 		iculocale = NULL;
 
-	default_locale.provider = dbform->datlocprovider;
+	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
+							&isnull);
+	if (!isnull)
+		collversionstr = TextDatumGetCString(datum);
+	else
+		collversionstr = NULL;
 
-	/*
-	 * Default locale is currently always deterministic.  Nondeterministic
-	 * locales currently don't support pattern matching, which would break a
-	 * lot of things if applied globally.
-	 */
-	default_locale.deterministic = true;
+	init_default_locale(dbform->datlocprovider, collate, ctype, iculocale,
+						collversionstr);
 
 	/*
 	 * Check collation version.  See similar code in
 	 * pg_newlocale_from_collation().  Note that here we warn instead of error
 	 * in any case, so that we don't prevent connecting.
 	 */
-	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
-							&isnull);
-	if (!isnull)
+	if (collversionstr != NULL)
 	{
 		char	   *actual_versionstr;
-		char	   *collversionstr;
 
-		collversionstr = TextDatumGetCString(datum);
+		actual_versionstr = get_collation_actual_version(
+			dbform->datlocprovider,
+			dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
@@ -470,6 +469,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 							 "or build PostgreSQL with the right library version.",
 							 quote_identifier(name))));
 	}
+	else
+		collversionstr = NULL;
 
 	/* Make the locale settings visible as GUC variables, too */
 	SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 2bd96ab7e1..79cfacb302 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,22 +15,6 @@
 #if defined(LOCALE_T_IN_XLOCALE) || defined(WCSTOMBS_L_IN_XLOCALE)
 #include <xlocale.h>
 #endif
-#ifdef USE_ICU
-#include <unicode/ucol.h>
-#endif
-
-#ifdef USE_ICU
-/*
- * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
- * (see
- * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
- */
-#if U_ICU_VERSION_MAJOR_NUM >= 53
-#define HAVE_UCOL_STRCOLLUTF8 1
-#else
-#undef HAVE_UCOL_STRCOLLUTF8
-#endif
-#endif
 
 /* use for libc locale names */
 #define LOCALE_NAME_BUFLEN 128
@@ -65,39 +49,12 @@ extern struct lconv *PGLC_localeconv(void);
 extern void cache_locale_time(void);
 
 
-/*
- * We define our own wrapper around locale_t so we can keep the same
- * function signatures for all builds, while not having to create a
- * fake version of the standard type locale_t in the global namespace.
- * pg_locale_t is occasionally checked for truth, so make it a pointer.
- */
-struct pg_locale_struct
-{
-	char		provider;
-	bool		deterministic;
-	union
-	{
-#ifdef HAVE_LOCALE_T
-		locale_t	lt;
-#endif
-#ifdef USE_ICU
-		struct
-		{
-			const char *locale;
-			UCollator  *ucol;
-		}			icu;
-#endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
-	}			info;
-};
-
 typedef struct pg_locale_struct *pg_locale_t;
 
-extern PGDLLIMPORT struct pg_locale_struct default_locale;
-
-extern void make_icu_collator(const char *iculocstr,
-							  struct pg_locale_struct *resultp);
-
+extern void init_default_locale(char provider, const char *collate,
+								const char *ctype, const char *iculocale,
+								const char *version);
+extern bool pg_locale_deterministic(pg_locale_t locale);
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
@@ -115,10 +72,6 @@ extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
 extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
-#ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
 extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
new file mode 100644
index 0000000000..33465ad92d
--- /dev/null
+++ b/src/include/utils/pg_locale_internal.h
@@ -0,0 +1,68 @@
+/*-----------------------------------------------------------------------
+ *
+ * PostgreSQL locale utilities
+ *
+ * src/include/utils/pg_locale_internal.h
+ *
+ * Copyright (c) 2002-2022, PostgreSQL Global Development Group
+ *
+ *-----------------------------------------------------------------------
+ */
+
+
+#ifndef _PG_LOCALE_INTERNAL_
+#define _PG_LOCALE_INTERNAL_
+
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
+
+#ifdef USE_ICU
+/*
+ * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
+ * (see
+ * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
+ */
+#if U_ICU_VERSION_MAJOR_NUM >= 53
+#define HAVE_UCOL_STRCOLLUTF8 1
+#else
+#undef HAVE_UCOL_STRCOLLUTF8
+#endif
+#endif
+
+/*
+ * We define our own wrapper around locale_t so we can keep the same
+ * function signatures for all builds, while not having to create a
+ * fake version of the standard type locale_t in the global namespace.
+ * pg_locale_t is occasionally checked for truth, so make it a pointer.
+ */
+struct pg_locale_struct
+{
+	char		provider;
+	bool		deterministic;
+	char	   *collate;
+	char	   *ctype;
+	union
+	{
+#ifdef HAVE_LOCALE_T
+		struct
+		{
+			locale_t	lt;
+		}			libc;
+#endif
+#ifdef USE_ICU
+		struct
+		{
+			UCollator	*ucol;
+		}			icu;
+#endif
+		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
+	}			info;
+};
+
+#ifdef USE_ICU
+extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+#endif
+
+#endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1
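
For readers skimming the patch above: it moves the definition of struct pg_locale_struct out of pg_locale.h into the new pg_locale_internal.h, and callers that only need the "deterministic" flag go through pg_locale_deterministic() instead of reading the field. Below is a minimal, self-contained sketch of that opaque-handle arrangement. Every demo_* name is hypothetical and is not part of the patch; treating a NULL handle as deterministic, and accepting only the provider codes 'c' and 'i', are assumptions of this sketch rather than something this excerpt shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* ---- what a public header would expose (compare pg_locale.h) ---- */
struct demo_locale_struct;		/* incomplete type: fields are hidden */
typedef struct demo_locale_struct *demo_locale_t;

demo_locale_t demo_locale_create(char provider, bool deterministic);
bool		demo_locale_deterministic(demo_locale_t locale);
void		demo_locale_free(demo_locale_t locale);

/* ---- what an internal header would define (compare pg_locale_internal.h) ---- */
struct demo_locale_struct
{
	char		provider;		/* 'c' = libc, 'i' = ICU (assumed codes) */
	bool		deterministic;
};

demo_locale_t
demo_locale_create(char provider, bool deterministic)
{
	demo_locale_t locale;

	/* hypothetical sanity check on the provider code */
	assert(provider == 'c' || provider == 'i');

	locale = malloc(sizeof(*locale));
	if (locale == NULL)
		return NULL;

	locale->provider = provider;
	locale->deterministic = deterministic;
	return locale;
}

bool
demo_locale_deterministic(demo_locale_t locale)
{
	/* assumption of this sketch: NULL means "default locale", deterministic */
	if (locale == NULL)
		return true;
	return locale->deterministic;
}

void
demo_locale_free(demo_locale_t locale)
{
	free(locale);				/* free(NULL) is a no-op */
}
```

Code that includes only the public declarations can pass a demo_locale_t around and query it, but cannot depend on the struct layout, so provider-specific members can change without touching those callers.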

v5-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch (text/x-patch; charset=UTF-8)

From b17a12570c36ccd040ef6dc83f30b0a0234dba8b Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:41:38 -0800
Subject: [PATCH v5 2/7] Add pg_strxfrm() and pg_strxfrm_prefix().

Callers with a NUL-terminated string should call pg_strxfrm() or
pg_strxfrm_prefix(); callers with a string and length should call
pg_strnxfrm() or pg_strnxfrm_prefix().

Also remove the TRUST_STRXFRM define, and replace with a developer GUC
for easier testing.
---
 src/backend/access/hash/hashfunc.c  |  45 ++--
 src/backend/utils/adt/pg_locale.c   | 381 ++++++++++++++++++++++++++++
 src/backend/utils/adt/varchar.c     |  41 ++-
 src/backend/utils/adt/varlena.c     | 142 +++--------
 src/backend/utils/misc/guc_tables.c |  11 +
 src/include/utils/pg_locale.h       |  11 +
 6 files changed, 481 insertions(+), 150 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index f890f79ee1..b8136e496f 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -291,21 +291,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -349,21 +347,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 6cd629ecb4..663286163e 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -94,6 +94,9 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool trust_strxfrm = false;
+
 /*
  * lc_time localization cache.
  *
@@ -2099,6 +2102,384 @@ pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 }
 
 
+static size_t
+pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
+				pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef TRUST_STRXFRM
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		return strxfrm_l(dest, src, destsize, locale->info.lt);
+	else
+#endif
+		return strxfrm(dest, src, destsize);
+#else
+	/* shouldn't happen */
+	elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+}
+
+static size_t
+pg_strnxfrm_libc(char *dest, const char *src, size_t srclen, size_t destsize,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	 = sbuf;
+	size_t	 bufsize = srclen + 1;
+	size_t	 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+	if (bufsize > TEXTBUFLEN)
+		buf = palloc(bufsize);
+
+	/* nul-terminate arguments */
+	memcpy(buf, src, srclen);
+	buf[srclen] = '\0';
+
+	result = pg_strxfrm_libc(dest, buf, destsize, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_libc(char *dest, const char *src, size_t destsize,
+					   pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strxfrm_prefix()",
+		 locale->provider);
+}
+
+static size_t
+pg_strnxfrm_prefix_libc(char *dest, const char *src, size_t srclen,
+						size_t destsize, pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strnxfrm_prefix()",
+		 locale->provider);
+}
+
+#ifdef USE_ICU
+
+static size_t
+pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
+				pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	= sbuf;
+	UChar	*uchar;
+	int32_t	 ulen;
+	size_t   uchar_bsize;
+	Size	 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
+								   uchar, ulen,
+								   (uint8_t *) dest, destsize);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result_bsize;
+}
+
+static size_t
+pg_strxfrm_icu(char *dest, const char *src, size_t destsize,
+			   pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_icu(dest, src, -1, destsize, locale);
+}
+
+static size_t
+pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
+							   size_t destsize, pg_locale_t locale)
+{
+	char			 sbuf[TEXTBUFLEN];
+	char			*buf   = sbuf;
+	UCharIterator	 iter;
+	uint32_t		 state[2];
+	UErrorCode		 status;
+	int32_t			 ulen  = -1;
+	UChar			*uchar = NULL;
+	size_t			 uchar_bsize;
+	Size			 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	uiter_setString(&iter, uchar, ulen);
+	state[0] = state[1] = 0;	/* won't need that again */
+	status = U_ZERO_ERROR;
+	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
+										&iter,
+										state,
+										(uint8_t *) dest,
+										destsize,
+										&status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("sort key generation failed: %s",
+						u_errorName(status))));
+
+	return result_bsize;
+}
+
+static size_t
+pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
+					   size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UCharIterator iter;
+		uint32_t	state[2];
+		UErrorCode	status;
+
+		uiter_setUTF8(&iter, src, srclen);
+		state[0] = state[1] = 0;	/* won't need that again */
+		status = U_ZERO_ERROR;
+		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
+									  &iter,
+									  state,
+									  (uint8_t *) dest,
+									  destsize,
+									  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("sort key generation failed: %s",
+							u_errorName(status))));
+	}
+	else
+		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
+												locale);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
+					  pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_prefix_icu(dest, src, -1, destsize, locale);
+}
+
+#endif
+
+/*
+ * Return true if the collation provider supports pg_strxfrm() and
+ * pg_strnxfrm(); otherwise false.
+ *
+ * Unfortunately, it seems that strxfrm() for non-C collations is broken on
+ * many common platforms; testing of multiple versions of glibc reveals that,
+ * for many locales, strcoll() and strxfrm() do not return consistent
+ * results. While no libc other than Cygwin's has so far been shown to
+ * have a problem, we take the conservative course of action for right now and
+ * disable this categorically.  (Users who are certain this isn't a problem on
+ * their system can set the developer GUC "trust_strxfrm".)
+ *
+ * No similar problem is known for the ICU provider.
+ */
+bool
+pg_strxfrm_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return trust_strxfrm;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ */
+size_t
+pg_strxfrm(char *dest, const char *src, size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm() instead.
+ */
+size_t
+pg_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen,
+			pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * Return true if the collation provider supports pg_strxfrm_prefix() and
+ * pg_strnxfrm_prefix(); otherwise false.
+ */
+bool
+pg_strxfrm_prefix_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return false;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ */
+size_t
+pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+				  pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_prefix_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_prefix_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' need not be nul-terminated; 'srclen' gives its length.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm_prefix() instead.
+ */
+size_t
+pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+				   size_t srclen, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_prefix_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_prefix_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
 #ifdef USE_ICU
 static void
 init_icu_converter(void)
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 01a2db6b23..1963d4dc4e 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1024,21 +1024,17 @@ hashbpchar(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -1086,21 +1082,18 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 52dd0bbba8..8da545841f 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1889,20 +1889,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
@@ -1918,13 +1904,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 
 	/*
 	 * Unfortunately, it seems that abbreviation for non-C collations is
-	 * broken on many common platforms; testing of multiple versions of glibc
-	 * reveals that, for many locales, strcoll() and strxfrm() do not return
-	 * consistent results, which is fatal to this optimization.  While no
-	 * other libc other than Cygwin has so far been shown to have a problem,
-	 * we take the conservative course of action for right now and disable
-	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * broken on many common platforms; see pg_strxfrm_enabled().
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -1937,10 +1917,8 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!collate_c && !pg_strxfrm_enabled(locale))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
@@ -2229,6 +2207,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 static Datum
 varstr_abbrev_convert(Datum original, SortSupport ssup)
 {
+	const size_t max_prefix_bytes = sizeof(Datum);
 	VarStringSortSupport *sss = (VarStringSortSupport *) ssup->ssup_extra;
 	VarString  *authoritative = DatumGetVarStringPP(original);
 	char	   *authoritative_data = VARDATA_ANY(authoritative);
@@ -2241,7 +2220,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
-	memset(pres, 0, sizeof(Datum));
+	memset(pres, 0, max_prefix_bytes);
 	len = VARSIZE_ANY_EXHDR(authoritative);
 
 	/* Get number of bytes, ignoring trailing spaces */
@@ -2276,14 +2255,10 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 	 * thing: explicitly consider string length.
 	 */
 	if (sss->collate_c)
-		memcpy(pres, authoritative_data, Min(len, sizeof(Datum)));
+		memcpy(pres, authoritative_data, Min(len, max_prefix_bytes));
 	else
 	{
 		Size		bsize;
-#ifdef USE_ICU
-		int32_t		ulen = -1;
-		UChar	   *uchar = NULL;
-#endif
 
 		/*
 		 * We're not using the C collation, so fall back on strxfrm or ICU
@@ -2301,7 +2276,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		if (sss->last_len1 == len && sss->cache_blob &&
 			memcmp(sss->buf1, authoritative_data, len) == 0)
 		{
-			memcpy(pres, sss->buf2, Min(sizeof(Datum), sss->last_len2));
+			memcpy(pres, sss->buf2, Min(max_prefix_bytes, sss->last_len2));
 			/* No change affecting cardinality, so no hashing required */
 			goto done;
 		}
@@ -2309,81 +2284,49 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		memcpy(sss->buf1, authoritative_data, len);
 
 		/*
-		 * Just like strcoll(), strxfrm() expects a NUL-terminated string. Not
-		 * necessary for ICU, but doesn't hurt.
+		 * pg_strxfrm() and pg_strxfrm_prefix() expect NUL-terminated
+		 * strings.
 		 */
 		sss->buf1[len] = '\0';
 		sss->last_len1 = len;
 
-#ifdef USE_ICU
-		/* When using ICU and not UTF8, convert string to UChar. */
-		if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU &&
-			GetDatabaseEncoding() != PG_UTF8)
-			ulen = icu_to_uchar(&uchar, sss->buf1, len);
-#endif
-
-		/*
-		 * Loop: Call strxfrm() or ucol_getSortKey(), possibly enlarge buffer,
-		 * and try again.  Both of these functions have the result buffer
-		 * content undefined if the result did not fit, so we need to retry
-		 * until everything fits, even though we only need the first few bytes
-		 * in the end.  When using ucol_nextSortKeyPart(), however, we only
-		 * ask for as many bytes as we actually need.
-		 */
-		for (;;)
+		if (pg_strxfrm_prefix_enabled(sss->locale))
 		{
-#ifdef USE_ICU
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+			if (sss->buflen2 < max_prefix_bytes)
 			{
-				/*
-				 * When using UTF8, use the iteration interface so we only
-				 * need to produce as many bytes as we actually need.
-				 */
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UCharIterator iter;
-					uint32_t	state[2];
-					UErrorCode	status;
-
-					uiter_setUTF8(&iter, sss->buf1, len);
-					state[0] = state[1] = 0;	/* won't need that again */
-					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
-				}
-				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+				sss->buflen2 = Max(max_prefix_bytes,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
 			}
-			else
-#endif
-#ifdef HAVE_LOCALE_T
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_LIBC)
-				bsize = strxfrm_l(sss->buf2, sss->buf1,
-								  sss->buflen2, sss->locale->info.lt);
-			else
-#endif
-				bsize = strxfrm(sss->buf2, sss->buf1, sss->buflen2);
-
-			sss->last_len2 = bsize;
-			if (bsize < sss->buflen2)
-				break;
 
+			bsize = pg_strxfrm_prefix(sss->buf2, sss->buf1,
+									  max_prefix_bytes, sss->locale);
+		}
+		else
+		{
 			/*
-			 * Grow buffer and retry.
+			 * Loop: Call pg_strxfrm(), possibly enlarge buffer, and try
+			 * again.  The pg_strxfrm() function leaves the result buffer
+			 * content undefined if the result did not fit, so we need to
+			 * retry until everything fits, even though we only need the first
+			 * few bytes in the end.
 			 */
-			sss->buflen2 = Max(bsize + 1,
-							   Min(sss->buflen2 * 2, MaxAllocSize));
-			sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			for (;;)
+			{
+				bsize = pg_strxfrm(sss->buf2, sss->buf1, sss->buflen2,
+								   sss->locale);
+
+				sss->last_len2 = bsize;
+				if (bsize < sss->buflen2)
+					break;
+
+				/*
+				 * Grow buffer and retry.
+				 */
+				sss->buflen2 = Max(bsize + 1,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			}
 		}
 
 		/*
@@ -2395,12 +2338,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		 * (Actually, even if there were NUL bytes in the blob it would be
 		 * okay.  See remarks on bytea case above.)
 		 */
-		memcpy(pres, sss->buf2, Min(sizeof(Datum), bsize));
-
-#ifdef USE_ICU
-		if (uchar)
-			pfree(uchar);
-#endif
+		memcpy(pres, sss->buf2, Min(max_prefix_bytes, bsize));
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 436afe1d21..03dfe9b7f1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1948,6 +1948,17 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"trust_strxfrm", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Allow use of strxfrm() for abbreviated keys optimization for libc provider."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&trust_strxfrm,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
 			gettext_noop("Whether to continue running after a failure to sync data files."),
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index bf70ae08ca..2bd96ab7e1 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -40,6 +40,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT bool trust_strxfrm;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -103,6 +104,16 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
 extern int pg_strncoll(const char *arg1, size_t len1,
 					   const char *arg2, size_t len2, pg_locale_t locale);
+extern bool pg_strxfrm_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
+						 pg_locale_t locale);
+extern size_t pg_strnxfrm(char *dest, size_t destsize, const char *src,
+						  size_t srclen, pg_locale_t locale);
+extern bool pg_strxfrm_prefix_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+								pg_locale_t locale);
+extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+								 size_t srclen, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
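
Two idioms recur in the patch above: pg_strnxfrm_libc() copies a length-delimited string into a NUL-terminated scratch buffer (on the stack when it fits, on the heap otherwise) before calling the libc routine, and the hash functions size their output by first calling pg_strnxfrm() with a NULL destination and zero size. The sketch below shows both idioms using only standard strxfrm() in the process's current locale. The demo_* names and DEMO_BUFLEN are hypothetical, malloc()/free() stand in for palloc()/pfree(), and there is no locale argument, so this illustrates the pattern and does not reproduce the patch's functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BUFLEN 1024		/* stack scratch size, like TEXTBUFLEN */

/*
 * Length-delimited wrapper around strxfrm(): NUL-terminate a copy of 'src'
 * in a scratch buffer, then transform it.  Returns what strxfrm() returns:
 * the number of bytes needed for the result, not counting the NUL.
 */
size_t
demo_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen)
{
	char		sbuf[DEMO_BUFLEN];
	char	   *buf = sbuf;
	size_t		result;

	/* fall back to the heap only when the stack buffer is too small */
	if (srclen + 1 > DEMO_BUFLEN)
	{
		buf = malloc(srclen + 1);
		if (buf == NULL)
			abort();
	}

	memcpy(buf, src, srclen);
	buf[srclen] = '\0';

	/* C permits dest == NULL when destsize == 0 */
	result = strxfrm(dest, buf, destsize);

	if (buf != sbuf)
		free(buf);

	return result;
}

/*
 * Two-pass sizing: ask how many bytes are needed, allocate, then transform
 * for real.  The caller frees the result.
 */
char *
demo_strnxfrm_alloc(const char *src, size_t srclen)
{
	size_t		needed = demo_strnxfrm(NULL, 0, src, srclen);
	char	   *dest = malloc(needed + 1);	/* +1 for the NUL */
	size_t		written;

	if (dest == NULL)
		abort();

	written = demo_strnxfrm(dest, needed + 1, src, srclen);
	assert(written == needed);	/* same input and locale: same size */

	return dest;
}
```

Comparing two buffers returned by demo_strnxfrm_alloc() with strcmp() orders them the way strcoll() orders the original strings in the same locale. Because strxfrm() reports the length without the terminating NUL, the allocation is needed + 1.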

v5-0001-Add-pg_strcoll-and-pg_strncoll.patch (text/x-patch; charset=UTF-8)

From 9f784cb67a652ee04b43d2a0cfc708a9bcd98a11 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v5 1/7] Add pg_strcoll() and pg_strncoll().

Callers with NUL-terminated strings should call the former; callers
with strings and their length should call the latter.
---
 src/backend/utils/adt/pg_locale.c | 420 ++++++++++++++++++++++++++++--
 src/backend/utils/adt/varlena.c   | 230 +---------------
 src/include/utils/pg_locale.h     |   3 +
 3 files changed, 406 insertions(+), 247 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2b42d9ccd8..6cd629ecb4 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -123,6 +129,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1731,15 +1750,356 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * pg_strncoll_libc_win32_utf8
+ *
+ * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
+ * invoke wcscoll() or wcscoll_l().
+ */
+#ifdef WIN32
+static int
+pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
+							size_t len2, pg_locale_t locale)
+{
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
+	char	   *a1p,
+			   *a2p;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
+	int			r;
+	int			result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+#ifndef WIN32
+	Assert(false);
+#endif
+
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * pg_strcoll_libc
+ *
+ * Call strcoll(), strcoll_l(), wcscoll(), or wcscoll_l() as appropriate for
+ * the given locale, platform, and database encoding. If the locale is NULL,
+ * use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ */
+static int
+pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+#ifdef WIN32
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_libc
+ *
+ * Null-terminate the arguments and call pg_strcoll_libc().
+ */
+static int
+pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	  = sbuf;
+	size_t	 bufsize1 = len1 + 1;
+	size_t	 bufsize2 = len2 + 1;
+	char	*arg1n;
+	char	*arg2n;
+	int		 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef WIN32
+	/* check for this case before doing the work for nul-termination */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+#endif							/* WIN32 */
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	arg1n = buf;
+	arg2n = buf + bufsize1;
+
+	/* nul-terminate arguments */
+	memcpy(arg1n, arg1, len1);
+	arg1n[len1] = '\0';
+	memcpy(arg2n, arg2, len2);
+	arg2n[len2] = '\0';
+
+	result = pg_strcoll_libc(arg1n, arg2n, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
 
 #ifdef USE_ICU
+
 /*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
+ * pg_strncoll_icu_no_utf8
+ *
+ * Convert the arguments from the database encoding to UChar strings, then
+ * call ucol_strcoll().
+ *
+ * When the database encoding is UTF-8, and ICU supports ucol_strcollUTF8(),
+ * caller should call that instead.
  */
-static UConverter *icu_converter = NULL;
+static int
+pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
+						const char *arg2, size_t len2, pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+#endif
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
 
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_icu
+ *
+ * Call ucol_strcollUTF8() or ucol_strcoll() as appropriate for the given
+ * database encoding.
+ *
+ * Arguments must be encoded in the database encoding.
+ */
+static int
+pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				pg_locale_t locale)
+{
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		result = pg_strncoll_icu_no_utf8(arg1, len1, arg2, len2, locale);
+	}
+
+	return result;
+}
+
+/*
+ * pg_strcoll_icu
+ *
+ * Calculate the string lengths and call pg_strncoll_icu().
+ */
+static int
+pg_strcoll_icu(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strncoll_icu(arg1, -1, arg2, -1, locale);
+}
+
+#endif							/* USE_ICU */
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * If the collation is deterministic, break ties with strcmp().
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strcoll_libc(arg1, arg2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strcoll_icu(arg1, arg2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+		result = strcmp(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * If the collation is deterministic, break ties with memcmp(), and then with
+ * the string length.
+ *
+ * This function may need to nul-terminate the arguments for libc functions;
+ * so if the caller already has nul-terminated strings, it should call
+ * pg_strcoll() instead.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strncoll_libc(arg1, len1, arg2, len2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strncoll_icu(arg1, len1, arg2, len2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+	{
+		result = memcmp(arg1, arg2, Min(len1, len2));
+		if ((result == 0) && (len1 != len2))
+			result = (len1 < len2) ? -1 : 1;
+	}
+
+	return result;
+}
+
+
+#ifdef USE_ICU
 static void
 init_icu_converter(void)
 {
@@ -1767,6 +2127,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	status = U_ZERO_ERROR;
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -1782,26 +2175,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 1c52deec55..52dd0bbba8 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1537,10 +1537,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1557,171 +1553,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
-
-		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 	}
 
 	return result;
@@ -2379,65 +2211,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
-
-	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
-		result = strcmp(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
 	sss->cache_blob = false;
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index a875942123..bf70ae08ca 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -100,6 +100,9 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1

Peter Eisentraut

peter.eisentraut@enterprisedb.com

about 3 years ago

In reply to: Jeff Davis (#5)

Re: Rework of collation code, extensibility

On 22.12.22 06:40, Jeff Davis wrote:

On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2] in
a new thread given its new scope.

Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

The libc hook support is still experimental, but what's working is
passing in CI, even on windows. The challenges with libc hook support
are:

* It obviously doesn't replace all of libc, so the separation is not
as clean and there are a number of callers throughout the code that
don't necessarily care about specific collations.

* libc relies on setlocale() / uselocale(), which is global state and
not as easy to track.

* More platform issues (obviously) and harder to test.

I'm confused by this patch set.

It combines some refactoring that was previously posted with partial
support for multiple ICU libraries with partial support for some new
hooks. Shouldn't those be three separate threads? I think the multiple
ICU libraries work already has a separate thread; how does this relate
to that work? I don't know what the hooks are supposed to be for? What
other locale libraries are you thinking about using this way? How can
we assess whether these interfaces are sufficient for that? The
refactoring patches don't look convincing just by looking at the numbers:

3 files changed, 406 insertions(+), 247 deletions(-)
6 files changed, 481 insertions(+), 150 deletions(-)
12 files changed, 400 insertions(+), 323 deletions(-)

My sense is this is trying to do too many things at once, and those
things are each not fully developed yet.

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#6)

Re: Rework of collation code, extensibility

On Wed, 2023-01-04 at 22:46 +0100, Peter Eisentraut wrote:

It combines some refactoring that was previously posted with partial
support for multiple ICU libraries with partial support for some new
hooks. Shouldn't those be three separate threads?

Originally they felt more separate to me, too; but as I worked on them
it seemed better to consider them as a patch series. Whatever is easier
for reviewers works for me, though.

I think the multiple ICU libraries work already has a separate thread;
how does this relate to that work?

Multilib ICU support adds complexity, and my hope is that this patch
set cleans up and organizes things to better prepare for that
complexity.

I don't know what the hooks are supposed to be for?

I found them very useful for testing during development. One of the
patches adds a test module for the ICU hook, and I think that's a
valuable place to test regardless of whether any other extension uses
the hook. Also, if proper multilib support doesn't land in 16, then the
hooks could be a way to build rudimentary multilib support (or at least
some kind of ICU version lockdown) until it does land.

When Thomas's work is in place, I expect the hooks to change slightly.
The hooks are not meant to set any specific API in stone.
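
For readers unfamiliar with the pattern, here is a minimal sketch of how an extension could chain onto a hook of this kind. It is not the API from these patches: the hook type, the variable `demo_collator_hook`, and every function name below are hypothetical, and the "library" is reduced to a string label so the example stays self-contained.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical hook type: given a locale name, return a label for the
 * provider library to use, or NULL to decline and let core decide.
 */
typedef const char *(*demo_collator_hook_type) (const char *locale_name);

/* Hook variable owned by "core"; NULL means no extension is installed. */
demo_collator_hook_type demo_collator_hook = NULL;

/* Core side: consult the hook first, then fall back to the built-in library. */
const char *
demo_choose_library(const char *locale_name)
{
	if (demo_collator_hook != NULL)
	{
		const char *lib = demo_collator_hook(locale_name);

		if (lib != NULL)
			return lib;
	}
	return "builtin";
}

/* Extension side: remember any previously installed hook so it can be chained. */
demo_collator_hook_type demo_prev_hook = NULL;

const char *
demo_extension_hook(const char *locale_name)
{
	/* Pin one locale to a specific (made-up) library label. */
	if (strcmp(locale_name, "en-US") == 0)
		return "pinned-icu";

	/* Otherwise defer to the earlier hook, if there was one. */
	if (demo_prev_hook != NULL)
		return demo_prev_hook(locale_name);

	return NULL;
}

/* What an extension's load-time initialization would do. */
void
demo_install_hook(void)
{
	demo_prev_hook = demo_collator_hook;
	demo_collator_hook = demo_extension_hook;
}
```

Before `demo_install_hook()` runs, every locale resolves to the built-in library; afterwards only the locale the extension claims is redirected, which is roughly the "version lockdown" use described above.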

What other locale libraries are you thinking about using this way? How
can we assess whether these interfaces are sufficient for that?

I'm not considering any other locale libraries, nor did I see much
discussion of that.

The refactoring patches don't look convincing just by looking at the
numbers:

3 files changed, 406 insertions(+), 247 deletions(-)
6 files changed, 481 insertions(+), 150 deletions(-)
12 files changed, 400 insertions(+), 323 deletions(-)

The existing code is not great, in my opinion: it doesn't have clear
API boundaries, the comments are insufficient, and lots of special
cases need to be handled awkwardly by callers. That style is hard to
beat when it comes to the raw line count; but it's quite difficult to
understand and work on.

I think my changes are an improvement, but obviously that depends on
the opinion of others who are working in this part of the code. What do
you think?
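
As an illustration of one special case that moves out of the callers, the sketch below shows the tie-breaking rule described in the pg_strncoll() comment: for a deterministic collation, a result of 0 is broken first with memcmp() and then with the string length. The names `demo_primary_compare` and `demo_strncoll` are hypothetical, and the primary comparison is a case-insensitive ASCII stand-in rather than a real libc or ICU call.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for the provider's comparison: case-insensitive ASCII.
 * Returns <0, 0, or >0. (An assumption for illustration only.)
 */
int
demo_primary_compare(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t		n = alen < blen ? alen : blen;

	for (size_t i = 0; i < n; i++)
	{
		int			ca = tolower((unsigned char) a[i]);
		int			cb = tolower((unsigned char) b[i]);

		if (ca != cb)
			return ca < cb ? -1 : 1;
	}
	if (alen != blen)
		return alen < blen ? -1 : 1;
	return 0;
}

/*
 * Length-delimited comparison with the tie-break applied in one place,
 * so callers no longer have to repeat it.
 */
int
demo_strncoll(const char *a, size_t alen, const char *b, size_t blen,
			  bool deterministic)
{
	int			result = demo_primary_compare(a, alen, b, blen);

	/* Break tie if necessary: bytewise first, then by length. */
	if (result == 0 && deterministic)
	{
		size_t		n = alen < blen ? alen : blen;

		result = memcmp(a, b, n);
		if (result == 0 && alen != blen)
			result = (alen < blen) ? -1 : 1;
	}

	return result;
}
```

With this stand-in, "abc" and "ABC" compare equal when `deterministic` is false, but compare unequal when it is true because the bytes differ. The length check cannot fire with this particular stand-in; it is kept because a real collation can report equality for strings of different lengths.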

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Peter Eisentraut

peter.eisentraut@enterprisedb.com

about 3 years ago

In reply to: Jeff Davis (#7)

Re: Rework of collation code, extensibility

On 06.01.23 08:04, Jeff Davis wrote:

The existing code is not great, in my opinion: it doesn't have clear
API boundaries, the comments are insufficient, and lots of special
cases need to be handled awkwardly by callers. That style is hard to
beat when it comes to the raw line count; but it's quite difficult to
understand and work on.

I think my changes are an improvement, but obviously that depends on
the opinion of others who are working in this part of the code. What do
you think?

I think the refactoring that you proposed in the thread "Refactor to
introduce pg_strcoll()." was on a sensible track. Maybe we should try
to get that done. The multiple-ICU stuff is still experimental and has
its own rather impressive thread, so I don't think it's sensible to try
to sort that out here.
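
For context, the core of that refactoring is the buffer handling in pg_strncoll_libc(): both arguments are NUL-terminated into a single buffer that lives on the stack when it fits and is heap-allocated otherwise. The sketch below mirrors that pattern outside the backend, under stated assumptions: malloc()/free() stand in for palloc()/pfree(), plain strcoll() stands in for the locale-aware call, and `DEMO_BUFLEN` and `demo_strncoll_libc` are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Large enough for most strings, small enough to keep on the stack. */
#define DEMO_BUFLEN 1024

/*
 * Compare two length-delimited strings with strcoll(), which requires
 * NUL-terminated input.
 */
int
demo_strncoll_libc(const char *arg1, size_t len1,
				   const char *arg2, size_t len2)
{
	char		sbuf[DEMO_BUFLEN];
	char	   *buf = sbuf;
	size_t		bufsize1 = len1 + 1;
	size_t		bufsize2 = len2 + 1;
	char	   *arg1n;
	char	   *arg2n;
	int			result;

	/* One allocation covers both copies when the stack buffer is too small. */
	if (bufsize1 + bufsize2 > DEMO_BUFLEN)
	{
		buf = malloc(bufsize1 + bufsize2);
		if (buf == NULL)
			abort();
	}

	arg1n = buf;
	arg2n = buf + bufsize1;

	/* NUL-terminate the arguments. */
	memcpy(arg1n, arg1, len1);
	arg1n[len1] = '\0';
	memcpy(arg2n, arg2, len2);
	arg2n[len2] = '\0';

	result = strcoll(arg1n, arg2n);

	if (buf != sbuf)
		free(buf);

	return result;
}
```

A caller that already holds NUL-terminated strings would skip this copy entirely, which is why the patch exposes both pg_strcoll() and pg_strncoll().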

vignesh C

vignesh21@gmail.com

about 3 years ago

In reply to: Jeff Davis (#5)

Re: Rework of collation code, extensibility

On Thu, 22 Dec 2022 at 11:11, Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote:

Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2] in
a new thread given its new scope.

Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

The libc hook support is still experimental, but what's working is
passing in CI, even on windows. The challenges with libc hook support
are:

* It obviously doesn't replace all of libc, so the separation is not
as clean and there are a number of callers throughout the code that
don't necessarily care about specific collations.

* libc relies on setlocale() / uselocale(), which is global state and
not as easy to track.

* More platform issues (obviously) and harder to test.

The patch does not apply on top of HEAD, as shown in [1]; please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
c971a5b27ac946e7c94f7f655d321279512c7ee7 ===
=== applying patch ./v5-0003-Refactor-pg_locale_t-routines.patch
....
Hunk #1 FAILED at 88.
...
1 out of 9 hunks FAILED -- saving rejects to file
src/backend/utils/adt/formatting.c.rej
patching file src/backend/utils/adt/like.c
Hunk #1 FAILED at 24.
Hunk #2 succeeded at 97 (offset 1 line).
1 out of 2 hunks FAILED -- saving rejects to file
src/backend/utils/adt/like.c.rej

[1]: http://cfbot.cputube.org/patch_41_4058.log

Regards,
Vignesh

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Jeff Davis (#5)

7 attachment(s)

Re: Rework of collation code, extensibility

On Wed, 2022-12-21 at 21:40 -0800, Jeff Davis wrote:

Here's version 5. There are a number of fixes, and better tests, and
it's passing in CI.

Attached trivial rebase as v6.

The libc hook support is still experimental

Patches 0006 and 0007 should still be considered experimental and don't
require review right now.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v6-0001-Add-pg_strcoll-and-pg_strncoll.patch (text/x-patch; charset=UTF-8)

From 2f9fe947dd1ea2b8c798e742f2e67be0496e7f05 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v6 1/7] Add pg_strcoll() and pg_strncoll().

Callers with NUL-terminated strings should call the former; callers
with strings and their length should call the latter.
---
 src/backend/utils/adt/pg_locale.c | 420 ++++++++++++++++++++++++++++--
 src/backend/utils/adt/varlena.c   | 230 +---------------
 src/include/utils/pg_locale.h     |   3 +
 3 files changed, 406 insertions(+), 247 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 059e4fd79f..5ff1dcb1ba 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -123,6 +129,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1731,15 +1750,356 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * pg_strncoll_libc_win32_utf8
+ *
+ * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
+ * invoke wcscoll() or wcscoll_l().
+ */
+#ifdef WIN32
+static int
+pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
+							size_t len2, pg_locale_t locale)
+{
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
+	char	   *a1p,
+			   *a2p;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
+	int			r;
+	int			result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+#ifndef WIN32
+	Assert(false);
+#endif
+
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * pg_strcoll_libc
+ *
+ * Call strcoll(), strcoll_l(), wcscoll(), or wcscoll_l() as appropriate for
+ * the given locale, platform, and database encoding. If the locale is NULL,
+ * use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ */
+static int
+pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+#ifdef WIN32
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_libc
+ *
+ * Null-terminate the arguments and call pg_strcoll_libc().
+ */
+static int
+pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	  = sbuf;
+	size_t	 bufsize1 = len1 + 1;
+	size_t	 bufsize2 = len2 + 1;
+	char	*arg1n;
+	char	*arg2n;
+	int		 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef WIN32
+	/* check for this case before doing the work for nul-termination */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+#endif							/* WIN32 */
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	arg1n = buf;
+	arg2n = buf + bufsize1;
+
+	/* nul-terminate arguments */
+	memcpy(arg1n, arg1, len1);
+	arg1n[len1] = '\0';
+	memcpy(arg2n, arg2, len2);
+	arg2n[len2] = '\0';
+
+	result = pg_strcoll_libc(arg1n, arg2n, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
 
 #ifdef USE_ICU
+
 /*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
+ * pg_strncoll_icu_no_utf8
+ *
+ * Convert the arguments from the database encoding to UChar strings, then
+ * call ucol_strcoll().
+ *
+ * When the database encoding is UTF-8, and ICU supports ucol_strcollUTF8(),
+ * caller should call that instead.
  */
-static UConverter *icu_converter = NULL;
+static int
+pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
+						const char *arg2, size_t len2, pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+#endif
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
 
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_icu
+ *
+ * Call ucol_strcollUTF8() or ucol_strcoll() as appropriate for the given
+ * database encoding.
+ *
+ * Arguments must be encoded in the database encoding.
+ */
+static int
+pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				pg_locale_t locale)
+{
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		result = pg_strncoll_icu_no_utf8(arg1, len1, arg2, len2, locale);
+	}
+
+	return result;
+}
+
+/*
+ * pg_strcoll_icu
+ *
+ * Call pg_strncoll_icu() with length -1, which ICU treats as nul-terminated.
+ */
+static int
+pg_strcoll_icu(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strncoll_icu(arg1, -1, arg2, -1, locale);
+}
+
+#endif							/* USE_ICU */
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * If the collation is deterministic, break ties with strcmp().
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strcoll_libc(arg1, arg2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strcoll_icu(arg1, arg2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+		result = strcmp(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * If the collation is deterministic, break ties with memcmp(), and then with
+ * the string length.
+ *
+ * This function may need to nul-terminate the arguments for libc functions;
+ * so if the caller already has nul-terminated strings, it should call
+ * pg_strcoll() instead.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strncoll_libc(arg1, len1, arg2, len2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strncoll_icu(arg1, len1, arg2, len2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	/* Break tie if necessary. */
+	if (result == 0 && (!locale || locale->deterministic))
+	{
+		result = memcmp(arg1, arg2, Min(len1, len2));
+		if ((result == 0) && (len1 != len2))
+			result = (len1 < len2) ? -1 : 1;
+	}
+
+	return result;
+}
+
+
+#ifdef USE_ICU
 static void
 init_icu_converter(void)
 {
@@ -1767,6 +2127,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -1782,26 +2175,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 33ffdb013a..8f64a3c905 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1537,10 +1537,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1557,171 +1553,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
-
-		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 	}
 
 	return result;
@@ -2379,65 +2211,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
-
-	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
-		result = strcmp(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
 	sss->cache_blob = false;
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index cede43440b..40c0d1e3a0 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -100,6 +100,9 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1

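For readers skimming the patch above, the libc path of pg_strncoll() can be
sketched outside the backend roughly as follows. This is only an illustration
under assumptions: strncoll_sketch() and SKETCH_BUFLEN are hypothetical names,
malloc()/free() stand in for palloc()/pfree(), only the default locale is
handled (no locale_t, no WIN32 branch), and the tie-break is applied
unconditionally as it would be for a deterministic collation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUFLEN 1024		/* stand-in for TEXTBUFLEN; value assumed */

/*
 * Hypothetical stand-alone analogue of pg_strncoll() for the default libc
 * locale: nul-terminate both inputs in one buffer, call strcoll(), then
 * break ties bytewise.
 */
static int
strncoll_sketch(const char *arg1, size_t len1, const char *arg2, size_t len2)
{
	char		sbuf[SKETCH_BUFLEN];
	char	   *buf = sbuf;
	size_t		bufsize1 = len1 + 1;
	size_t		bufsize2 = len2 + 1;
	char	   *arg1n;
	char	   *arg2n;
	int			result;

	assert(arg1 != NULL && arg2 != NULL);

	/* use the heap only when both copies do not fit in the stack buffer */
	if (bufsize1 + bufsize2 > SKETCH_BUFLEN)
	{
		buf = malloc(bufsize1 + bufsize2);
		if (buf == NULL)
			abort();
	}

	/* both nul-terminated copies share a single allocation */
	arg1n = buf;
	arg2n = buf + bufsize1;
	memcpy(arg1n, arg1, len1);
	arg1n[len1] = '\0';
	memcpy(arg2n, arg2, len2);
	arg2n[len2] = '\0';

	result = strcoll(arg1n, arg2n);

	if (buf != sbuf)
		free(buf);

	/* tie-break: compare the common prefix, then the lengths */
	if (result == 0)
	{
		size_t		minlen = (len1 < len2) ? len1 : len2;

		result = memcmp(arg1, arg2, minlen);
		if (result == 0 && len1 != len2)
			result = (len1 < len2) ? -1 : 1;
	}

	return result;
}
```

Sharing one buffer for both copies means at most one allocation and one free
per comparison, which is the same trade-off pg_strncoll_libc() makes.
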
v6-0007-Add-test-module-for-libc-collation-provider-hook.patch (text/x-patch; charset=UTF-8)

From 286515ab17d5d872a4177cbee49f2c289abdcfbb Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 21 Dec 2022 19:52:23 -0800
Subject: [PATCH v6 7/7] Add test module for libc collation provider hook.

---
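
A note on the Windows version strings that 004_libc_version.pl expects
('3.14,3.14' and '6.28,6.28'): they follow from the values that
test_GetNLSVersionEx() stores, if the backend formats dwNLSVersion and
dwDefinedVersion as "major.minor" pairs split at bit 8. The sketch below shows
that arithmetic. The function name nls_version_sketch() is hypothetical, and
the format string is an assumption inferred from the expected test output
rather than copied from the backend.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>				/* strcmp(), used when checking the result */

/*
 * Hypothetical helper: render two packed NLS version numbers as
 * "major.minor,major.minor".  Returns a pointer to a static buffer.
 */
static const char *
nls_version_sketch(unsigned long nls_version, unsigned long defined_version)
{
	static char buf[64];
	int			n;

	n = snprintf(buf, sizeof(buf), "%lu.%lu,%lu.%lu",
				 (nls_version >> 8) & 0xFFFF, nls_version & 0xFF,
				 (defined_version >> 8) & 0xFFFF, defined_version & 0xFF);
	assert(n > 0 && (size_t) n < sizeof(buf));

	return buf;
}
```

With (3 << 8) | 14 for both arguments this yields "3.14,3.14", and with
(6 << 8) | 28 it yields "6.28,6.28", matching the non-DESC and DESC cases in
the test.
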
 .../modules/test_collation_lib_hooks/Makefile |  10 +-
 .../test_collation_lib_hooks/meson.build      |  10 +
 .../test_collation_lib_hooks/t/002_libc.pl    | 157 +++++
 .../t/003_libc_mixed_collation.pl             |  66 +++
 .../t/004_libc_version.pl                     |  66 +++
 .../test_collation_lib_hooks.c                |   8 +-
 .../test_collation_lib_hooks.h                |   4 +
 .../test_collation_lib_hooks/test_libc_hook.c | 555 ++++++++++++++++++
 8 files changed, 874 insertions(+), 2 deletions(-)
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/002_libc.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/003_libc_mixed_collation.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/004_libc_version.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_libc_hook.c

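To make the intent of the "DESC" test locale concrete, here is a minimal
stand-alone sketch of a comparison that behaves like the "C" locale unless the
locale name is "DESC", in which case the order is reversed. locale_is_reverse()
mirrors the helper in the patch; strcoll_sketch() is a hypothetical name and is
not the module's actual strcoll replacement.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* same check as locale_is_reverse() in test_libc_hook.c */
static bool
locale_is_reverse(const char *locale)
{
	return strcmp(locale, "DESC") == 0;
}

/*
 * Hypothetical comparator: bytewise order for every locale name except
 * "DESC", which reverses it.
 */
static int
strcoll_sketch(const char *locale, const char *s1, const char *s2)
{
	int			result;

	assert(locale != NULL);

	result = strcmp(s1, s2);

	/* map to -1/0/1 instead of negating, so INT_MIN cannot overflow */
	if (locale_is_reverse(locale))
		result = (result > 0) ? -1 : (result < 0) ? 1 : 0;

	return result;
}
```

Under this sketch 'aBcD' sorts before 'wXyZ' for "ASC" and after it for
"DESC", which is what the comparison tests in 002_libc.pl check.
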
diff --git a/src/test/modules/test_collation_lib_hooks/Makefile b/src/test/modules/test_collation_lib_hooks/Makefile
index 05948e555a..a800c9ea74 100644
--- a/src/test/modules/test_collation_lib_hooks/Makefile
+++ b/src/test/modules/test_collation_lib_hooks/Makefile
@@ -3,7 +3,7 @@
 MODULE_big = test_collation_lib_hooks
 OBJS = \
 	$(WIN32RES) \
-	test_collation_lib_hooks.o test_icu_hook.o
+	test_collation_lib_hooks.o test_icu_hook.o test_libc_hook.o
 PGFILEDESC = "test_collation_lib_hooks - test collation provider library hooks"
 
 EXTENSION = test_collation_lib_hooks
@@ -22,3 +22,11 @@ include $(top_srcdir)/contrib/contrib-global.mk
 endif
 
 export with_icu
+
+# TODO set environment variables:
+#  determine if built with glibc
+#  determine if LC_VERSION_MASK is defined
+
+ifeq ($(PORTNAME), win32)
+	export win32 = yes
+endif
diff --git a/src/test/modules/test_collation_lib_hooks/meson.build b/src/test/modules/test_collation_lib_hooks/meson.build
index 56b32b6cd1..a588751cc1 100644
--- a/src/test/modules/test_collation_lib_hooks/meson.build
+++ b/src/test/modules/test_collation_lib_hooks/meson.build
@@ -2,6 +2,7 @@
 
 test_collation_lib_hooks_sources = files(
   'test_collation_lib_hooks.c',
+  'test_libc_hook.c',
   'test_icu_hook.c',
 )
 
@@ -22,6 +23,9 @@ install_data(
   kwargs: contrib_data_args,
 )
 
+have_lc_version_mask = cc.has_header_symbol('locale.h', 'LC_VERSION_MASK')
+glibc = cc.has_header_symbol('features.h', '__GLIBC__')
+
 tests += {
   'name': 'test_collation_lib_hooks',
   'sd': meson.current_source_dir(),
@@ -29,9 +33,15 @@ tests += {
   'tap': {
     'tests': [
       't/001_icu.pl',
+      't/002_libc.pl',
+      't/003_libc_mixed_collation.pl',
+      't/004_libc_version.pl',
     ],
     'env': {
       'with_icu': icu.found() ? 'yes' : 'no',
+      'win32': (host_system == 'windows') ? 'yes' : 'no',
+      'glibc': glibc ? 'yes' : 'no',
+      'have_lc_version_mask': have_lc_version_mask ? 'yes' : 'no',
     },
   },
 }
diff --git a/src/test/modules/test_collation_lib_hooks/t/002_libc.pl b/src/test/modules/test_collation_lib_hooks/t/002_libc.pl
new file mode 100644
index 0000000000..e43a0916d3
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/002_libc.pl
@@ -0,0 +1,157 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+# setup
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_asc (PROVIDER=libc, LOCALE='ASC')]);
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_desc (PROVIDER=libc, LOCALE='DESC')]);
+
+$node->safe_psql('postgres', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('postgres',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+my $sort_asc_expected = "aBcD
+fGhI
+wXyZ";
+
+my $sort_desc_expected = "wXyZ
+fGhI
+aBcD";
+
+# test comparison
+
+my $comparison_asc =
+  $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_asc < 'wXyZ' COLLATE test_asc]);
+is($comparison_asc, 't',
+	'correct comparison'
+);
+
+# test desc comparison
+
+my $comparison_desc =
+  $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_desc < 'wXyZ' COLLATE test_desc]);
+is($comparison_desc, 'f',
+	'correct desc comparison'
+);
+
+# test asc sort with trust_strxfrm = false
+
+my $sort_asc =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = false;
+         SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($sort_asc, $sort_asc_expected,
+	'correct ascending sort (trust_strxfrm = false)'
+);
+
+# test desc sort with trust_strxfrm = false
+
+my $sort_desc =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = false;
+         SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($sort_desc, $sort_desc_expected,
+	'correct descending sort (trust_strxfrm = false)'
+);
+
+# test asc sort with trust_strxfrm = true
+
+my $strxfrm_asc =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = true;
+         SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($strxfrm_asc, $sort_asc_expected,
+	'correct ascending sort (trust_strxfrm = true)'
+);
+
+# test desc sort with trust_strxfrm = true
+
+my $strxfrm_desc =
+  $node->safe_psql('postgres',
+	  qq[SET trust_strxfrm = true;
+         SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($strxfrm_desc, $sort_desc_expected,
+	'correct descending sort (trust_strxfrm = true)'
+);
+
+# test lower/upper
+
+my $tcase =
+  $node->safe_psql('postgres',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_asc),
+                upper('aBcDfgHiwXyZ' collate test_asc)]);
+is($tcase, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+	'correct lowercase and uppercase'
+);
+
+# test desc lower/upper
+
+my $tcase_desc =
+  $node->safe_psql('postgres',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_desc),
+                upper('aBcDfgHiwXyZ' collate test_desc)]);
+is($tcase_desc, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct desc lowercase and uppercase'
+);
+
+if (!defined($ENV{win32}) || $ENV{win32} ne 'yes') {
+  $node->safe_psql('postgres',
+     qq[CREATE COLLATION test_mixed_asc_desc
+         (PROVIDER=libc, LC_COLLATE='ASC', LC_CTYPE='DESC')]);
+  $node->safe_psql('postgres',
+     qq[CREATE COLLATION test_mixed_desc_asc
+         (PROVIDER=libc, LC_COLLATE='DESC', LC_CTYPE='ASC')]);
+
+  my $mcomparison_asc =
+    $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_mixed_asc_desc <
+                'wXyZ' COLLATE test_mixed_asc_desc]);
+  is($mcomparison_asc, 't',
+	'correct mixed asc/desc comparison'
+  );
+
+  my $mcomparison_desc =
+    $node->safe_psql('postgres',
+	  qq[SELECT 'aBcD' COLLATE test_mixed_desc_asc <
+                'wXyZ' COLLATE test_mixed_desc_asc]);
+  is($mcomparison_desc, 'f',
+	'correct mixed desc/asc comparison'
+  );
+
+  my $mcase_asc =
+    $node->safe_psql('postgres',
+	    qq[SELECT lower('aBcDfgHiwXyZ' collate test_mixed_asc_desc),
+                  upper('aBcDfgHiwXyZ' collate test_mixed_asc_desc)]);
+  is($mcase_asc, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+    'correct case mixed asc/desc'
+  );
+
+  my $mcase_desc =
+    $node->safe_psql('postgres',
+	    qq[SELECT lower('aBcDfgHiwXyZ' collate test_mixed_desc_asc),
+                  upper('aBcDfgHiwXyZ' collate test_mixed_desc_asc)]);
+  is($mcase_desc, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+    'correct case mixed desc/asc'
+  );
+}
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/t/003_libc_mixed_collation.pl b/src/test/modules/test_collation_lib_hooks/t/003_libc_mixed_collation.pl
new file mode 100644
index 0000000000..2ef18f797c
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/003_libc_mixed_collation.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test mixed collations with differing lc_collate/lc_ctype
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if (defined($ENV{win32}) && $ENV{win32} eq 'yes')
+{
+	plan skip_all => 'windows does not support mixed libc collations';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+$node->safe_psql('postgres',
+   qq[CREATE COLLATION test_mixed_asc_desc
+         (PROVIDER=libc, LC_COLLATE='ASC', LC_CTYPE='DESC')]);
+$node->safe_psql('postgres',
+   qq[CREATE COLLATION test_mixed_desc_asc
+         (PROVIDER=libc, LC_COLLATE='DESC', LC_CTYPE='ASC')]);
+
+my $mcomparison_asc =
+  $node->safe_psql('postgres',
+    qq[SELECT 'aBcD' COLLATE test_mixed_asc_desc <
+         'wXyZ' COLLATE test_mixed_asc_desc]);
+is($mcomparison_asc, 't',
+  'correct mixed asc/desc comparison'
+);
+
+my $mcomparison_desc =
+  $node->safe_psql('postgres',
+    qq[SELECT 'aBcD' COLLATE test_mixed_desc_asc <
+              'wXyZ' COLLATE test_mixed_desc_asc]);
+is($mcomparison_desc, 'f',
+'correct mixed desc/asc comparison'
+);
+
+my $mcase_asc =
+  $node->safe_psql('postgres',
+    qq[SELECT lower('aBcDfgHiwXyZ' collate test_mixed_asc_desc),
+              upper('aBcDfgHiwXyZ' collate test_mixed_asc_desc)]);
+is($mcase_asc, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+  'correct case mixed asc/desc'
+);
+
+my $mcase_desc =
+  $node->safe_psql('postgres',
+    qq[SELECT lower('aBcDfgHiwXyZ' collate test_mixed_desc_asc),
+              upper('aBcDfgHiwXyZ' collate test_mixed_desc_asc)]);
+is($mcase_desc, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+  'correct case mixed desc/asc'
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/t/004_libc_version.pl b/src/test/modules/test_collation_lib_hooks/t/004_libc_version.pl
new file mode 100644
index 0000000000..201fa98e6b
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/004_libc_version.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+# Test collation versions (platform-specific)
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $version_asc_expected;
+my $version_desc_expected;
+
+if (defined($ENV{glibc}) && $ENV{glibc} eq 'yes') {
+	$version_asc_expected = '3.14159';
+	$version_desc_expected = '3.14159';
+} elsif (defined($ENV{have_lc_version_mask}) && $ENV{have_lc_version_mask} eq 'yes') {
+	$version_asc_expected = '3.14';
+	$version_desc_expected = '6.28';
+} elsif (defined($ENV{win32}) && $ENV{win32} eq 'yes') {
+	$version_asc_expected = '3.14,3.14';
+	$version_desc_expected = '6.28,6.28';
+} else {
+	plan skip_all => 'platform does not support libc collation versions';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+# setup
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_asc (PROVIDER=libc, LOCALE='ASC')]);
+$node->safe_psql('postgres',
+	qq[CREATE COLLATION test_desc (PROVIDER=libc, LOCALE='DESC')]);
+
+$node->safe_psql('postgres', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('postgres',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+# check versions
+
+my $pg_version = $node->safe_psql('postgres', qq[SELECT version()]);
+
+my $version_asc =
+  $node->safe_psql('postgres',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_asc']);
+is($version_asc, $version_asc_expected,
+	"collation test_asc has correct version $version_asc_expected"
+);
+
+my $version_desc =
+  $node->safe_psql('postgres',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_desc']);
+is($version_desc, $version_desc_expected,
+	"collation test_desc has correct version $version_desc_expected"
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
index 599ec61239..5650840b4d 100644
--- a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
@@ -9,7 +9,7 @@
  * IDENTIFICATION
  *		src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
  *
- * Test implementation of icu-like collation provider.
+ * Test implementations of libc-like and icu-like collation providers.
  *
  * -------------------------------------------------------------------------
  */
@@ -21,6 +21,7 @@
 
 #include "test_collation_lib_hooks.h"
 
+static get_libc_library_hook_type prev_get_libc_library_hook = NULL;
 #ifdef USE_ICU
 static get_icu_library_hook_type prev_get_icu_library_hook = NULL;
 #endif
@@ -36,8 +37,13 @@ _PG_init(void)
 	if (!process_shared_preload_libraries_in_progress)
 		ereport(ERROR, (errmsg("test_collation_lib_hooks must be loaded via shared_preload_libraries")));
 
+	prev_get_libc_library_hook = get_libc_library_hook;
+	get_libc_library_hook = test_get_libc_library;
+
 #ifdef USE_ICU
 	prev_get_icu_library_hook = get_icu_library_hook;
 	get_icu_library_hook = test_get_icu_library;
 #endif
+
+	init_libc_hook();
 }
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
index e6ee457ab3..94ea943b97 100644
--- a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
@@ -20,6 +20,10 @@
 #include "utils/pg_locale.h"
 #include "utils/pg_locale_internal.h"
 
+extern void init_libc_hook(void);
+extern pg_libc_library *test_get_libc_library(const char *collate,
+											  const char *ctype,
+											  const char *version);
 #ifdef USE_ICU
 extern pg_icu_library *test_get_icu_library(const char *locale,
 											const char *version);
diff --git a/src/test/modules/test_collation_lib_hooks/test_libc_hook.c b/src/test/modules/test_collation_lib_hooks/test_libc_hook.c
new file mode 100644
index 0000000000..a8eb5d8dd2
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_libc_hook.c
@@ -0,0 +1,555 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_libc_hook.c
+ *		Code for testing collation provider libc hook.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_libc_hook.c
+ *
+ * Implements a custom libc-like collation provider library for testing the
+ * hooks. It accepts any collation name requested. All behave exactly like the
+ * "C" locale, except for the locale named "DESC", which reverses the sort
+ * order and reverses uppercase/lowercase behavior.
+ *
+ * The reported version is platform-specific (3.14159 with glibc), so loading
+ * it will cause a version mismatch warning.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "test_collation_lib_hooks.h"
+
+#ifdef __GLIBC__
+#include <gnu/libc-version.h>
+#endif
+
+#ifdef WIN32
+#include <shlwapi.h>
+#endif
+
+#define LOCALE_NAME_LEN 64
+
+typedef struct test_locale_t
+{
+	bool reverse_sort;
+	bool reverse_case;
+	char lc_collate[LOCALE_NAME_LEN];
+	char lc_ctype[LOCALE_NAME_LEN];
+} test_locale_t;
+
+static pg_libc_library *test_libc_library = NULL;
+static test_locale_t current_setlocale = {
+	.lc_collate = "C",
+	.lc_ctype = "C",
+	.reverse_sort = false,
+	.reverse_case = false
+};
+
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+static test_locale_t *current_uselocale = &current_setlocale;
+#endif
+#endif
+
+#ifdef HAVE_LOCALE_T
+static locale_t c_locale_t = NULL;
+#endif
+
+void
+init_libc_hook(void)
+{
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	c_locale_t = newlocale(LC_ALL_MASK, "C", NULL);
+#else
+	c_locale_t = _create_locale(LC_ALL, "C");
+#endif
+#endif
+}
+
+#ifdef HAVE_LOCALE_T
+static test_locale_t *
+current_locale(void)
+{
+#ifndef WIN32
+	return current_uselocale;
+#else
+	return &current_setlocale;
+#endif
+}
+#endif
+
+static bool
+locale_is_reverse(const char *locale)
+{
+	if (strcmp(locale, "DESC") == 0)
+		return true;
+	else
+		return false;
+}
+
+static const char *
+test_libc_version(void)
+{
+	return "3.14159";
+}
+
+#ifdef WIN32
+bool
+test_GetNLSVersionEx(NLS_FUNCTION function, LPCWSTR lpLocaleName,
+					 LPNLSVERSIONINFOEX lpVersionInformation)
+{
+	Assert(function == COMPARE_STRING);
+	if (wcscmp(lpLocaleName, L"DESC") == 0)
+	{
+		lpVersionInformation->dwNLSVersion = (6 << 8) | 28;
+		lpVersionInformation->dwDefinedVersion = (6 << 8) | 28;
+	}
+	else
+	{
+		lpVersionInformation->dwNLSVersion = (3 << 8) | 14;
+		lpVersionInformation->dwDefinedVersion = (3 << 8) | 14;
+	}
+
+	return true;
+}
+#endif
+
+static char *
+test_setlocale(int category, const char *locale)
+{
+	Assert (category == LC_COLLATE || category == LC_CTYPE ||
+			category == LC_ALL);
+
+	if (category == LC_ALL)
+	{
+		if (locale)
+		{
+			current_setlocale.reverse_sort = locale_is_reverse(locale);
+			current_setlocale.reverse_case = locale_is_reverse(locale);
+			strncpy(current_setlocale.lc_collate, locale, LOCALE_NAME_LEN);
+			strncpy(current_setlocale.lc_ctype, locale, LOCALE_NAME_LEN);
+		}
+		return current_setlocale.lc_collate;
+	}
+	else if (category == LC_COLLATE)
+	{
+		if (locale)
+		{
+			current_setlocale.reverse_sort = locale_is_reverse(locale);
+			strlcpy(current_setlocale.lc_collate, locale, LOCALE_NAME_LEN);
+		}
+
+		return current_setlocale.lc_collate;
+	}
+	else if (category == LC_CTYPE)
+	{
+		if (locale)
+		{
+			current_setlocale.reverse_case = locale_is_reverse(locale);
+			strlcpy(current_setlocale.lc_ctype, locale, LOCALE_NAME_LEN);
+		}
+
+		return current_setlocale.lc_ctype;
+	}
+
+	return NULL;
+}
+
+#ifdef HAVE_LOCALE_T
+
+#ifndef WIN32
+
+static locale_t
+test_newlocale(int category, const char *locale, locale_t baselocale_t)
+{
+	test_locale_t *newloc;
+
+	Assert(baselocale_t != LC_GLOBAL_LOCALE);
+	Assert((test_locale_t *) baselocale_t != &current_setlocale);
+
+	if (baselocale_t == NULL)
+	{
+		newloc = MemoryContextAlloc(TopMemoryContext,
+									sizeof(test_locale_t));
+		strncpy(newloc->lc_collate, "C", LOCALE_NAME_LEN);
+		strncpy(newloc->lc_ctype, "C", LOCALE_NAME_LEN);
+		newloc->reverse_sort = false;
+		newloc->reverse_case = false;
+	}
+	else
+		newloc = (test_locale_t *) baselocale_t;
+
+	if ((category & LC_COLLATE_MASK) != 0)
+	{
+		newloc->reverse_sort = locale_is_reverse(locale);
+		strlcpy(newloc->lc_collate, locale, LOCALE_NAME_LEN);
+	}
+	if ((category & LC_CTYPE_MASK) != 0)
+	{
+		newloc->reverse_case = locale_is_reverse(locale);
+		strlcpy(newloc->lc_ctype, locale, LOCALE_NAME_LEN);
+	}
+
+	return (locale_t) newloc;
+}
+
+static void
+test_freelocale(locale_t loc)
+{
+	Assert(loc != LC_GLOBAL_LOCALE);
+	Assert((test_locale_t *)loc != &current_setlocale);
+	pfree(loc);
+}
+
+static locale_t
+test_uselocale(locale_t loc)
+{
+	test_locale_t *result = current_uselocale;
+
+	if (loc != NULL)
+	{
+		if (loc == LC_GLOBAL_LOCALE)
+			current_uselocale = &current_setlocale;
+		else
+			current_uselocale = (test_locale_t *) loc;
+	}
+
+	if (result == &current_setlocale)
+		return LC_GLOBAL_LOCALE;
+	else
+		return (locale_t) result;
+}
+
+#ifdef LC_VERSION_MASK
+static const char *
+test_querylocale(int mask, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *)locale;
+	Assert((mask & LC_VERSION_MASK) != 0);
+	if (testlocale->reverse_sort)
+		return "6.28";
+	else
+		return "3.14";
+}
+#endif			/* LC_VERSION_MASK */
+
+#else			/* WIN32 */
+static locale_t
+_test_create_locale(int category, const char *locale)
+{
+	test_locale_t *newloc;
+
+	newloc = MemoryContextAlloc(TopMemoryContext,
+								sizeof(test_locale_t));
+	strncpy(newloc->lc_collate, "C", LOCALE_NAME_LEN);
+	strncpy(newloc->lc_ctype, "C", LOCALE_NAME_LEN);
+	newloc->reverse_sort = false;
+	newloc->reverse_case = false;
+
+	if (category == LC_ALL || category == LC_COLLATE)
+	{
+		if (locale_is_reverse(locale))
+			newloc->reverse_sort = true;
+		else
+			newloc->reverse_sort = false;
+		strlcpy(newloc->lc_collate, locale, LOCALE_NAME_LEN);
+	}
+	if (category == LC_ALL || category == LC_CTYPE)
+	{
+		if (locale_is_reverse(locale))
+			newloc->reverse_case = true;
+		else
+			newloc->reverse_case = false;
+		strlcpy(newloc->lc_ctype, locale, LOCALE_NAME_LEN);
+	}
+
+	return (locale_t) newloc;
+}
+#endif			/* WIN32 */
+
+#endif			/* HAVE_LOCALE_T */
+
+static size_t
+test_wcstombs(char *dest, const wchar_t *src, size_t n)
+{
+	return wcstombs(dest, src, n);
+}
+
+static size_t
+test_mbstowcs(wchar_t *dest, const char *src, size_t n)
+{
+	return mbstowcs(dest, src, n);
+}
+
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+static size_t
+test_wcstombs_l(char *dest, const wchar_t *src, size_t n, locale_t loc)
+{
+	return wcstombs(dest, src, n);
+}
+#endif
+#ifdef HAVE_MBSTOWCS_L
+static size_t
+test_mbstowcs_l(wchar_t *dest, const char *src, size_t n, locale_t loc)
+{
+	return mbstowcs(dest, src, n);
+}
+#endif
+#endif
+
+static int
+test_strcoll_internal(const char *s1, const char *s2, bool reverse)
+{
+	int ret = strcmp(s1, s2);
+	return reverse ? -ret : ret;
+}
+
+static int
+test_strcoll(const char *s1, const char *s2)
+{
+	bool reverse = current_locale()->reverse_sort;
+	return test_strcoll_internal(s1, s2, reverse);
+}
+
+static int
+test_wcscoll_internal(const wchar_t *ws1, const wchar_t *ws2, bool reverse)
+{
+	int ret = wcscmp(ws1, ws2);
+	return reverse ? -ret : ret;
+}
+static int
+test_wcscoll(const wchar_t *ws1, const wchar_t *ws2)
+{
+	bool reverse = current_locale()->reverse_sort;
+	return test_wcscoll_internal(ws1, ws2, reverse);
+}
+
+static size_t
+test_strxfrm_internal(char *s1, const char *s2, size_t n, bool reverse)
+{
+	size_t			 result_size = strlen(s2) + 1;
+
+	if (n > result_size)
+	{
+		strncpy(s1, s2, n);
+		s1[result_size] = '\0';
+
+		if (reverse)
+		{
+			unsigned char *dest = (unsigned char *)s1;
+			for (int i = 0; i < result_size; i++)
+				dest[i] ^= (unsigned char) 0xFF;
+		}
+	}
+
+	return result_size;
+}
+
+static size_t
+test_strxfrm(char *s1, const char * s2, size_t n)
+{
+	bool reverse = current_locale()->reverse_sort;
+	return test_strxfrm_internal(s1, s2, n, reverse);
+}
+
+#ifdef HAVE_LOCALE_T
+static int
+test_strcoll_l(const char *s1, const char *s2, locale_t loc)
+{
+	test_locale_t *testlocale = (test_locale_t *)loc;
+	bool reverse = testlocale->reverse_sort;
+	return test_strcoll_internal(s1, s2, reverse);
+}
+
+static int
+test_wcscoll_l(const wchar_t *ws1, const wchar_t *ws2, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_sort;
+	return test_wcscoll_internal(ws1, ws2, reverse);
+}
+
+static size_t
+test_strxfrm_l(char *s1, const char * s2, size_t n, locale_t loc)
+{
+	test_locale_t *testlocale = (test_locale_t *) loc;
+	bool reverse = testlocale->reverse_sort;
+	return test_strxfrm_internal(s1, s2, n, reverse);
+}
+#endif			 /* HAVE_LOCALE_T */
+
+static int
+test_case_internal(int c, bool toupper)
+{
+	if (toupper && ('a' <= c && c <= 'z'))
+		return c - ('a' - 'A');
+	else if (!toupper && ('A' <= c && c <= 'Z'))
+		return c + ('a' - 'A');
+	else
+		return c;
+}
+
+static int
+test_tolower(int c)
+{
+	bool reverse = current_locale()->reverse_case;
+	return test_case_internal(c, reverse ? true : false);
+}
+
+static int
+test_toupper(int c)
+{
+	bool reverse = current_locale()->reverse_case;
+	return test_case_internal(c, reverse ? false : true);
+}
+
+static int
+test_iswalnum_internal(wint_t wc)
+{
+	if (('A' <= wc && wc <= 'Z') ||
+		('a' <= wc && wc <= 'z') ||
+		('0' <= wc && wc <= '9'))
+		return 1;
+	return 0;
+}
+
+static int
+test_iswalnum(wint_t wc)
+{
+	return test_iswalnum_internal(wc);
+}
+
+static wint_t
+test_wcase_internal(wint_t wc, bool toupper)
+{
+	if (toupper && ('a' <= wc && wc <= 'z'))
+		return wc - ('a' - 'A');
+	else if (!toupper && ('A' <= wc && wc <= 'Z'))
+		return wc + ('a' - 'A');
+	else
+		return wc;
+}
+
+static wint_t
+test_towlower(wint_t wc)
+{
+	bool reverse = current_locale()->reverse_case;
+	return test_wcase_internal(wc, reverse ? true : false);
+}
+
+static wint_t
+test_towupper(wint_t wc)
+{
+	bool reverse = current_locale()->reverse_case;
+	return test_wcase_internal(wc, reverse ? false : true);
+}
+
+#ifdef HAVE_LOCALE_T
+static int
+test_tolower_l(int c, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_case;
+	return test_case_internal(c, reverse ? true : false);
+}
+
+static int
+test_toupper_l(int c, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_case;
+	return test_case_internal(c, reverse ? false : true);
+}
+
+static int
+test_iswalnum_l(wint_t wc, locale_t locale)
+{
+	return test_iswalnum_internal(wc);
+}
+
+static wint_t
+test_towlower_l(wint_t wc, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_case;
+	return test_wcase_internal(wc, reverse ? true : false);
+}
+
+static wint_t
+test_towupper_l(wint_t wc, locale_t locale)
+{
+	test_locale_t *testlocale = (test_locale_t *) locale;
+	bool reverse = testlocale->reverse_case;
+	return test_wcase_internal(wc, reverse ? false : true);
+}
+#endif			 /* HAVE_LOCALE_T */
+
+pg_libc_library *
+test_get_libc_library(const char *collate, const char *ctype,
+					  const char *version)
+{
+	pg_libc_library *lib = NULL;
+
+	if (test_libc_library != NULL)
+		return test_libc_library;
+
+	ereport(LOG, (errmsg("loading custom libc provider for test_collation_lib_hooks")));
+
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_libc_library));
+#if defined(__GLIBC__)
+	lib->libc_version = test_libc_version;
+#elif defined(WIN32)
+	lib->GetNLSVersionEx = test_GetNLSVersionEx;
+#endif
+	lib->c_setlocale = test_setlocale;
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	lib->c_newlocale = test_newlocale;
+	lib->c_freelocale = test_freelocale;
+	lib->c_uselocale = test_uselocale;
+#ifdef LC_VERSION_MASK
+	lib->c_querylocale = test_querylocale;
+#endif
+#else
+	lib->_create_locale = _test_create_locale;
+#endif
+#endif
+	lib->c_wcstombs = test_wcstombs;
+	lib->c_mbstowcs = test_mbstowcs;
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	lib->c_wcstombs_l = test_wcstombs_l;
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	lib->c_mbstowcs_l = test_mbstowcs_l;
+#endif
+#endif
+	lib->c_strcoll = test_strcoll;
+	lib->c_wcscoll = test_wcscoll;
+	lib->c_strxfrm = test_strxfrm;
+#ifdef HAVE_LOCALE_T
+	lib->c_strcoll_l = test_strcoll_l;
+	lib->c_wcscoll_l = test_wcscoll_l;
+	lib->c_strxfrm_l = test_strxfrm_l;
+#endif
+	lib->c_tolower = test_tolower;
+	lib->c_toupper = test_toupper;
+	lib->c_iswalnum = test_iswalnum;
+	lib->c_towlower = test_towlower;
+	lib->c_towupper = test_towupper;
+#ifdef HAVE_LOCALE_T
+	lib->c_tolower_l = test_tolower_l;
+	lib->c_toupper_l = test_toupper_l;
+	lib->c_iswalnum_l = test_iswalnum_l;
+	lib->c_towlower_l = test_towlower_l;
+	lib->c_towupper_l = test_towupper_l;
+#endif
+
+	test_libc_library = lib;
+	return lib;
+}
-- 
2.34.1
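
A note on the test provider above: its "DESC" locale reverses ordering by negating strcmp() in the strcoll path, and by XORing every byte of the copy (including the terminating NUL) with 0xFF in the strxfrm path. The sketch below is a standalone approximation of that scheme, not code from the patch; the demo_* names and the 64-byte buffers are made up for illustration. It checks that comparing the transformed keys with strcmp() agrees with the reversed comparison, which is the property callers such as abbreviated-key sorting rely on. It assumes inputs contain no 0xFF bytes, since those would flip to NUL and truncate the key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Normalize a comparison result to -1, 0, or 1. */
static int
demo_sign(int v)
{
	return (v > 0) - (v < 0);
}

/* Reversed (or plain) byte-wise comparison, like test_strcoll_internal. */
static int
demo_strcoll(const char *s1, const char *s2, bool reverse)
{
	int			ret = demo_sign(strcmp(s1, s2));

	return reverse ? -ret : ret;
}

/*
 * Same idea as test_strxfrm_internal: copy the string with its terminator,
 * and for a reversed locale flip every copied byte. The flipped terminator
 * (0xFF) makes a shorter string sort after any longer string it prefixes.
 */
static size_t
demo_strxfrm(char *dest, const char *src, size_t n, bool reverse)
{
	size_t		result_size = strlen(src) + 1;

	if (n > result_size)
	{
		memcpy(dest, src, result_size);
		dest[result_size] = '\0';

		if (reverse)
		{
			unsigned char *p = (unsigned char *) dest;

			for (size_t i = 0; i < result_size; i++)
				p[i] ^= 0xFF;
		}
	}

	return result_size;
}

/* True if strcmp() on the transformed keys agrees with demo_strcoll(). */
static bool
demo_consistent(const char *a, const char *b, bool reverse)
{
	char		ka[64];
	char		kb[64];

	/* hypothetical fixed-size buffers; inputs must fit */
	assert(strlen(a) + 2 <= sizeof(ka) && strlen(b) + 2 <= sizeof(kb));

	demo_strxfrm(ka, a, sizeof(ka), reverse);
	demo_strxfrm(kb, b, sizeof(kb), reverse);

	return demo_sign(strcmp(ka, kb)) == demo_strcoll(a, b, reverse);
}
```

For instance, demo_consistent("ab", "abc", true) holds only because the terminator is flipped too; without that, the reversed key for "ab" would be a prefix of the key for "abc" and sort first.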

v6-0006-Support-multiple-libc-collation-provider-librarie.patch (text/x-patch; charset=UTF-8)

From da052ddb6ab9748f794f89e4fdf0433be4532da2 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 14 Dec 2022 15:12:35 -0800
Subject: [PATCH v6 6/7] Support multiple libc collation provider libraries.

Introduce structure pg_libc_library, which holds pointers to each
required libc method related to locale_t, encoding, collation, or
ctype. Store this as part of pg_locale_t. Each call to one of these
libc functions instead goes through this structure, so that it can
more easily be replaced by a non-builtin libc library.
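
To illustrate the indirection described above, here is a rough standalone sketch of a function-pointer table with two interchangeable providers. The struct is a cut-down stand-in with only two entries, and every demo_* name is hypothetical; the real pg_libc_library in this patch carries many more members behind platform #ifdefs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-entry stand-in for pg_libc_library. */
typedef struct demo_libc_library
{
	int			(*c_strcoll) (const char *s1, const char *s2);
	int			(*c_toupper) (int c);
} demo_libc_library;

/* Replacement provider: reversed byte-wise ordering. */
static int
demo_reverse_strcoll(const char *s1, const char *s2)
{
	int			ret = strcmp(s1, s2);

	return (ret > 0) ? -1 : (ret < 0) ? 1 : 0;
}

/* Replacement provider: ASCII-only upper-casing. */
static int
demo_ascii_toupper(int c)
{
	return ('a' <= c && c <= 'z') ? c - ('a' - 'A') : c;
}

/* "Builtin" table: the symbols the program was linked with. */
static const demo_libc_library demo_builtin = {strcoll, toupper};

/* "Replacement" table: same shape, different behavior. */
static const demo_libc_library demo_reverse = {
	demo_reverse_strcoll, demo_ascii_toupper
};

/*
 * A call site never names strcoll() directly; it goes through whichever
 * table is attached to the locale it was handed.
 */
static int
demo_compare(const demo_libc_library *lib, const char *a, const char *b)
{
	assert(lib != NULL && lib->c_strcoll != NULL);
	return lib->c_strcoll(a, b);
}
```

With this shape, demo_compare(&demo_builtin, "a", "b") and demo_compare(&demo_reverse, "a", "b") return results of opposite sign, and the call site is unchanged.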

Offer a hook to allow an extension to control which libc library
is loaded.
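
The hook contract (an unset hook, or one that returns NULL, falls back to the builtin library) can be sketched as follows. This is a simplified standalone model rather than the patch's code: the demo_* names, the name field, and the policy of taking over only the "DESC" locale are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, minimal library descriptor. */
typedef struct demo_library
{
	const char *name;
	int			(*c_strcoll) (const char *s1, const char *s2);
} demo_library;

typedef demo_library *(*demo_get_library_hook_type) (const char *collate,
													 const char *ctype,
													 const char *version);

/* Set by an extension at load time; NULL means "no extension installed". */
static demo_get_library_hook_type demo_get_library_hook = NULL;

static demo_library *
demo_get_builtin_library(void)
{
	static demo_library builtin = {"builtin", strcoll};

	return &builtin;
}

/* Ask the hook first; fall back to the builtin if it is unset or declines. */
static demo_library *
demo_get_library(const char *collate, const char *ctype, const char *version)
{
	demo_library *lib = NULL;

	if (demo_get_library_hook != NULL)
		lib = demo_get_library_hook(collate, ctype, version);

	if (lib == NULL)
		lib = demo_get_builtin_library();

	return lib;
}

static int
demo_reverse_strcoll(const char *s1, const char *s2)
{
	return -((strcmp(s1, s2) > 0) - (strcmp(s1, s2) < 0));
}

/* Hypothetical extension hook: handles "DESC", declines everything else. */
static demo_library *
demo_extension_hook(const char *collate, const char *ctype,
					const char *version)
{
	static demo_library custom = {"custom", demo_reverse_strcoll};

	(void) ctype;
	(void) version;

	if (collate != NULL && strcmp(collate, "DESC") == 0)
		return &custom;

	return NULL;
}
```

An extension would assign demo_get_library_hook = demo_extension_hook once at load time. Both tables here are static so they stay valid for the life of the process, mirroring the patch's requirement that the returned structure live in TopMemoryContext.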

Add a test module to exercise the hook.

TODO: loose ends in regex code and possibly other places where libc
functions are called directly.

Author: Jeff Davis
---
 src/backend/utils/adt/formatting.c     |  50 +++++--
 src/backend/utils/adt/pg_locale.c      | 199 +++++++++++++++++++------
 src/include/utils/pg_locale_internal.h |  85 ++++++++++-
 3 files changed, 274 insertions(+), 60 deletions(-)

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 40bfad075b..ab98f4a9ea 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -69,7 +69,6 @@
 #include <math.h>
 #include <float.h>
 #include <limits.h>
-#include <wctype.h>
 
 #ifdef USE_ICU
 #include <unicode/ustring.h>
@@ -1730,10 +1729,16 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						workspace[curr_char] = libc->c_towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+					}
 					else
 #endif
-						workspace[curr_char] = towlower(workspace[curr_char]);
+					{
+						pg_libc_library *libc = get_default_libc_library();
+						workspace[curr_char] = libc->c_towlower(workspace[curr_char]);
+					}
 				}
 
 				/*
@@ -1763,7 +1768,11 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						*p = libc->c_tolower_l((unsigned char) *p,
+											   mylocale->info.libc.lt);
+					}
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1853,10 +1862,16 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						workspace[curr_char] = libc->c_towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+					}
 					else
 #endif
-						workspace[curr_char] = towupper(workspace[curr_char]);
+					{
+						pg_libc_library *libc = get_default_libc_library();
+						workspace[curr_char] = libc->c_towupper(workspace[curr_char]);
+					}
 				}
 
 				/*
@@ -1886,7 +1901,11 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
+						*p = libc->c_toupper_l((unsigned char) *p,
+											   mylocale->info.libc.lt);
+					}
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1978,19 +1997,21 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
 					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
+							workspace[curr_char] = libc->c_towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+							workspace[curr_char] = libc->c_towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
 					{
+						pg_libc_library *libc = get_default_libc_library();
 						if (wasalnum)
-							workspace[curr_char] = towlower(workspace[curr_char]);
+							workspace[curr_char] = libc->c_towlower(workspace[curr_char]);
 						else
-							workspace[curr_char] = towupper(workspace[curr_char]);
+							workspace[curr_char] = libc->c_towupper(workspace[curr_char]);
 						wasalnum = iswalnum(workspace[curr_char]);
 					}
 				}
@@ -2023,10 +2044,13 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
 					{
+						pg_libc_library *libc = PG_LIBC_LIB(mylocale);
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
+							*p = libc->c_tolower_l((unsigned char) *p,
+												   mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+							*p = libc->c_toupper_l((unsigned char) *p,
+												   mylocale->info.libc.lt);
 						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 2c911fb111..458e97e712 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -115,19 +115,19 @@ char	   *localized_full_months[12 + 1];
 /*
  * The precise version of a collation provider library is important, because
  * subtle changes in collation between library versions can corrupt
- * indexes. This hook allows control over how collation provider libraries are
- * loaded.
+ * indexes. These hooks allow control over how collation provider libraries
+ * are loaded.
  *
  * If the hook is not set, or if it returns NULL, Postgres constructs the
- * pg_icu_library structure from the symbols Postgres is linked with at build
- * time.
+ * pg_libc_library or pg_icu_library structures from the symbols Postgres is
+ * linked with at build time.
  *
- * If the hook is set, it can instead construct the pg_icu_library structure
- * using custom logic. Ordinarily, this custom logic would involve finding a
- * specific known version of the collation provider library, and dynamically
- * loading the necessary symbols. If the collation version obtained from the
- * library does not match the collation version recorded in the catalog,
- * Postgres will issue a WARNING.
+ * If the hook is set, it can instead construct the pg_libc_library or
+ * pg_icu_library structures using custom logic. Ordinarily, this custom logic
+ * would involve finding a specific known version of the collation provider
+ * library, and dynamically loading the necessary symbols. If the collation
+ * version obtained from the library does not match the collation version
+ * recorded in the catalog, Postgres will issue a WARNING.
  *
  * The returned structure must be allocated in TopMemoryContext, and the
  * associated symbols must remain valid permanently. It's expected that the
@@ -139,10 +139,12 @@ char	   *localized_full_months[12 + 1];
  * consider what other information can be provided to allow for greater
  * control over which library is loaded.
  */
+get_libc_library_hook_type get_libc_library_hook = NULL;
 #ifdef USE_ICU
 get_icu_library_hook_type get_icu_library_hook = NULL;
 #endif
 
+static pg_libc_library *builtin_libc_library = NULL;
 #ifdef USE_ICU
 static pg_icu_library *builtin_icu_library = NULL;
 #endif
@@ -173,6 +175,7 @@ static char *IsoLocaleName(const char *);
  * Database default locale.
  */
 static pg_locale_t default_locale = NULL;
+static pg_libc_library *default_libc = NULL;
 #ifdef USE_ICU
 static pg_icu_library *default_icu = NULL;
 #endif
@@ -1392,7 +1395,7 @@ lc_collate_is_c(Oid collation)
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_COLLATE, NULL);
+		localeptr = default_libc->c_setlocale(LC_COLLATE, NULL);
 		if (!localeptr)
 			elog(ERROR, "invalid LC_COLLATE setting");
 
@@ -1445,7 +1448,7 @@ lc_ctype_is_c(Oid collation)
 
 		if (result >= 0)
 			return (bool) result;
-		localeptr = setlocale(LC_CTYPE, NULL);
+		localeptr = default_libc->c_setlocale(LC_CTYPE, NULL);
 		if (!localeptr)
 			elog(ERROR, "invalid LC_CTYPE setting");
 
@@ -1503,6 +1506,82 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+static pg_libc_library *
+get_builtin_libc_library(void)
+{
+	pg_libc_library *lib = NULL;
+
+	if (builtin_libc_library != NULL)
+		return builtin_libc_library;
+
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_libc_library));
+#if defined(__GLIBC__)
+	lib->libc_version = gnu_get_libc_version;
+#elif defined(WIN32)
+	lib->GetNLSVersionEx = GetNLSVersionEx;
+#endif
+	lib->c_setlocale = setlocale;
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	lib->c_newlocale = newlocale;
+	lib->c_freelocale = freelocale;
+	lib->c_uselocale = uselocale;
+#ifdef LC_VERSION_MASK
+	lib->c_querylocale = querylocale;
+#endif		/* LC_VERSION_MASK */
+#else
+	lib->_create_locale = _create_locale;
+#endif		/* WIN32 */
+#endif		/* HAVE_LOCALE_T */
+	lib->c_wcstombs = wcstombs;
+	lib->c_mbstowcs = mbstowcs;
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	lib->c_wcstombs_l = wcstombs_l;
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	lib->c_mbstowcs_l = mbstowcs_l;
+#endif
+#endif
+	lib->c_strcoll = strcoll;
+	lib->c_wcscoll = wcscoll;
+	lib->c_strxfrm = strxfrm;
+#ifdef HAVE_LOCALE_T
+	lib->c_strcoll_l = strcoll_l;
+	lib->c_wcscoll_l = wcscoll_l;
+	lib->c_strxfrm_l = strxfrm_l;
+#endif
+	lib->c_tolower = tolower;
+	lib->c_toupper = toupper;
+	lib->c_iswalnum = iswalnum;
+	lib->c_towlower = towlower;
+	lib->c_towupper = towupper;
+#ifdef HAVE_LOCALE_T
+	lib->c_tolower_l = tolower_l;
+	lib->c_toupper_l = toupper_l;
+	lib->c_iswalnum_l = iswalnum_l;
+	lib->c_towlower_l = towlower_l;
+	lib->c_towupper_l = towupper_l;
+#endif
+
+	builtin_libc_library = lib;
+	return lib;
+}
+
+static pg_libc_library *
+get_libc_library(const char *collate, const char *ctype, const char *version)
+{
+	pg_libc_library *lib = NULL;
+
+	if (get_libc_library_hook != NULL)
+		lib = get_libc_library_hook(collate, ctype, version);
+
+	if (!lib)
+		lib = get_builtin_libc_library();
+
+	return lib;
+}
+
 #ifdef USE_ICU
 static pg_icu_library *
 get_builtin_icu_library()
@@ -1614,6 +1693,7 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 	if (provider == COLLPROVIDER_LIBC)
 	{
 #ifdef HAVE_LOCALE_T
+		pg_libc_library *libc = get_libc_library(collate, ctype, version);
 		locale_t        loc;
 
 		/* newlocale's result may be leaked if we encounter an error */
@@ -1623,10 +1703,10 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 			/* Normal case where they're the same */
 			errno = 0;
 #ifndef WIN32
-			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+			loc = libc->c_newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
 							NULL);
 #else
-			loc = _create_locale(LC_ALL, collate);
+			loc = libc->_create_locale(LC_ALL, collate);
 #endif
 			if (!loc)
 				report_newlocale_failure(collate);
@@ -1638,11 +1718,11 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 			locale_t	loc1;
 
 			errno = 0;
-			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			loc1 = libc->c_newlocale(LC_COLLATE_MASK, collate, NULL);
 			if (!loc1)
 				report_newlocale_failure(collate);
 			errno = 0;
-			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			loc = libc->c_newlocale(LC_CTYPE_MASK, ctype, loc1);
 			if (!loc)
 				report_newlocale_failure(ctype);
 #else
@@ -1659,6 +1739,7 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 		}
 
 		result->info.libc.lt = loc;
+		result->info.libc.lib = libc;
 #else							/* not HAVE_LOCALE_T */
 		/* platform that doesn't support locale_t */
 		ereport(ERROR,
@@ -1710,6 +1791,12 @@ pg_locale_deterministic(pg_locale_t locale)
 		return locale->deterministic;
 }
 
+pg_libc_library *
+get_default_libc_library(void)
+{
+	return default_libc;
+}
+
 #ifdef USE_ICU
 pg_icu_library *
 get_default_icu_library()
@@ -1725,6 +1812,19 @@ void
 init_default_locale(char provider, const char *collate, const char *ctype,
 					const char *iculocale, const char *version)
 {
+	default_libc = get_libc_library(collate, ctype, version);
+
+	/*
+	 * If it's the builtin libc, pg_perm_setlocale() will handle
+	 * setlocale(). Otherwise, it must be done here so that the collate and
+	 * ctype take effect.
+	 */
+	if (default_libc != builtin_libc_library)
+	{
+		default_libc->c_setlocale(LC_COLLATE, collate);
+		default_libc->c_setlocale(LC_CTYPE, ctype);
+	}
+
 #ifdef USE_ICU
 	default_icu = get_icu_library(iculocale, version);
 #endif
@@ -1913,19 +2013,23 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 			pg_strncasecmp("C.", collcollate, 2) != 0 &&
 			pg_strcasecmp("POSIX", collcollate) != 0)
 	{
+		pg_libc_library *libc = get_libc_library(collcollate, NULL, NULL);
+
 #if defined(__GLIBC__)
 		/* Use the glibc version because we don't have anything better. */
-		collversion = pstrdup(gnu_get_libc_version());
+		if (libc->libc_version != NULL)
+			collversion = pstrdup(libc->libc_version());
 #elif defined(LC_VERSION_MASK)
 		locale_t	loc;
 
 		/* Look up FreeBSD collation version. */
-		loc = newlocale(LC_COLLATE, collcollate, NULL);
+		loc = libc->c_newlocale(LC_COLLATE, collcollate, NULL);
 		if (loc)
 		{
-			collversion =
-				pstrdup(querylocale(LC_COLLATE_MASK | LC_VERSION_MASK, loc));
-			freelocale(loc);
+			const char *query = libc->c_querylocale(
+				LC_COLLATE_MASK | LC_VERSION_MASK, loc);
+			collversion = pstrdup(query);
+			libc->c_freelocale(loc);
 		}
 		else
 			ereport(ERROR,
@@ -1941,7 +2045,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 		MultiByteToWideChar(CP_ACP, 0, collcollate, -1, wide_collcollate,
 							LOCALE_NAME_MAX_LENGTH);
-		if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
+		if (!libc->GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version))
 		{
 			/*
 			 * GetNLSVersionEx() wants a language tag such as "en-US", not a
@@ -2031,10 +2135,14 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
+	{
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+		result = libc->c_wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
+	}
 	else
 #endif
-		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+		result = default_libc->c_wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+
 	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
 								 * headers */
 		ereport(ERROR,
@@ -2060,7 +2168,6 @@ static int
 pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 {
 	int result;
-
 	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
 #ifdef WIN32
 	if (GetDatabaseEncoding() == PG_UTF8)
@@ -2074,14 +2181,16 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+
+		result = libc->c_strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
 #endif
 	}
 	else
-		result = strcoll(arg1, arg2);
+		result = default_libc->c_strcoll(arg1, arg2);
 
 	return result;
 }
@@ -2327,17 +2436,15 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 {
 	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
 
-#ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
+	{
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+		return libc->c_strxfrm_l(dest, src, destsize, locale->info.libc.lt);
+	}
 	else
 #endif
-		return strxfrm(dest, src, destsize);
-#else
-	/* shouldn't happen */
-	elog(ERROR, "unsupported collprovider: %c", locale->provider);
-#endif
+		return default_libc->c_strxfrm(dest, src, destsize);
 }
 
 static size_t
@@ -3010,21 +3117,23 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 	if (locale == (pg_locale_t) 0)
 	{
 		/* Use wcstombs directly for the default locale */
-		result = wcstombs(to, from, tolen);
+		result = default_libc->c_wcstombs(to, from, tolen);
 	}
 	else
 	{
 #ifdef HAVE_LOCALE_T
+		pg_libc_library *libc = PG_LIBC_LIB(locale);
+
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
+		result = libc->c_wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.libc.lt);
+		locale_t	save_locale = libc->c_uselocale(locale->info.libc.lt);
 
-		result = wcstombs(to, from, tolen);
+		result = libc->c_wcstombs(to, from, tolen);
 
-		uselocale(save_locale);
+		libc->c_uselocale(save_locale);
 #endif							/* HAVE_WCSTOMBS_L */
 #else							/* !HAVE_LOCALE_T */
 		/* Can't have locale != 0 without HAVE_LOCALE_T */
@@ -3087,21 +3196,23 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 		if (locale == (pg_locale_t) 0)
 		{
 			/* Use mbstowcs directly for the default locale */
-			result = mbstowcs(to, str, tolen);
+			result = default_libc->c_mbstowcs(to, str, tolen);
 		}
 		else
 		{
 #ifdef HAVE_LOCALE_T
+			pg_libc_library *libc = PG_LIBC_LIB(locale);
+
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
+			result = libc->c_mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.libc.lt);
+			locale_t	save_locale = libc->c_uselocale(locale->info.libc.lt);
 
-			result = mbstowcs(to, str, tolen);
+			result = libc->c_mbstowcs(to, str, tolen);
 
-			uselocale(save_locale);
+			libc->c_uselocale(save_locale);
 #endif							/* HAVE_MBSTOWCS_L */
 #else							/* !HAVE_LOCALE_T */
 			/* Can't have locale != 0 without HAVE_LOCALE_T */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index afb6c24d64..66a0e92a76 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -13,6 +13,9 @@
 #ifndef _PG_LOCALE_INTERNAL_
 #define _PG_LOCALE_INTERNAL_
 
+#include <wchar.h>
+#include <wctype.h>
+
 #ifdef USE_ICU
 #include <unicode/ubrk.h>
 #include <unicode/ucnv.h>
@@ -32,6 +35,75 @@
 #endif
 #endif
 
+typedef struct pg_libc_library
+{
+	/* version */
+#if defined(__GLIBC__)
+	const char *(*libc_version) (void);
+#elif defined(WIN32)
+	BOOL (*GetNLSVersionEx) (NLS_FUNCTION function, LPCWSTR lpLocaleName,
+							 LPNLSVERSIONINFOEX lpVersionInformation);
+#endif
+
+	/* locale */
+	char *(*c_setlocale) (int category, const char *locale);
+#ifdef HAVE_LOCALE_T
+#ifndef WIN32
+	locale_t (*c_newlocale) (int category_mask, const char *locale,
+						   locale_t base);
+	void (*c_freelocale) (locale_t locobj);
+	locale_t (*c_uselocale) (locale_t newloc);
+#ifdef LC_VERSION_MASK
+	const char *(*c_querylocale) (int mask, locale_t locale);
+#endif
+#else
+	locale_t (*_create_locale) (int category, const char *locale);
+#endif
+#endif
+
+	/* encoding */
+	size_t (*c_wcstombs) (char *dest, const wchar_t *src, size_t n);
+	size_t (*c_mbstowcs) (wchar_t *dest, const char *src, size_t n);
+#ifdef HAVE_LOCALE_T
+#ifdef HAVE_WCSTOMBS_L
+	size_t (*c_wcstombs_l) (char *dest, const wchar_t *src, size_t n,
+						  locale_t loc);
+#endif
+#ifdef HAVE_MBSTOWCS_L
+	size_t (*c_mbstowcs_l) (wchar_t *dest, const char *src, size_t n,
+						  locale_t loc);
+#endif
+#endif
+
+	/* collation */
+	int (*c_strcoll) (const char *s1, const char *s2);
+	int (*c_wcscoll) (const wchar_t *ws1, const wchar_t *ws2);
+	size_t (*c_strxfrm) (char *s1, const char * s2, size_t n);
+#ifdef HAVE_LOCALE_T
+	int (*c_strcoll_l) (const char *s1, const char *s2, locale_t locale);
+	int (*c_wcscoll_l) (const wchar_t *ws1, const wchar_t *ws2,
+					  locale_t locale);
+	size_t (*c_strxfrm_l) (char *s1, const char * s2, size_t n,
+						 locale_t locale);
+#endif
+
+	/* ctype */
+	int (*c_tolower) (int c);
+	int (*c_toupper) (int c);
+	int (*c_iswalnum) (wint_t wc);
+	wint_t (*c_towlower) (wint_t wc);
+	wint_t (*c_towupper) (wint_t wc);
+#ifdef HAVE_LOCALE_T
+	int (*c_tolower_l) (int c, locale_t locale);
+	int (*c_toupper_l) (int c, locale_t locale);
+	int (*c_iswalnum_l) (wint_t wc, locale_t locale);
+	wint_t (*c_towlower_l) (wint_t wc, locale_t locale);
+	wint_t (*c_towupper_l) (wint_t wc, locale_t locale);
+#endif
+} pg_libc_library;
+
+#define PG_LIBC_LIB(x) ((x)->info.libc.lib)
+
 #ifdef USE_ICU
 /*
  * An ICU library version that we're either linked against or have loaded at
@@ -146,12 +218,13 @@ struct pg_locale_struct
 	char	   *ctype;
 	union
 	{
-#ifdef HAVE_LOCALE_T
 		struct
 		{
+#ifdef HAVE_LOCALE_T
 			locale_t	lt;
-		}			libc;
 #endif
+			pg_libc_library *lib;
+		}			libc;
 #ifdef USE_ICU
 		struct
 		{
@@ -159,10 +232,16 @@ struct pg_locale_struct
 			pg_icu_library	*lib;
 		}			icu;
 #endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
 	}			info;
 };
 
+typedef pg_libc_library *(*get_libc_library_hook_type)(
+	const char *collate, const char *ctype, const char *version);
+
+extern PGDLLIMPORT get_libc_library_hook_type get_libc_library_hook;
+
+extern pg_libc_library *get_default_libc_library(void);
+
 #ifdef USE_ICU
 
 typedef pg_icu_library *(*get_icu_library_hook_type)(
-- 
2.34.1

v6-0005-Add-test-module-for-icu-collation-provider-hook.patch (text/x-patch; charset=UTF-8)

From 986d44ceeb6af2fffcd97f4afacad2bcef25feaa Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 21 Dec 2022 19:50:10 -0800
Subject: [PATCH v6 5/7] Add test module for icu collation provider hook.

---
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 .../modules/test_collation_lib_hooks/Makefile |  24 ++
 .../test_collation_lib_hooks/meson.build      |  37 +++
 .../test_collation_lib_hooks/t/001_icu.pl     | 153 ++++++++++++
 .../test_collation_lib_hooks.c                |  43 ++++
 .../test_collation_lib_hooks.control          |   4 +
 .../test_collation_lib_hooks.h                |  28 +++
 .../test_collation_lib_hooks/test_icu_hook.c  | 228 ++++++++++++++++++
 9 files changed, 519 insertions(+)
 create mode 100644 src/test/modules/test_collation_lib_hooks/Makefile
 create mode 100644 src/test/modules/test_collation_lib_hooks/meson.build
 create mode 100644 src/test/modules/test_collation_lib_hooks/t/001_icu.pl
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
 create mode 100644 src/test/modules/test_collation_lib_hooks/test_icu_hook.c

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..261bf5e729 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -15,6 +15,7 @@ SUBDIRS = \
 		  snapshot_too_old \
 		  spgist_name_ops \
 		  test_bloomfilter \
+		  test_collation_lib_hooks \
 		  test_copy_callbacks \
 		  test_custom_rmgrs \
 		  test_ddl_deparse \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..93ff0768c9 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -11,6 +11,7 @@ subdir('snapshot_too_old')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
+subdir('test_collation_lib_hooks')
 subdir('test_copy_callbacks')
 subdir('test_custom_rmgrs')
 subdir('test_ddl_deparse')
diff --git a/src/test/modules/test_collation_lib_hooks/Makefile b/src/test/modules/test_collation_lib_hooks/Makefile
new file mode 100644
index 0000000000..05948e555a
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/Makefile
@@ -0,0 +1,24 @@
+# src/test/modules/test_collation_lib_hooks/Makefile
+
+MODULE_big = test_collation_lib_hooks
+OBJS = \
+	$(WIN32RES) \
+	test_collation_lib_hooks.o test_icu_hook.o
+PGFILEDESC = "test_collation_lib_hooks - test collation provider library hooks"
+
+EXTENSION = test_collation_lib_hooks
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_collation_lib_hooks
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+export with_icu
diff --git a/src/test/modules/test_collation_lib_hooks/meson.build b/src/test/modules/test_collation_lib_hooks/meson.build
new file mode 100644
index 0000000000..56b32b6cd1
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/meson.build
@@ -0,0 +1,37 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_collation_lib_hooks_sources = files(
+  'test_collation_lib_hooks.c',
+  'test_icu_hook.c',
+)
+
+if host_system == 'windows'
+  test_collation_lib_hooks_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_collation_lib_hooks',
+    '--FILEDESC', 'test_collation_lib_hooks - test collation provider library hooks',])
+endif
+
+test_collation_lib_hooks = shared_module('test_collation_lib_hooks',
+  test_collation_lib_hooks_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_collation_lib_hooks
+
+install_data(
+  'test_collation_lib_hooks.control',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_collation_lib_hooks',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_icu.pl',
+    ],
+    'env': {
+      'with_icu': icu.found() ? 'yes' : 'no',
+    },
+  },
+}
diff --git a/src/test/modules/test_collation_lib_hooks/t/001_icu.pl b/src/test/modules/test_collation_lib_hooks/t/001_icu.pl
new file mode 100644
index 0000000000..e6f5372445
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/t/001_icu.pl
@@ -0,0 +1,153 @@
+# Copyright (c) 2021-2022, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{with_icu} ne 'yes')
+{
+	plan skip_all => 'ICU not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+$node->append_conf(
+	'postgresql.conf', q{
+shared_preload_libraries = 'test_collation_lib_hooks'
+});
+$node->start;
+
+$node->safe_psql('postgres',
+	q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' ICU_LOCALE 'DESC' ENCODING 'UTF8' TEMPLATE template0}
+);
+
+# setup
+$node->safe_psql('dbicu',
+	qq[CREATE COLLATION test_asc (PROVIDER=icu, LOCALE='ASC')]);
+$node->safe_psql('dbicu',
+	qq[CREATE COLLATION test_desc (PROVIDER=icu, LOCALE='DESC')]);
+
+$node->safe_psql('dbicu', qq[CREATE TABLE strings(t text)]);
+$node->safe_psql('dbicu',
+	qq[INSERT INTO strings VALUES ('aBcD'), ('fGhI'), ('wXyZ')]);
+
+# check versions
+
+my $version_db =
+  $node->safe_psql('dbicu',
+	  qq[SELECT datcollversion FROM pg_database WHERE datname='dbicu']);
+is($version_db, '2.72',
+	'database "dbicu" has correct version 2.72'
+);
+
+my $version_asc =
+  $node->safe_psql('dbicu',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_asc']);
+is($version_asc, '2.72',
+	'collation "test_asc" has correct version 2.72'
+);
+
+my $version_desc =
+  $node->safe_psql('dbicu',
+	  qq[SELECT collversion FROM pg_collation WHERE collname='test_desc']);
+is($version_desc, '2.72',
+	'collation "test_desc" has correct version 2.72'
+);
+
+my $res_sort_expected = "aBcD
+fGhI
+wXyZ";
+
+my $res_reversesort_expected = "wXyZ
+fGhI
+aBcD";
+
+# test comparison
+
+my $comparison =
+  $node->safe_psql('dbicu',
+	  qq[SELECT 'aBcD' COLLATE test_asc < 'wXyZ' COLLATE test_asc]);
+is($comparison, 't',
+	'correct comparison'
+);
+
+# test reverse comparison (database)
+
+my $dbcomparison_reverse =
+  $node->safe_psql('dbicu', qq[SELECT 'aBcD' < 'wXyZ']);
+is($dbcomparison_reverse, 'f',
+	'correct reverse comparison (database)'
+);
+
+# test reverse comparison
+
+my $comparison_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT 'aBcD' COLLATE test_desc < 'wXyZ' COLLATE test_desc]);
+is($comparison_reverse, 'f',
+	'correct reverse comparison'
+);
+
+# test asc sort
+
+my $res_sort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_asc]);
+is($res_sort, $res_sort_expected,
+	'correct ascending sort'
+);
+
+# test desc sort (database)
+
+my $res_db_reversesort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t]);
+is($res_db_reversesort, $res_reversesort_expected,
+	'correct descending sort (database)'
+);
+
+# test desc sort
+
+my $res_reversesort =
+  $node->safe_psql('dbicu',
+	  qq[SELECT t FROM strings ORDER BY t COLLATE test_desc]);
+is($res_reversesort, $res_reversesort_expected,
+	'correct descending sort'
+);
+
+# test lower/upper
+
+my $tcase =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_asc),
+                upper('aBcDfgHiwXyZ' collate test_asc)]);
+is($tcase, 'abcdfghiwxyz|ABCDFGHIWXYZ',
+	'correct lowercase and uppercase'
+);
+
+# test reverse lower/upper (database)
+
+my $tcase_db_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ'),
+                upper('aBcDfgHiwXyZ')]);
+is($tcase_db_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase (database)'
+);
+
+# test reverse lower/upper
+
+my $tcase_reverse =
+  $node->safe_psql('dbicu',
+	  qq[SELECT lower('aBcDfgHiwXyZ' collate test_desc),
+                upper('aBcDfgHiwXyZ' collate test_desc)]);
+is($tcase_reverse, 'ABCDFGHIWXYZ|abcdfghiwxyz',
+	'correct reverse lowercase and uppercase'
+);
+
+$node->stop;
+done_testing();
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
new file mode 100644
index 0000000000..599ec61239
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
@@ -0,0 +1,43 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collation_lib_hooks.c
+ *		Code for testing collation provider library hooks
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.c
+ *
+ * Test implementation of icu-like collation provider.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "miscadmin.h"
+
+#include "test_collation_lib_hooks.h"
+
+#ifdef USE_ICU
+static get_icu_library_hook_type prev_get_icu_library_hook = NULL;
+#endif
+
+PG_MODULE_MAGIC;
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+	if (!process_shared_preload_libraries_in_progress)
+		ereport(ERROR, (errmsg("test_collation_lib_hooks must be loaded via shared_preload_libraries")));
+
+#ifdef USE_ICU
+	prev_get_icu_library_hook = get_icu_library_hook;
+	get_icu_library_hook = test_get_icu_library;
+#endif
+}
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
new file mode 100644
index 0000000000..a0b8e031a4
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.control
@@ -0,0 +1,4 @@
+comment = 'Test code for collation provider library hooks'
+default_version = '1.0'
+module_pathname = '$libdir/test_collation_lib_hooks'
+
diff --git a/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
new file mode 100644
index 0000000000..e6ee457ab3
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
@@ -0,0 +1,28 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_collation_lib_hooks.h
+ *		Definitions for collation library hooks.
+ *
+ * Copyright (c) 2015-2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_collation_lib_hooks.h
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#ifndef TEST_COLLATION_LIB_HOOKS_H
+#define TEST_COLLATION_LIB_HOOKS_H
+
+#include "postgres.h"
+
+#include "utils/memutils.h"
+#include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
+
+#ifdef USE_ICU
+extern pg_icu_library *test_get_icu_library(const char *locale,
+											const char *version);
+#endif
+
+#endif
diff --git a/src/test/modules/test_collation_lib_hooks/test_icu_hook.c b/src/test/modules/test_collation_lib_hooks/test_icu_hook.c
new file mode 100644
index 0000000000..ae257cc03b
--- /dev/null
+++ b/src/test/modules/test_collation_lib_hooks/test_icu_hook.c
@@ -0,0 +1,228 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_icu_hook.c
+ *		Code for testing collation provider icu hook.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_collation_lib_hooks/test_icu_hook.c
+ *
+ * Implements a custom icu-like collation provider library for testing the
+ * hooks. It accepts any collation name requested. All behave exactly like the
+ * "en_US" locale, except for the locale named "DESC", which reverses the sort
+ * order and reverses uppercase/lowercase behavior.
+ *
+ * The version is always reported as 2.72, so loading it will cause a version
+ * mismatch warning.
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "test_collation_lib_hooks.h"
+
+#ifdef USE_ICU
+
+#include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
+
+#define TEST_LOCALE "en_US"
+
+typedef struct TestUCollator {
+	UCollator	*ucol;
+	bool		 reverse;
+} TestUCollator;
+
+static pg_icu_library *test_icu_library = NULL;
+static const UVersionInfo test_icu_version = { 2, 72 };
+
+static bool
+locale_is_reverse(const char *locale)
+{
+	if (strcmp(locale, "DESC") == 0)
+		return true;
+	else
+		return false;
+}
+
+static UCollator *
+test_openCollator(const char *loc, UErrorCode *status)
+{
+	TestUCollator *testcol = MemoryContextAlloc(TopMemoryContext, sizeof(TestUCollator));
+	UCollator *ucol = ucol_open(TEST_LOCALE, status);
+	testcol->ucol = ucol;
+	testcol->reverse = locale_is_reverse(loc);
+	return (UCollator *)testcol;
+}
+
+static void
+test_closeCollator(UCollator *coll)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_close(testcol->ucol);
+	pfree(testcol);
+}
+
+static void
+test_setAttribute(UCollator *coll, UColAttribute attr,
+				  UColAttributeValue value, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	ucol_setAttribute(testcol->ucol, attr, value, status);
+}
+
+static void
+test_getCollatorVersion(const UCollator *coll, UVersionInfo info)
+{
+	memcpy(info, test_icu_version, sizeof(UVersionInfo));
+}
+
+static UCollationResult
+test_strcoll(const UCollator *coll, const UChar *source, int32_t sourceLength,
+			 const UChar *target, int32_t targetLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcoll(testcol->ucol, source, sourceLength,
+										target, targetLength);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static UCollationResult
+test_strcollUTF8(const UCollator *coll, const char *source,
+				 int32_t sourceLength, const char *target,
+				 int32_t targetLength, UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	UCollationResult ret = ucol_strcollUTF8(testcol->ucol, source,
+											sourceLength, target,
+											targetLength, status);
+	if (testcol->reverse)
+		return -ret;
+	else
+		return ret;
+}
+
+static int32_t
+test_getSortKey(const UCollator *coll, const UChar *source,
+				int32_t sourceLength, uint8_t *result, int32_t resultLength)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_getSortKey(testcol->ucol, source, sourceLength,
+								  result, resultLength);
+	size_t result_size = ret + 1;
+
+	if (resultLength >= result_size)
+	{
+		result[ret] = '\0';
+
+		if (testcol->reverse)
+			for (int i = 0; i < result_size; i++)
+				*((unsigned char *) result + i) ^= (unsigned char) 0xff;
+	}
+
+	return result_size;
+}
+
+static int32_t
+test_nextSortKeyPart(const UCollator *coll, UCharIterator *iter,
+					 uint32_t state[2], uint8_t *dest, int32_t count,
+					 UErrorCode *status)
+{
+	TestUCollator *testcol = (TestUCollator *) coll;
+	int32_t ret = ucol_nextSortKeyPart(testcol->ucol, iter, state, dest,
+									   count, status);
+
+	if (testcol->reverse)
+		for (int i = 0; i < ret; i++)
+			*((unsigned char *) dest + i) ^= (unsigned char) 0xff;
+
+	/*
+	 * The following is not correct for cases where we finish precisely on the
+	 * boundary (i.e. count is exactly enough). To fix this we'd need to track
+	 * additional state across calls, which doesn't seem worth it for a test
+	 * case.
+	 */
+	if (count >= ret && ret > 0)
+	{
+		if (testcol->reverse)
+			dest[ret] = 0xff;
+		else
+			dest[ret] = '\0';
+		return ret + 1;
+	}
+
+	return ret;
+}
+
+static int32_t
+test_strToUpper(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+static int32_t
+test_strToLower(UChar *dest, int32_t destCapacity, const UChar *src,
+				int32_t srcLength, const char *locale, UErrorCode *pErrorCode)
+{
+	if (locale_is_reverse(locale))
+		return u_strToUpper(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+	else
+		return u_strToLower(dest, destCapacity, src, srcLength,
+							TEST_LOCALE, pErrorCode);
+}
+
+pg_icu_library *
+test_get_icu_library(const char *locale, const char *version)
+{
+	pg_icu_library *lib;
+
+	if (test_icu_library != NULL)
+		return test_icu_library;
+
+	ereport(LOG, (errmsg("loading custom ICU provider for test_collation_lib_hooks")));
+
+	lib = MemoryContextAlloc(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = test_openCollator;
+	lib->closeCollator = test_closeCollator;
+	lib->getCollatorVersion = test_getCollatorVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = test_strcoll;
+	lib->strcollUTF8 = test_strcollUTF8;
+	lib->getSortKey = test_getSortKey;
+	lib->nextSortKeyPart = test_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = test_strToUpper;
+	lib->strToLower = test_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = test_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	test_icu_library = lib;
+	return lib;
+}
+
+#endif				/* USE_ICU */
-- 
2.34.1
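
The "DESC" locale in the test module reverses sort keys by XORing each byte with 0xff. Below is a small standalone sketch of why that works and why the terminator byte has to be flipped along with the rest. The `demo_` functions are hypothetical, and the comparison rule used here (bytewise, with a key that is a prefix of another sorting first) is an assumption made for illustration; it may not match exactly how the backend compares transformed keys.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flip every bit of a binary sort key in place. */
void
demo_invert_key(unsigned char *key, size_t len)
{
	assert(key != NULL || len == 0);

	for (size_t i = 0; i < len; i++)
		key[i] ^= 0xff;
}

/*
 * Compare two binary keys byte by byte. If one key is a prefix of the
 * other, the shorter key sorts first.
 */
int
demo_compare_keys(const unsigned char *a, size_t alen,
				  const unsigned char *b, size_t blen)
{
	size_t		minlen = alen < blen ? alen : blen;
	int			cmp = memcmp(a, b, minlen);

	if (cmp != 0)
		return cmp;
	return (alen > blen) - (alen < blen);
}
```

Take the keys {0x2a, 0x00} and {0x2a, 0x2c, 0x00}, terminators included. Before inversion the first sorts lower because 0x00 < 0x2c; after inversion it sorts higher because 0xff > 0xd3. If the terminators are left out, the first key is still a bare prefix of the second after inversion and the order does not flip, which appears to be the reason test_getSortKey() includes the trailing byte in its XOR loop.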

v6-0004-Support-multiple-ICU-collation-provider-libraries.patch (text/x-patch; charset=UTF-8)

From 1fbd0b038c6ded968359fa2fa00019f9a9473011 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 7 Dec 2022 11:07:31 -0800
Subject: [PATCH v6 4/7] Support multiple ICU collation provider libraries.

Introduce structure pg_icu_library, which holds pointers to each
required ICU method, and store this as part of pg_locale_t. Each call
to an ICU function instead goes through this structure, so that it can
more easily be replaced by a non-builtin ICU library.

Offer a hook to allow an extension to control which ICU library is
loaded for a given locale and version.

Author: Thomas Munro, Jeff Davis
---
 src/backend/commands/collationcmds.c   |  17 +-
 src/backend/utils/adt/formatting.c     |  67 ++++--
 src/backend/utils/adt/pg_locale.c      | 299 ++++++++++++++++++-------
 src/include/utils/pg_locale_internal.h | 121 +++++++++-
 4 files changed, 398 insertions(+), 106 deletions(-)
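
A rough sketch of the kind of hook this patch has in mind: it keeps a small, fixed list of libraries keyed by the collation version recorded in the catalog and declines (returns NULL) for anything else, so the core code falls back to the built-in library. All `demo_` names and the version strings are made up for illustration, and the `loaded` flag only stands in for the dlopen()/dlsym() work a real extension would do. Note that get_collation_actual_version() in this patch passes NULL as the version, so a hook has to cope with that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_LIBS 2

/* Hypothetical stand-in for pg_icu_library; function pointers omitted. */
typedef struct demo_icu_library
{
	const char *version;		/* collation version this entry serves */
	bool		loaded;			/* has the (simulated) load happened? */
} demo_icu_library;

/* The fixed set of versions this imaginary extension can provide. */
demo_icu_library demo_libs[DEMO_MAX_LIBS] = {
	{"153.14", false},
	{"153.112", false},
};

/* Counts simulated loads, to show that resource use stays bounded. */
int			demo_load_calls = 0;

demo_icu_library *
demo_get_icu_library(const char *locale, const char *version)
{
	(void) locale;				/* this sketch selects on version only */

	if (version == NULL)
		return NULL;			/* no recorded version: decline */

	for (int i = 0; i < DEMO_MAX_LIBS; i++)
	{
		demo_icu_library *lib = &demo_libs[i];

		if (strcmp(lib->version, version) != 0)
			continue;

		if (!lib->loaded)
		{
			/* a real hook would dlopen() and resolve symbols here */
			lib->loaded = true;
			demo_load_calls++;
			assert(demo_load_calls <= DEMO_MAX_LIBS);
		}
		return lib;				/* same pointer on every later call */
	}

	return NULL;				/* unknown version: decline */
}
```

Because each entry is set up at most once and then reused, repeated lookups for the same version return the same pointer, which is what the "small finite list" wording in the hook comment asks for.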

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index d878be2fb8..217c13f066 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -566,13 +566,14 @@ get_icu_language_tag(const char *localename)
 {
 	char		buf[ULOC_FULLNAME_CAPACITY];
 	UErrorCode	status;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	uloc_toLanguageTag(localename, buf, sizeof(buf), true, &status);
+	iculib->toLanguageTag(localename, buf, sizeof(buf), true, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not convert locale name \"%s\" to language tag: %s",
-						localename, u_errorName(status))));
+						localename, iculib->errorName(status))));
 
 	return pstrdup(buf);
 }
@@ -591,11 +592,12 @@ get_icu_locale_comment(const char *localename)
 	int32		len_uchar;
 	int32		i;
 	char	   *result;
+	pg_icu_library *iculib = get_default_icu_library();
 
 	status = U_ZERO_ERROR;
-	len_uchar = uloc_getDisplayName(localename, "en",
-									displayname, lengthof(displayname),
-									&status);
+	len_uchar = iculib->getDisplayName(localename, "en",
+									   displayname, lengthof(displayname),
+									   &status);
 	if (U_FAILURE(status))
 		return NULL;			/* no good reason to raise an error */
 
@@ -921,12 +923,13 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 	{
 		int			i;
+		pg_icu_library *iculib = get_default_icu_library();
 
 		/*
 		 * Start the loop at -1 to sneak in the root locale without too much
 		 * code duplication.
 		 */
-		for (i = -1; i < uloc_countAvailable(); i++)
+		for (i = -1; i < iculib->countAvailable(); i++)
 		{
 			const char *name;
 			char	   *langtag;
@@ -937,7 +940,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
 			if (i == -1)
 				name = "";		/* ICU root locale */
 			else
-				name = uloc_getAvailable(i);
+				name = iculib->getAvailable(i);
 
 			langtag = get_icu_language_tag(name);
 			iculocstr = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f2fed8a2d8..40bfad075b 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1566,6 +1566,11 @@ typedef int32_t (*ICU_Convert_Func) (UChar *dest, int32_t destCapacity,
 									 const UChar *src, int32_t srcLength,
 									 const char *locale,
 									 UErrorCode *pErrorCode);
+typedef int32_t (*ICU_Convert_BI_Func) (UChar *dest, int32_t destCapacity,
+										const UChar *src, int32_t srcLength,
+										UBreakIterator *bi,
+										const char *locale,
+										UErrorCode *pErrorCode);
 
 static int32_t
 icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
@@ -1573,6 +1578,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 {
 	UErrorCode	status;
 	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
 	len_dest = len_source;		/* try first with same length */
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
@@ -1590,18 +1596,42 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("case conversion failed: %s", u_errorName(status))));
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
 	return len_dest;
 }
 
+/*
+ * Like icu_convert_case, but func takes a break iterator (which we don't
+ * make use of).
+ */
 static int32_t
-u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
-						const UChar *src, int32_t srcLength,
-						const char *locale,
-						UErrorCode *pErrorCode)
+icu_convert_case_bi(ICU_Convert_BI_Func func, pg_locale_t mylocale,
+					UChar **buff_dest, UChar *buff_source, int32_t len_source)
 {
-	return u_strToTitle(dest, destCapacity, src, srcLength,
-						NULL, locale, pErrorCode);
+	UErrorCode	status;
+	int32_t		len_dest;
+	pg_icu_library *iculib = PG_ICU_LIB(mylocale);
+
+	len_dest = len_source;		/* try first with same length */
+	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+	status = U_ZERO_ERROR;
+	len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+					mylocale->ctype, &status);
+	if (status == U_BUFFER_OVERFLOW_ERROR)
+	{
+		/* try again with adjusted length */
+		pfree(*buff_dest);
+		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
+		status = U_ZERO_ERROR;
+		len_dest = func(*buff_dest, len_dest, buff_source, len_source, NULL,
+						mylocale->ctype, &status);
+	}
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("case conversion failed: %s",
+						iculib->errorName(status))));
+	return len_dest;
 }
 
 #endif							/* USE_ICU */
@@ -1667,11 +1697,12 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 			int32_t		len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToLower, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToLower, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1789,11 +1820,12 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToUpper, mylocale,
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case(iculib->strToUpper, mylocale,
 										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
@@ -1912,11 +1944,12 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 						len_conv;
 			UChar	   *buff_uchar;
 			UChar	   *buff_conv;
+			pg_icu_library *iculib = PG_ICU_LIB(mylocale);
 
-			len_uchar = icu_to_uchar(&buff_uchar, buff, nbytes);
-			len_conv = icu_convert_case(u_strToTitle_default_BI, mylocale,
-										&buff_conv, buff_uchar, len_uchar);
-			icu_from_uchar(&result, buff_conv, len_conv);
+			len_uchar = icu_to_uchar(iculib, &buff_uchar, buff, nbytes);
+			len_conv = icu_convert_case_bi(iculib->strToTitle, mylocale,
+										   &buff_conv, buff_uchar, len_uchar);
+			icu_from_uchar(iculib, &result, buff_conv, len_conv);
 			pfree(buff_uchar);
 			pfree(buff_conv);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index c9e63f5ed4..2c911fb111 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -70,6 +70,8 @@
 
 #ifdef USE_ICU
 #include <unicode/ucnv.h>
+#include <unicode/ulocdata.h>
+#include <unicode/ustring.h>
 #endif
 
 #ifdef __GLIBC__
@@ -110,6 +112,41 @@ char	   *localized_full_days[7 + 1];
 char	   *localized_abbrev_months[12 + 1];
 char	   *localized_full_months[12 + 1];
 
+/*
+ * The precise version of a collation provider library is important, because
+ * subtle changes in collation between library versions can corrupt
+ * indexes. This hook allows control over how collation provider libraries are
+ * loaded.
+ *
+ * If the hook is not set, or if it returns NULL, Postgres constructs the
+ * pg_icu_library structure from the symbols Postgres is linked with at build
+ * time.
+ *
+ * If the hook is set, it can instead construct the pg_icu_library structure
+ * using custom logic. Ordinarily, this custom logic would involve finding a
+ * specific known version of the collation provider library, and dynamically
+ * loading the necessary symbols. If the collation version obtained from the
+ * library does not match the collation version recorded in the catalog,
+ * Postgres will issue a WARNING.
+ *
+ * The returned structure must be allocated in TopMemoryContext, and the
+ * associated symbols must remain valid permanently. It's expected that the
+ * hook will maintain a small finite list of libraries and return a pointer to
+ * the appropriate one, so that resource use remains bounded.
+ *
+ * XXX: For now, the only information the hook has access to is the ICU locale
+ * name, and the collation version as recorded in the catalog. We should
+ * consider what other information can be provided to allow for greater
+ * control over which library is loaded.
+ */
+#ifdef USE_ICU
+get_icu_library_hook_type get_icu_library_hook = NULL;
+#endif
+
+#ifdef USE_ICU
+static pg_icu_library *builtin_icu_library = NULL;
+#endif
+
 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
 static bool CurrentLCTimeValid = false;
@@ -136,8 +173,12 @@ static char *IsoLocaleName(const char *);
  * Database default locale.
  */
 static pg_locale_t default_locale = NULL;
+#ifdef USE_ICU
+static pg_icu_library *default_icu = NULL;
+#endif
 
 #ifdef USE_ICU
+
 /*
  * Converter object for converting between ICU's UChar strings and C strings
  * in database encoding.  Since the database encoding doesn't change, we only
@@ -145,13 +186,17 @@ static pg_locale_t default_locale = NULL;
  */
 static UConverter *icu_converter = NULL;
 
-static void init_icu_converter(void);
-static size_t uchar_length(UConverter *converter,
+static void init_icu_converter(pg_icu_library *iculib);
+static size_t uchar_length(pg_icu_library *iculib,
+						   UConverter *converter,
 						   const char *str, size_t len);
-static int32_t uchar_convert(UConverter *converter,
+static int32_t uchar_convert(pg_icu_library *iculib,
+							 UConverter *converter,
 							 UChar *dest, int32_t destlen,
 							 const char *str, size_t srclen);
-static void icu_set_collation_attributes(UCollator *collator, const char *loc);
+static void icu_set_collation_attributes(pg_icu_library *iculib,
+										 UCollator *collator,
+										 const char *loc);
 #endif
 
 /*
@@ -1458,6 +1503,78 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+#ifdef USE_ICU
+static pg_icu_library *
+get_builtin_icu_library(void)
+{
+	pg_icu_library *lib;
+
+	if (builtin_icu_library != NULL)
+		return builtin_icu_library;
+
+	/*
+	 * These assignments will fail to compile if an incompatible API change is
+	 * made to some future version of ICU, at which point we might need to
+	 * consider special treatment for different major version ranges, with
+	 * intermediate trampoline functions.
+	 */
+	lib = MemoryContextAllocZero(TopMemoryContext, sizeof(pg_icu_library));
+	lib->getICUVersion = u_getVersion;
+	lib->getUnicodeVersion = u_getUnicodeVersion;
+	lib->getCLDRVersion = ulocdata_getCLDRVersion;
+	lib->openCollator = ucol_open;
+	lib->closeCollator = ucol_close;
+	lib->getCollatorVersion = ucol_getVersion;
+	lib->getUCAVersion = ucol_getUCAVersion;
+	lib->versionToString = u_versionToString;
+	lib->strcoll = ucol_strcoll;
+	lib->strcollUTF8 = ucol_strcollUTF8;
+	lib->getSortKey = ucol_getSortKey;
+	lib->nextSortKeyPart = ucol_nextSortKeyPart;
+	lib->setUTF8 = uiter_setUTF8;
+	lib->errorName = u_errorName;
+	lib->strToUpper = u_strToUpper;
+	lib->strToLower = u_strToLower;
+	lib->strToTitle = u_strToTitle;
+	lib->setAttribute = ucol_setAttribute;
+	lib->openConverter = ucnv_open;
+	lib->closeConverter = ucnv_close;
+	lib->fromUChars = ucnv_fromUChars;
+	lib->toUChars = ucnv_toUChars;
+	lib->toLanguageTag = uloc_toLanguageTag;
+	lib->getDisplayName = uloc_getDisplayName;
+	lib->countAvailable = uloc_countAvailable;
+	lib->getAvailable = uloc_getAvailable;
+
+	/*
+	 * Also assert the size of a couple of types used as output buffers, as a
+	 * canary to tell us to add extra padding in the (unlikely) event that a
+	 * later release makes these values smaller.
+	 */
+	StaticAssertStmt(U_MAX_VERSION_STRING_LENGTH == 20,
+					 "u_versionToString output buffer size changed incompatibly");
+	StaticAssertStmt(U_MAX_VERSION_LENGTH == 4,
+					 "ucol_getVersion output buffer size changed incompatibly");
+
+	builtin_icu_library = lib;
+	return lib;
+}
+
+static pg_icu_library *
+get_icu_library(const char *locale, const char *version)
+{
+	pg_icu_library *lib = NULL;
+
+	if (get_icu_library_hook != NULL)
+		lib = get_icu_library_hook(locale, version);
+
+	if (!lib)
+		lib = get_builtin_icu_library();
+
+	return lib;
+}
+
+#endif
 
 /*
  * Construct a new pg_locale_t object.
@@ -1554,20 +1671,24 @@ pg_newlocale(char provider, bool deterministic, const char *collate,
 	{
 		UCollator  *collator;
 		UErrorCode	status;
+		pg_icu_library *iculib = get_icu_library(collate, version);
+
+		Assert(strcmp(collate, ctype) == 0);
 
 		/* collator may be leaked if we encounter an error */
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collate, &status);
+		collator = iculib->openCollator(collate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collate, u_errorName(status))));
+							collate, iculib->errorName(status))));
 
 		if (U_ICU_VERSION_MAJOR_NUM < 54)
-			icu_set_collation_attributes(collator, collate);
+			icu_set_collation_attributes(iculib, collator, collate);
 
 		result->info.icu.ucol = collator;
+		result->info.icu.lib = iculib;
 	}
 #endif
 	else
@@ -1589,6 +1710,14 @@ pg_locale_deterministic(pg_locale_t locale)
 		return locale->deterministic;
 }
 
+#ifdef USE_ICU
+pg_icu_library *
+get_default_icu_library(void)
+{
+	return default_icu;
+}
+#endif
+
 /*
  * Initialize default database locale.
  */
@@ -1596,6 +1725,10 @@ void
 init_default_locale(char provider, const char *collate, const char *ctype,
 					const char *iculocale, const char *version)
 {
+#ifdef USE_ICU
+	default_icu = get_icu_library(iculocale, version);
+#endif
+
 	/*
 	 * For the purposes of pg_locale_t, if the provider is ICU, we use
 	 * iculocale for both collate and ctype.
@@ -1759,17 +1892,18 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		UErrorCode	status;
 		UVersionInfo versioninfo;
 		char		buf[U_MAX_VERSION_STRING_LENGTH];
+		pg_icu_library	*iculib = get_icu_library(collcollate, NULL);
 
 		status = U_ZERO_ERROR;
-		collator = ucol_open(collcollate, &status);
+		collator = iculib->openCollator(collcollate, &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("could not open collator for locale \"%s\": %s",
-							collcollate, u_errorName(status))));
-		ucol_getVersion(collator, versioninfo);
-		ucol_close(collator);
+							collcollate, iculib->errorName(status))));
+		iculib->getCollatorVersion(collator, versioninfo);
+		iculib->closeCollator(collator);
 
-		u_versionToString(versioninfo, buf);
+		iculib->versionToString(versioninfo, buf);
 		collversion = pstrdup(buf);
 	}
 	else
@@ -2021,16 +2155,17 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	UChar	*uchar1,
 			*uchar2;
 	int		 result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 #ifdef HAVE_UCOL_STRCOLLUTF8
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 #endif
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen1 = uchar_length(icu_converter, arg1, len1);
-	ulen2 = uchar_length(icu_converter, arg2, len2);
+	ulen1 = uchar_length(iculib, icu_converter, arg1, len1);
+	ulen2 = uchar_length(iculib, icu_converter, arg2, len2);
 
 	bufsize1 = (ulen1 + 1) * sizeof(UChar);
 	bufsize2 = (ulen2 + 1) * sizeof(UChar);
@@ -2041,12 +2176,12 @@ pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
 	uchar1 = (UChar *) buf;
 	uchar2 = (UChar *) (buf + bufsize1);
 
-	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
-	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+	ulen1 = uchar_convert(iculib, icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(iculib, icu_converter, uchar2, ulen2 + 1, arg2, len2);
 
-	result = ucol_strcoll(locale->info.icu.ucol,
-						  uchar1, ulen1,
-						  uchar2, ulen2);
+	result = iculib->strcoll(locale->info.icu.ucol,
+							 uchar1, ulen1,
+							 uchar2, ulen2);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2067,6 +2202,7 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 				pg_locale_t locale)
 {
 	int result;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
@@ -2076,13 +2212,14 @@ pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
 		UErrorCode	status;
 
 		status = U_ZERO_ERROR;
-		result = ucol_strcollUTF8(locale->info.icu.ucol,
-								  arg1, len1,
-								  arg2, len2,
-								  &status);
+		result = iculib->strcollUTF8(locale->info.icu.ucol,
+									 arg1, len1,
+									 arg2, len2,
+									 &status);
 		if (U_FAILURE(status))
 			ereport(ERROR,
-					(errmsg("collation failed: %s", u_errorName(status))));
+					(errmsg("collation failed: %s",
+							iculib->errorName(status))));
 	}
 	else
 #endif
@@ -2261,12 +2398,13 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 	int32_t	 ulen;
 	size_t   uchar_bsize;
 	Size	 result_bsize;
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2275,11 +2413,11 @@ pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
-	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
-								   uchar, ulen,
-								   (uint8_t *) dest, destsize);
+	result_bsize = iculib->getSortKey(locale->info.icu.ucol,
+									  uchar, ulen,
+									  (uint8_t *) dest, destsize);
 
 	if (buf != sbuf)
 		pfree(buf);
@@ -2308,13 +2446,14 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 	UChar			*uchar = NULL;
 	size_t			 uchar_bsize;
 	Size			 result_bsize;
+	pg_icu_library	*iculib = PG_ICU_LIB(locale);
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
 	Assert(GetDatabaseEncoding() != PG_UTF8);
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	ulen = uchar_length(icu_converter, src, srclen);
+	ulen = uchar_length(iculib, icu_converter, src, srclen);
 
 	uchar_bsize = (ulen + 1) * sizeof(UChar);
 
@@ -2323,21 +2462,19 @@ pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
 
 	uchar = (UChar *) buf;
 
-	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+	ulen = uchar_convert(iculib, icu_converter, uchar, ulen + 1, src, srclen);
 
 	uiter_setString(&iter, uchar, ulen);
 	state[0] = state[1] = 0;	/* won't need that again */
 	status = U_ZERO_ERROR;
-	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
-										&iter,
-										state,
-										(uint8_t *) dest,
-										destsize,
-										&status);
+	result_bsize = iculib->nextSortKeyPart(
+		locale->info.icu.ucol, &iter, state,
+		(uint8_t *) dest, destsize, &status);
+
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("sort key generation failed: %s",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return result_bsize;
 }
@@ -2346,6 +2483,7 @@ static size_t
 pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 					   size_t destsize, pg_locale_t locale)
 {
+	pg_icu_library *iculib = PG_ICU_LIB(locale);
 	size_t result;
 
 	Assert(locale->provider == COLLPROVIDER_ICU);
@@ -2356,19 +2494,17 @@ pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
 		uint32_t	state[2];
 		UErrorCode	status;
 
-		uiter_setUTF8(&iter, src, srclen);
+		iculib->setUTF8(&iter, src, srclen);
 		state[0] = state[1] = 0;	/* won't need that again */
 		status = U_ZERO_ERROR;
-		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
-									  &iter,
-									  state,
-									  (uint8_t *) dest,
-									  destsize,
-									  &status);
+		result = iculib->nextSortKeyPart(
+			locale->info.icu.ucol, &iter, state,
+			(uint8_t *) dest, destsize, &status);
+
 		if (U_FAILURE(status))
 			ereport(ERROR,
 					(errmsg("sort key generation failed: %s",
-							u_errorName(status))));
+							iculib->errorName(status))));
 	}
 	else
 		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
@@ -2564,7 +2700,7 @@ pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 
 #ifdef USE_ICU
 static void
-init_icu_converter(void)
+init_icu_converter(pg_icu_library *iculib)
 {
 	const char *icu_encoding_name;
 	UErrorCode	status;
@@ -2581,11 +2717,11 @@ init_icu_converter(void)
 						pg_encoding_to_char(GetDatabaseEncoding()))));
 
 	status = U_ZERO_ERROR;
-	conv = ucnv_open(icu_encoding_name, &status);
+	conv = iculib->openConverter(icu_encoding_name, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open ICU converter for encoding \"%s\": %s",
-						icu_encoding_name, u_errorName(status))));
+						icu_encoding_name, iculib->errorName(status))));
 
 	icu_converter = conv;
 }
@@ -2594,14 +2730,15 @@ init_icu_converter(void)
  * Find length, in UChars, of given string if converted to UChar string.
  */
 static size_t
-uchar_length(UConverter *converter, const char *str, size_t len)
+uchar_length(pg_icu_library *iculib, UConverter *converter, const char *str, size_t len)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
-	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	ulen = iculib->toUChars(converter, NULL, 0, str, len, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2610,16 +2747,17 @@ uchar_length(UConverter *converter, const char *str, size_t len)
  * return the length (in UChars).
  */
 static int32_t
-uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
-			  const char *src, size_t srclen)
+uchar_convert(pg_icu_library *iculib, UConverter *converter, UChar *dest,
+			  int32_t destlen, const char *src, size_t srclen)
 {
 	UErrorCode	status = U_ZERO_ERROR;
 	int32_t		ulen;
 	status = U_ZERO_ERROR;
-	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	ulen = iculib->toUChars(converter, dest, destlen, src, srclen, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+				(errmsg("%s failed: %s", "ucnv_toUChars",
+						iculib->errorName(status))));
 	return ulen;
 }
 
@@ -2636,16 +2774,17 @@ uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
  * result length instead.
  */
 int32_t
-icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
+icu_to_uchar(pg_icu_library *iculib, UChar **buff_uchar, const char *buff,
+			 size_t nbytes)
 {
 	int32_t len_uchar;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
-	len_uchar = uchar_length(icu_converter, buff, nbytes);
+	len_uchar = uchar_length(iculib, icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-	len_uchar = uchar_convert(icu_converter,
+	len_uchar = uchar_convert(iculib, icu_converter,
 							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
@@ -2663,30 +2802,32 @@ icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
  * The result string is nul-terminated.
  */
 int32_t
-icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
+icu_from_uchar(pg_icu_library *iculib, char **result, const UChar *buff_uchar,
+			   int32_t len_uchar)
 {
 	UErrorCode	status;
 	int32_t		len_result;
 
-	init_icu_converter();
+	init_icu_converter(iculib);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, NULL, 0,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, NULL, 0,
+									buff_uchar, len_uchar, &status);
 	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	*result = palloc(len_result + 1);
 
 	status = U_ZERO_ERROR;
-	len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1,
-								 buff_uchar, len_uchar, &status);
+	len_result = iculib->fromUChars(icu_converter, *result,
+									len_result + 1, buff_uchar,
+									len_uchar, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("%s failed: %s", "ucnv_fromUChars",
-						u_errorName(status))));
+						iculib->errorName(status))));
 
 	return len_result;
 }
@@ -2702,7 +2843,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
  */
 pg_attribute_unused()
 static void
-icu_set_collation_attributes(UCollator *collator, const char *loc)
+icu_set_collation_attributes(pg_icu_library *iculib, UCollator *collator,
+							 const char *loc)
 {
 	char	   *str = asc_tolower(loc, strlen(loc));
 
@@ -2776,7 +2918,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 				status = U_ILLEGAL_ARGUMENT_ERROR;
 
 			if (status == U_ZERO_ERROR)
-				ucol_setAttribute(collator, uattr, uvalue, &status);
+				iculib->setAttribute(collator, uattr, uvalue, &status);
 
 			/*
 			 * Pretend the error came from ucol_open(), for consistent error
@@ -2785,7 +2927,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc)
 			if (U_FAILURE(status))
 				ereport(ERROR,
 						(errmsg("could not open collator for locale \"%s\": %s",
-								loc, u_errorName(status))));
+								loc, iculib->errorName(status))));
 		}
 	}
 }
@@ -2801,17 +2943,18 @@ check_icu_locale(const char *icu_locale)
 #ifdef USE_ICU
 	UCollator	*collator;
 	UErrorCode   status;
+	pg_icu_library *iculib = get_icu_library(icu_locale, NULL);
 
 	status = U_ZERO_ERROR;
-	collator = ucol_open(icu_locale, &status);
+	collator = iculib->openCollator(icu_locale, &status);
 	if (U_FAILURE(status))
 		ereport(ERROR,
 				(errmsg("could not open collator for locale \"%s\": %s",
-						icu_locale, u_errorName(status))));
+						icu_locale, iculib->errorName(status))));
 
 	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, icu_locale);
-	ucol_close(collator);
+		icu_set_collation_attributes(iculib, collator, icu_locale);
+	iculib->closeCollator(collator);
 #else
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
index 33465ad92d..afb6c24d64 100644
--- a/src/include/utils/pg_locale_internal.h
+++ b/src/include/utils/pg_locale_internal.h
@@ -14,6 +14,8 @@
 #define _PG_LOCALE_INTERNAL_
 
 #ifdef USE_ICU
+#include <unicode/ubrk.h>
+#include <unicode/ucnv.h>
 #include <unicode/ucol.h>
 #endif
 
@@ -30,6 +32,106 @@
 #endif
 #endif
 
+#ifdef USE_ICU
+/*
+ * An ICU library version that we're either linked against or have loaded at
+ * runtime.
+ */
+typedef struct pg_icu_library
+{
+	int			major_version;
+	int			minor_version;
+	void		(*getICUVersion) (UVersionInfo info);
+	void		(*getUnicodeVersion) (UVersionInfo info);
+	void		(*getCLDRVersion) (UVersionInfo info, UErrorCode *status);
+	UCollator  *(*openCollator) (const char *loc, UErrorCode *status);
+	void		(*closeCollator) (UCollator *coll);
+	void		(*getCollatorVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*getUCAVersion) (const UCollator *coll, UVersionInfo info);
+	void		(*versionToString) (const UVersionInfo versionArray,
+									char *versionString);
+	UCollationResult (*strcoll) (const UCollator *coll,
+								 const UChar *source,
+								 int32_t sourceLength,
+								 const UChar *target,
+								 int32_t targetLength);
+	UCollationResult (*strcollUTF8) (const UCollator *coll,
+									 const char *source,
+									 int32_t sourceLength,
+									 const char *target,
+									 int32_t targetLength,
+									 UErrorCode *status);
+	int32_t		(*getSortKey) (const UCollator *coll,
+							   const UChar *source,
+							   int32_t sourceLength,
+							   uint8_t *result,
+							   int32_t resultLength);
+	int32_t		(*nextSortKeyPart) (const UCollator *coll,
+									UCharIterator *iter,
+									uint32_t state[2],
+									uint8_t *dest,
+									int32_t count,
+									UErrorCode *status);
+	void		(*setUTF8) (UCharIterator *iter,
+							const char *s,
+							int32_t length);
+	const char *(*errorName) (UErrorCode code);
+	int32_t		(*strToUpper) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToLower) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	int32_t		(*strToTitle) (UChar *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UBreakIterator *titleIter,
+							   const char *locale,
+							   UErrorCode *pErrorCode);
+	void		(*setAttribute) (UCollator *coll,
+								 UColAttribute attr,
+								 UColAttributeValue value,
+								 UErrorCode *status);
+	UConverter *(*openConverter) (const char *converterName,
+								  UErrorCode *err);
+	void		(*closeConverter) (UConverter *converter);
+	int32_t		(*fromUChars) (UConverter *cnv,
+							   char *dest,
+							   int32_t destCapacity,
+							   const UChar *src,
+							   int32_t srcLength,
+							   UErrorCode *pErrorCode);
+	int32_t		(*toUChars) (UConverter *cnv,
+							 UChar *dest,
+							 int32_t destCapacity,
+							 const char *src,
+							 int32_t srcLength,
+							 UErrorCode *pErrorCode);
+	int32_t		(*toLanguageTag) (const char *localeID,
+								  char *langtag,
+								  int32_t langtagCapacity,
+								  UBool strict,
+								  UErrorCode *err);
+	int32_t		(*getDisplayName) (const char *localeID,
+								   const char *inLocaleID,
+								   UChar *result,
+								   int32_t maxResultSize,
+								   UErrorCode *err);
+	int32_t		(*countAvailable) (void);
+	const char *(*getAvailable) (int32_t n);
+} pg_icu_library;
+
+#define PG_ICU_LIB(x) ((x)->info.icu.lib)
+
+#endif
+
 /*
  * We define our own wrapper around locale_t so we can keep the same
  * function signatures for all builds, while not having to create a
@@ -53,7 +155,8 @@ struct pg_locale_struct
 #ifdef USE_ICU
 		struct
 		{
-			UCollator	*ucol;
+			UCollator		*ucol;
+			pg_icu_library	*lib;
 		}			icu;
 #endif
 		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
@@ -61,8 +164,18 @@ struct pg_locale_struct
 };
 
 #ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
+
+typedef pg_icu_library *(*get_icu_library_hook_type)(
+	const char *locale, const char *version);
+
+extern PGDLLIMPORT get_icu_library_hook_type get_icu_library_hook;
+
+extern pg_icu_library *get_default_icu_library(void);
+extern int32_t icu_to_uchar(pg_icu_library *lib, UChar **buff_uchar,
+							const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(pg_icu_library *lib, char **result,
+							  const UChar *buff_uchar, int32_t len_uchar);
+
+#endif							/* USE_ICU */
 
 #endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1
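
The hook dispatch in get_icu_library() above can be illustrated outside of PostgreSQL. What follows is only a minimal sketch under stated assumptions: toy_library, get_library_hook, example_hook and the version string "2" are hypothetical names invented for illustration (they are not part of the patch), and strcmp() stands in for the ICU entry points that the real pg_icu_library table carries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for pg_icu_library: a table of entry points. */
typedef struct toy_library
{
	const char *name;
	int			(*compare) (const char *a, const char *b);
} toy_library;

/* Same shape as get_icu_library_hook_type: (locale, version) -> library. */
typedef toy_library *(*get_library_hook_type) (const char *locale,
												const char *version);

/* NULL unless an extension installs a hook. */
static get_library_hook_type get_library_hook = NULL;

static int
builtin_compare(const char *a, const char *b)
{
	return strcmp(a, b);
}

static toy_library builtin_library = {"builtin", builtin_compare};

/* Ask the hook first; fall back to the built-in table if it declines. */
static toy_library *
get_library(const char *locale, const char *version)
{
	toy_library *lib = NULL;

	if (get_library_hook != NULL)
		lib = get_library_hook(locale, version);

	if (lib == NULL)
		lib = &builtin_library;

	return lib;
}

/* Hypothetical extension code: a different table for version "2" only. */
static int
reverse_compare(const char *a, const char *b)
{
	return strcmp(b, a);
}

static toy_library reversed_library = {"reversed", reverse_compare};

static toy_library *
example_hook(const char *locale, const char *version)
{
	(void) locale;				/* unused in this sketch */

	if (version != NULL && strcmp(version, "2") == 0)
		return &reversed_library;

	return NULL;				/* decline; the caller uses the built-in */
}

/* Walk through the three cases: no hook, hook accepts, hook declines. */
static void
check_dispatch(void)
{
	get_library_hook = NULL;
	assert(strcmp(get_library("en", NULL)->name, "builtin") == 0);

	get_library_hook = example_hook;
	assert(strcmp(get_library("en", "2")->name, "reversed") == 0);
	assert(strcmp(get_library("en", "1")->name, "builtin") == 0);
}
```

In this sketch a hook that returns NULL leaves the built-in table in place, which mirrors the `if (!lib)` fallback in get_icu_library(). Callers keep the returned pointer alongside the collator, as pg_newlocale() does with result->info.icu.lib, so that later calls go through the same table.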

v6-0003-Refactor-pg_locale_t-routines.patch (text/x-patch; charset=UTF-8)

From da30643a89f0c1e45a238ba2682d5c44bc9832d4 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 5 Dec 2022 10:43:52 -0800
Subject: [PATCH v6 3/7] Refactor pg_locale_t routines.

  * add pg_locale_internal.h to hide pg_locale_struct
  * move info.lt into info.libc.lt to match icu
  * introduce init_default_locale()
  * introduce pg_locale_deterministic() accessor
  * make default_locale a static global in pg_locale.c
  * refactor pg_newlocale_from_collation()
---
 src/backend/access/hash/hashfunc.c     |  82 +++---
 src/backend/commands/collationcmds.c   |   1 +
 src/backend/regex/regc_pg_locale.c     |  45 ++--
 src/backend/utils/adt/formatting.c     |  25 +-
 src/backend/utils/adt/like.c           |   3 +-
 src/backend/utils/adt/like_support.c   |   3 +-
 src/backend/utils/adt/pg_locale.c      | 342 +++++++++++++++----------
 src/backend/utils/adt/varchar.c        |  62 ++---
 src/backend/utils/adt/varlena.c        |   8 +-
 src/backend/utils/init/postinit.c      |  29 ++-
 src/include/utils/pg_locale.h          |  55 +---
 src/include/utils/pg_locale_internal.h |  68 +++++
 12 files changed, 400 insertions(+), 323 deletions(-)
 create mode 100644 src/include/utils/pg_locale_internal.h
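
The pg_locale_deterministic() accessor listed in the commit message replaces open-coded tests such as `!mylocale || mylocale->deterministic` in hashtext(). A minimal sketch of that idea follows, assuming simplified hypothetical types: toy_locale_struct, toy_locale_deterministic and toy_hash_strategy are illustration-only names, and the real struct has more fields than shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down version of the struct hidden in the internal header. */
struct toy_locale_struct
{
	char		provider;
	bool		deterministic;
};

typedef struct toy_locale_struct *toy_locale_t;

/*
 * Accessor: callers no longer look inside the struct. A NULL locale is
 * treated as deterministic, matching the old "!mylocale ||" test.
 */
static bool
toy_locale_deterministic(toy_locale_t locale)
{
	if (locale == NULL)
		return true;

	return locale->deterministic;
}

/*
 * Caller in the style of hashtext(): deterministic collations hash the raw
 * bytes, nondeterministic ones hash a transformed sort key instead.
 */
static const char *
toy_hash_strategy(toy_locale_t locale)
{
	return toy_locale_deterministic(locale) ? "raw bytes" : "sort key";
}

/* Exercise both branches, including the NULL case. */
static void
check_strategy(void)
{
	struct toy_locale_struct nondet = {'i', false};

	assert(strcmp(toy_hash_strategy(NULL), "raw bytes") == 0);
	assert(strcmp(toy_hash_strategy(&nondet), "sort key") == 0);
}
```

With the accessor in place, callers such as the hash functions need neither the struct definition nor a provider-specific #ifdef, which appears to be what allows the USE_ICU branches to be dropped from hashfunc.c in this patch.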

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index c0ed995919..7cbd39f466 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -282,36 +282,28 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 						  VARSIZE_ANY_EXHDR(key));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -337,7 +329,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
 								   VARSIZE_ANY_EXHDR(key),
@@ -345,30 +337,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
+
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 6a4311cc63..d878be2fb8 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -36,6 +36,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 323f00516d..3dc89b0336 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -17,6 +17,7 @@
 
 #include "catalog/pg_collation.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 /*
  * To provide as much functionality as possible on a variety of platforms,
@@ -306,13 +307,13 @@ pg_wc_isdigit(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswdigit_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isdigit_l((unsigned char) c, pg_regex_locale->info.lt));
+					isdigit_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -342,13 +343,13 @@ pg_wc_isalpha(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalpha_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalpha_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalpha_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalpha_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -378,13 +379,13 @@ pg_wc_isalnum(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalnum_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalnum_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalnum_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -423,13 +424,13 @@ pg_wc_isupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isupper_l((unsigned char) c, pg_regex_locale->info.lt));
+					isupper_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -459,13 +460,13 @@ pg_wc_islower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					islower_l((unsigned char) c, pg_regex_locale->info.lt));
+					islower_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -495,13 +496,13 @@ pg_wc_isgraph(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswgraph_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswgraph_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isgraph_l((unsigned char) c, pg_regex_locale->info.lt));
+					isgraph_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -531,13 +532,13 @@ pg_wc_isprint(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswprint_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswprint_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isprint_l((unsigned char) c, pg_regex_locale->info.lt));
+					isprint_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -567,13 +568,13 @@ pg_wc_ispunct(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswpunct_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					ispunct_l((unsigned char) c, pg_regex_locale->info.lt));
+					ispunct_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -603,13 +604,13 @@ pg_wc_isspace(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswspace_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswspace_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isspace_l((unsigned char) c, pg_regex_locale->info.lt));
+					isspace_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -647,13 +648,13 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return towupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
+				return toupper_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
@@ -691,13 +692,13 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return towlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
+				return tolower_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index a4b524ea3a..f2fed8a2d8 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -88,6 +88,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "varatt.h"
 
 
@@ -1577,7 +1578,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 	status = U_ZERO_ERROR;
 	len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-					mylocale->info.icu.locale, &status);
+					mylocale->ctype, &status);
 	if (status == U_BUFFER_OVERFLOW_ERROR)
 	{
 		/* try again with adjusted length */
@@ -1585,7 +1586,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 		status = U_ZERO_ERROR;
 		len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-						mylocale->info.icu.locale, &status);
+						mylocale->ctype, &status);
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
@@ -1698,7 +1699,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towlower(workspace[curr_char]);
@@ -1731,7 +1732,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1820,7 +1821,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towupper(workspace[curr_char]);
@@ -1853,7 +1854,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.lt);
+						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1945,10 +1946,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
@@ -1990,10 +1991,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.lt);
-						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.lt);
+							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
 #endif
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index fc6cb7f5b7..6fdfa2cebd 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -24,6 +24,7 @@
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "varatt.h"
 
 
@@ -97,7 +98,7 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 		return pg_ascii_tolower(c);
 #ifdef HAVE_LOCALE_T
 	else if (locale)
-		return tolower_l(c, locale->info.lt);
+		return tolower_l(c, locale->info.libc.lt);
 #endif
 	else
 		return pg_tolower(c);
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 9b603d42f3..43150741c8 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -52,6 +52,7 @@
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/selfuncs.h"
 #include "utils/varlena.h"
 
@@ -1511,7 +1512,7 @@ pattern_char_isalpha(char c, bool is_multibyte,
 			(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
 #ifdef HAVE_LOCALE_T
 	else if (locale && locale->provider == COLLPROVIDER_LIBC)
-		return isalpha_l((unsigned char) c, locale->info.lt);
+		return isalpha_l((unsigned char) c, locale->info.libc.lt);
 #endif
 	else
 		return isalpha((unsigned char) c);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index f40128c597..c9e63f5ed4 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -65,6 +65,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/syscache.h"
 
 #ifdef USE_ICU
@@ -131,6 +132,11 @@ static HTAB *collation_cache = NULL;
 static char *IsoLocaleName(const char *);
 #endif
 
+/*
+ * Database default locale.
+ */
+static pg_locale_t default_locale = NULL;
+
 #ifdef USE_ICU
 /*
  * Converter object for converting between ICU's UChar strings and C strings
@@ -1336,7 +1342,7 @@ lc_collate_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1389,7 +1395,7 @@ lc_ctype_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1420,38 +1426,6 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
-struct pg_locale_struct default_locale;
-
-void
-make_icu_collator(const char *iculocstr,
-				  struct pg_locale_struct *resultp)
-{
-#ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
-
-	/* We will leak this string if the caller errors later :-( */
-	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
-	resultp->info.icu.ucol = collator;
-#else							/* not USE_ICU */
-	/* could get here if a collation was created by a build with ICU */
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif							/* not USE_ICU */
-}
-
-
 /* simple subroutine for reporting errors from newlocale() */
 #ifdef HAVE_LOCALE_T
 static void
@@ -1485,6 +1459,163 @@ report_newlocale_failure(const char *localename)
 #endif							/* HAVE_LOCALE_T */
 
 
+/*
+ * Construct a new pg_locale_t object.
+ *
+ * Passing NULL for the version is allowed; and even if it is specified, the
+ * result may or may not have an exactly matching version. Other parameters
+ * are required.
+ *
+ * Ordinarily, collate and ctype should be the same. If the provider is ICU,
+ * this is a requirement, and the 'collate' and 'ctype' arguments should both
+ * come from colliculocale (or daticulocale). If the provider is libc, the
+ * arguments should come from collcollate and collctype (or datcollate and
+ * datctype), respectively.
+ *
+ * Structures are allocated in TopMemoryContext, and the libc locale_t or
+ * UCollator is not allocated in any memory context, so the result is
+ * effectively permanent.
+ */
+static pg_locale_t
+pg_newlocale(char provider, bool deterministic, const char *collate,
+			 const char *ctype, const char *version)
+{
+	pg_locale_t result = MemoryContextAlloc(TopMemoryContext,
+											sizeof(struct pg_locale_struct));
+
+	/*
+	 * If COLLPROVIDER_DEFAULT, caller should use default_locale or NULL
+	 * instead.
+	 */
+	Assert(provider != COLLPROVIDER_DEFAULT);
+
+	result->provider = provider;
+	result->deterministic = deterministic;
+	result->collate = MemoryContextStrdup(TopMemoryContext, collate);
+	result->ctype = MemoryContextStrdup(TopMemoryContext, ctype);
+
+	if (provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		locale_t        loc;
+
+		/* newlocale's result may be leaked if we encounter an error */
+
+		if (strcmp(collate, ctype) == 0)
+		{
+			/* Normal case where they're the same */
+			errno = 0;
+#ifndef WIN32
+			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+							NULL);
+#else
+			loc = _create_locale(LC_ALL, collate);
+#endif
+			if (!loc)
+				report_newlocale_failure(collate);
+		}
+		else
+		{
+#ifndef WIN32
+			/* We need two newlocale() steps */
+			locale_t	loc1;
+
+			errno = 0;
+			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			if (!loc1)
+				report_newlocale_failure(collate);
+			errno = 0;
+			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			if (!loc)
+				report_newlocale_failure(ctype);
+#else
+
+			/*
+			 * XXX The _create_locale() API doesn't appear to support
+			 * this. Could perhaps be worked around by changing
+			 * pg_locale_t to contain two separate fields.
+			 */
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("collations with different collate and ctype values are not supported on this platform")));
+#endif
+		}
+
+		result->info.libc.lt = loc;
+#else							/* not HAVE_LOCALE_T */
+		/* platform that doesn't support locale_t */
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("collation provider LIBC is not supported on this platform")));
+#endif							/* not HAVE_LOCALE_T */
+	}
+#ifdef USE_ICU
+	else if (provider == COLLPROVIDER_ICU)
+	{
+		UCollator  *collator;
+		UErrorCode	status;
+
+		/* collator may be leaked if we encounter an error */
+
+		status = U_ZERO_ERROR;
+		collator = ucol_open(collate, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							collate, u_errorName(status))));
+
+		if (U_ICU_VERSION_MAJOR_NUM < 54)
+			icu_set_collation_attributes(collator, collate);
+
+		result->info.icu.ucol = collator;
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", provider);
+
+	return result;
+}
+
+/*
+ * Accessor so that callers don't need to include pg_locale_internal.h.
+ */
+bool
+pg_locale_deterministic(pg_locale_t locale)
+{
+	if (locale == NULL)
+		return true;
+	else
+		return locale->deterministic;
+}
+
+/*
+ * Initialize default database locale.
+ */
+void
+init_default_locale(char provider, const char *collate, const char *ctype,
+					const char *iculocale, const char *version)
+{
+	/*
+	 * For the purposes of pg_locale_t, if the provider is ICU, we use
+	 * iculocale for both collate and ctype.
+	 */
+	if (provider == COLLPROVIDER_ICU)
+	{
+		collate = iculocale;
+		ctype = iculocale;
+	}
+	else
+		Assert(iculocale == NULL);
+
+	/*
+	 * Default locale is currently always deterministic.  Nondeterministic
+	 * locales currently don't support pattern matching, which would break a
+	 * lot of things if applied globally.
+	 */
+	default_locale = pg_newlocale(provider, true, collate, ctype, version);
+}
+
 /*
  * Create a locale_t from a collation OID.  Results are cached for the
  * lifetime of the backend.  Thus, do not free the result with freelocale().
@@ -1509,8 +1640,8 @@ pg_newlocale_from_collation(Oid collid)
 
 	if (collid == DEFAULT_COLLATION_OID)
 	{
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return &default_locale;
+		if (default_locale->provider == COLLPROVIDER_ICU)
+			return default_locale;
 		else
 			return (pg_locale_t) 0;
 	}
@@ -1522,107 +1653,64 @@ pg_newlocale_from_collation(Oid collid)
 		/* We haven't computed this yet in this session, so do it */
 		HeapTuple	tp;
 		Form_pg_collation collform;
-		struct pg_locale_struct result;
-		pg_locale_t resultp;
+		pg_locale_t locale;
 		Datum		datum;
 		bool		isnull;
+		char	   *collate;
+		char	   *ctype;
+		char	   *collversionstr;
 
 		tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collid));
 		if (!HeapTupleIsValid(tp))
 			elog(ERROR, "cache lookup failed for collation %u", collid);
 		collform = (Form_pg_collation) GETSTRUCT(tp);
 
-		/* We'll fill in the result struct locally before allocating memory */
-		memset(&result, 0, sizeof(result));
-		result.provider = collform->collprovider;
-		result.deterministic = collform->collisdeterministic;
+		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
+								&isnull);
+		if (!isnull)
+			collversionstr = TextDatumGetCString(datum);
+		else
+			collversionstr = NULL;
 
 		if (collform->collprovider == COLLPROVIDER_LIBC)
 		{
-#ifdef HAVE_LOCALE_T
-			const char *collcollate;
-			const char *collctype pg_attribute_unused();
-			locale_t	loc;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate,
+									&isnull);
 			Assert(!isnull);
-			collcollate = TextDatumGetCString(datum);
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype, &isnull);
+			collate = TextDatumGetCString(datum);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype,
+									&isnull);
 			Assert(!isnull);
-			collctype = TextDatumGetCString(datum);
-
-			if (strcmp(collcollate, collctype) == 0)
-			{
-				/* Normal case where they're the same */
-				errno = 0;
-#ifndef WIN32
-				loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collcollate,
-								NULL);
-#else
-				loc = _create_locale(LC_ALL, collcollate);
-#endif
-				if (!loc)
-					report_newlocale_failure(collcollate);
-			}
-			else
-			{
-#ifndef WIN32
-				/* We need two newlocale() steps */
-				locale_t	loc1;
-
-				errno = 0;
-				loc1 = newlocale(LC_COLLATE_MASK, collcollate, NULL);
-				if (!loc1)
-					report_newlocale_failure(collcollate);
-				errno = 0;
-				loc = newlocale(LC_CTYPE_MASK, collctype, loc1);
-				if (!loc)
-					report_newlocale_failure(collctype);
-#else
-
-				/*
-				 * XXX The _create_locale() API doesn't appear to support
-				 * this. Could perhaps be worked around by changing
-				 * pg_locale_t to contain two separate fields.
-				 */
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("collations with different collate and ctype values are not supported on this platform")));
-#endif
-			}
-
-			result.info.lt = loc;
-#else							/* not HAVE_LOCALE_T */
-			/* platform that doesn't support locale_t */
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("collation provider LIBC is not supported on this platform")));
-#endif							/* not HAVE_LOCALE_T */
+			ctype = TextDatumGetCString(datum);
 		}
+#ifdef USE_ICU
 		else if (collform->collprovider == COLLPROVIDER_ICU)
 		{
-			const char *iculocstr;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale,
+									&isnull);
 			Assert(!isnull);
-			iculocstr = TextDatumGetCString(datum);
-			make_icu_collator(iculocstr, &result);
+			collate = TextDatumGetCString(datum);
+
+			/* for ICU, collate and ctype are both set from iculocale */
+			ctype = collate;
 		}
+#endif
+		else
+			/* shouldn't happen */
+			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
-								&isnull);
-		if (!isnull)
+		locale = pg_newlocale(collform->collprovider,
+							  collform->collisdeterministic,
+							  collate, ctype, collversionstr);
+
+		ReleaseSysCache(tp);
+
+		if (collversionstr != NULL)
 		{
 			char	   *actual_versionstr;
-			char	   *collversionstr;
 
-			collversionstr = TextDatumGetCString(datum);
-
-			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
-			Assert(!isnull);
+			actual_versionstr = get_collation_actual_version(collform->collprovider, collate);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
 			if (!actual_versionstr)
 			{
 				/*
@@ -1649,13 +1737,7 @@ pg_newlocale_from_collation(Oid collid)
 															NameStr(collform->collname)))));
 		}
 
-		ReleaseSysCache(tp);
-
-		/* We'll keep the pg_locale_t structures in TopMemoryContext */
-		resultp = MemoryContextAlloc(TopMemoryContext, sizeof(*resultp));
-		*resultp = result;
-
-		cache_entry->locale = resultp;
+		cache_entry->locale = locale;
 	}
 
 	return cache_entry->locale;
@@ -1815,7 +1897,7 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
 	else
 #endif
 		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
@@ -1858,7 +1940,7 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.lt);
+		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
@@ -2111,7 +2193,7 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 #ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.lt);
+		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
 	else
 #endif
 		return strxfrm(dest, src, destsize);
@@ -2717,8 +2799,8 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
+	UCollator	*collator;
+	UErrorCode   status;
 
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
@@ -2792,10 +2874,10 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.lt);
+		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.lt);
+		locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 		result = wcstombs(to, from, tolen);
 
@@ -2869,10 +2951,10 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.lt);
+			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.lt);
+			locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 			result = mbstowcs(to, str, tolen);
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 5bbba94687..8dc1faaaf2 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -762,7 +762,7 @@ bpchareq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -807,7 +807,7 @@ bpcharne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -1015,33 +1015,25 @@ hashbpchar(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) keydata, keylen);
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any((uint8_t *) buf, bsize);
+		result = hash_any((uint8_t *) buf, bsize);
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -1072,35 +1064,27 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) keydata, keylen,
 								   PG_GETARG_INT64(1));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 26333c1659..59531cab06 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1205,7 +1205,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -1603,7 +1603,7 @@ texteq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1662,7 +1662,7 @@ textne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1776,7 +1776,7 @@ text_starts_with(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..c26dfb6104 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -317,6 +317,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	char	   *collate;
 	char	   *ctype;
 	char	   *iculocale;
+	char	   *collversionstr;
 
 	/* Fetch our pg_database row normally, via syscache */
 	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
@@ -424,35 +425,33 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_daticulocale, &isnull);
 		Assert(!isnull);
 		iculocale = TextDatumGetCString(datum);
-		make_icu_collator(iculocale, &default_locale);
 	}
 	else
 		iculocale = NULL;
 
-	default_locale.provider = dbform->datlocprovider;
+	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
+							&isnull);
+	if (!isnull)
+		collversionstr = TextDatumGetCString(datum);
+	else
+		collversionstr = NULL;
 
-	/*
-	 * Default locale is currently always deterministic.  Nondeterministic
-	 * locales currently don't support pattern matching, which would break a
-	 * lot of things if applied globally.
-	 */
-	default_locale.deterministic = true;
+	init_default_locale(dbform->datlocprovider, collate, ctype, iculocale,
+						collversionstr);
 
 	/*
 	 * Check collation version.  See similar code in
 	 * pg_newlocale_from_collation().  Note that here we warn instead of error
 	 * in any case, so that we don't prevent connecting.
 	 */
-	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
-							&isnull);
-	if (!isnull)
+	if (collversionstr != NULL)
 	{
 		char	   *actual_versionstr;
-		char	   *collversionstr;
 
-		collversionstr = TextDatumGetCString(datum);
+		actual_versionstr = get_collation_actual_version(
+			dbform->datlocprovider,
+			dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
@@ -470,6 +469,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 							 "or build PostgreSQL with the right library version.",
 							 quote_identifier(name))));
 	}
+	else
+		collversionstr = NULL;
 
 	/* Make the locale settings visible as GUC variables, too */
 	SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index c08e6ce07f..7be42fc21e 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,22 +15,6 @@
 #if defined(LOCALE_T_IN_XLOCALE) || defined(WCSTOMBS_L_IN_XLOCALE)
 #include <xlocale.h>
 #endif
-#ifdef USE_ICU
-#include <unicode/ucol.h>
-#endif
-
-#ifdef USE_ICU
-/*
- * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
- * (see
- * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
- */
-#if U_ICU_VERSION_MAJOR_NUM >= 53
-#define HAVE_UCOL_STRCOLLUTF8 1
-#else
-#undef HAVE_UCOL_STRCOLLUTF8
-#endif
-#endif
 
 /* use for libc locale names */
 #define LOCALE_NAME_BUFLEN 128
@@ -65,39 +49,12 @@ extern struct lconv *PGLC_localeconv(void);
 extern void cache_locale_time(void);
 
 
-/*
- * We define our own wrapper around locale_t so we can keep the same
- * function signatures for all builds, while not having to create a
- * fake version of the standard type locale_t in the global namespace.
- * pg_locale_t is occasionally checked for truth, so make it a pointer.
- */
-struct pg_locale_struct
-{
-	char		provider;
-	bool		deterministic;
-	union
-	{
-#ifdef HAVE_LOCALE_T
-		locale_t	lt;
-#endif
-#ifdef USE_ICU
-		struct
-		{
-			const char *locale;
-			UCollator  *ucol;
-		}			icu;
-#endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
-	}			info;
-};
-
 typedef struct pg_locale_struct *pg_locale_t;
 
-extern PGDLLIMPORT struct pg_locale_struct default_locale;
-
-extern void make_icu_collator(const char *iculocstr,
-							  struct pg_locale_struct *resultp);
-
+extern void init_default_locale(char provider, const char *collate,
+								const char *ctype, const char *iculocale,
+								const char *version);
+extern bool pg_locale_deterministic(pg_locale_t locale);
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
@@ -115,10 +72,6 @@ extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
 extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
-#ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
 extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
new file mode 100644
index 0000000000..33465ad92d
--- /dev/null
+++ b/src/include/utils/pg_locale_internal.h
@@ -0,0 +1,68 @@
+/*-----------------------------------------------------------------------
+ *
+ * PostgreSQL locale utilities
+ *
+ * src/include/utils/pg_locale_internal.h
+ *
+ * Copyright (c) 2002-2022, PostgreSQL Global Development Group
+ *
+ *-----------------------------------------------------------------------
+ */
+
+
+#ifndef _PG_LOCALE_INTERNAL_
+#define _PG_LOCALE_INTERNAL_
+
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
+
+#ifdef USE_ICU
+/*
+ * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
+ * (see
+ * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
+ */
+#if U_ICU_VERSION_MAJOR_NUM >= 53
+#define HAVE_UCOL_STRCOLLUTF8 1
+#else
+#undef HAVE_UCOL_STRCOLLUTF8
+#endif
+#endif
+
+/*
+ * We define our own wrapper around locale_t so we can keep the same
+ * function signatures for all builds, while not having to create a
+ * fake version of the standard type locale_t in the global namespace.
+ * pg_locale_t is occasionally checked for truth, so make it a pointer.
+ */
+struct pg_locale_struct
+{
+	char		provider;
+	bool		deterministic;
+	char	   *collate;
+	char	   *ctype;
+	union
+	{
+#ifdef HAVE_LOCALE_T
+		struct
+		{
+			locale_t	lt;
+		}			libc;
+#endif
+#ifdef USE_ICU
+		struct
+		{
+			UCollator	*ucol;
+		}			icu;
+#endif
+		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
+	}			info;
+};
+
+#ifdef USE_ICU
+extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+#endif
+
+#endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1
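
A side note on the libc branch of pg_newlocale() in the patch above: the
two-step newlocale() composition can be tried outside the backend. The
following is a minimal sketch, not code from the patch. make_split_locale
is a hypothetical name, and the sketch assumes a POSIX.1-2008 system where
newlocale() and freelocale() are declared in <locale.h> (some platforms
need <xlocale.h> instead). Where the patch calls
report_newlocale_failure() and notes that the locale may be leaked, the
sketch simply frees the intermediate object if the second step fails.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): build one locale_t whose
 * LC_COLLATE and LC_CTYPE categories may come from different locale names.
 * Returns (locale_t) 0 on failure.
 */
static locale_t
make_split_locale(const char *collate, const char *ctype)
{
	locale_t	base;
	locale_t	loc;

	assert(collate != NULL && ctype != NULL);

	/* Common case: one name covers both categories, one call is enough. */
	if (strcmp(collate, ctype) == 0)
		return newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
						 (locale_t) 0);

	/* Step 1: a locale object carrying only the collation category. */
	base = newlocale(LC_COLLATE_MASK, collate, (locale_t) 0);
	if (base == (locale_t) 0)
		return (locale_t) 0;

	/*
	 * Step 2: layer the ctype category on top of it.  On success the
	 * returned object replaces 'base'; on failure 'base' is still valid
	 * and still ours to free.
	 */
	loc = newlocale(LC_CTYPE_MASK, ctype, base);
	if (loc == (locale_t) 0)
		freelocale(base);

	return loc;
}
```

Calling make_split_locale("C", "POSIX") exercises the two-step path on any
system, since both names are always available but differ as strings.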

Attachment: v6-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch (text/x-patch; charset=UTF-8)

From 051afe2815e3b0f021540c64aa544ae4b6e3845c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:41:38 -0800
Subject: [PATCH v6 2/7] Add pg_strxfrm() and pg_strxfrm_prefix().

Callers with a NUL-terminated string should call pg_strxfrm() or
pg_strxfrm_prefix(); callers with a string and length should call
pg_strnxfrm() or pg_strnxfrm_prefix().

Also remove the TRUST_STRXFRM define and replace it with a developer GUC
for easier testing.
---
 src/backend/access/hash/hashfunc.c  |  45 ++--
 src/backend/utils/adt/pg_locale.c   | 381 ++++++++++++++++++++++++++++
 src/backend/utils/adt/varchar.c     |  41 ++-
 src/backend/utils/adt/varlena.c     | 142 +++--------
 src/backend/utils/misc/guc_tables.c |  11 +
 src/include/utils/pg_locale.h       |  11 +
 6 files changed, 481 insertions(+), 150 deletions(-)
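
The hash functions changed below all use the same two-pass idiom: ask for
the required size with a NULL destination, allocate, then transform and
check that the second call reports the same size. The sketch below shows
that idiom with standard C strxfrm() as a stand-in. It is not code from
the patch: xfrm_dup is a hypothetical name, and strxfrm() differs from
pg_strnxfrm() in that it takes a NUL-terminated source and returns a
length that excludes the terminator, hence the "+ 1" when allocating.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc'd transformed copy of 'src' and
 * store its length (excluding the terminating NUL) in *len.
 * Returns NULL on allocation failure or if the two passes disagree.
 */
static char *
xfrm_dup(const char *src, size_t *len)
{
	size_t		needed;
	size_t		written;
	char	   *buf;

	assert(src != NULL && len != NULL);

	/* Pass 1: with a size of 0 the destination may be NULL. */
	needed = strxfrm(NULL, src, 0);

	buf = malloc(needed + 1);
	if (buf == NULL)
		return NULL;

	/* Pass 2: do the transformation into the right-sized buffer. */
	written = strxfrm(buf, src, needed + 1);
	if (written != needed)
	{
		/* Same sanity check as the rsize != bsize test in the patch. */
		free(buf);
		return NULL;
	}

	*len = written;
	return buf;
}
```

Comparing two transformed strings with strcmp() gives the same ordering as
strcoll() on the originals, which is why hashing the transformed bytes is
consistent with collation-aware equality.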

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index e3e40d6c21..c0ed995919 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -292,21 +292,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -350,21 +348,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 5ff1dcb1ba..f40128c597 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -94,6 +94,9 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool trust_strxfrm = false;
+
 /*
  * lc_time localization cache.
  *
@@ -2099,6 +2102,384 @@ pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
 }
 
 
+static size_t
+pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
+				pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef TRUST_STXFRM
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		return strxfrm_l(dest, src, destsize, locale->info.lt);
+	else
+#endif
+		return strxfrm(dest, src, destsize);
+#else
+	/* shouldn't happen */
+	elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+}
+
+static size_t
+pg_strnxfrm_libc(char *dest, const char *src, size_t srclen, size_t destsize,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	 = sbuf;
+	size_t	 bufsize = srclen + 1;
+	size_t	 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+	if (bufsize > TEXTBUFLEN)
+		buf = palloc(bufsize);
+
+	/* nul-terminate arguments */
+	memcpy(buf, src, srclen);
+	buf[srclen] = '\0';
+
+	result = pg_strxfrm_libc(dest, buf, destsize, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_libc(char *dest, const char *src, size_t destsize,
+					   pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strxfrm_prefix()",
+		 locale->provider);
+}
+
+static size_t
+pg_strnxfrm_prefix_libc(char *dest, const char *src, size_t srclen,
+						size_t destsize, pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strnxfrm_prefix()",
+		 locale->provider);
+}
+
+#ifdef USE_ICU
+
+static size_t
+pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
+				pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	= sbuf;
+	UChar	*uchar;
+	int32_t	 ulen;
+	size_t   uchar_bsize;
+	Size	 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
+								   uchar, ulen,
+								   (uint8_t *) dest, destsize);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result_bsize;
+}
+
+static size_t
+pg_strxfrm_icu(char *dest, const char *src, size_t destsize,
+			   pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_icu(dest, src, -1, destsize, locale);
+}
+
+static size_t
+pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
+							   size_t destsize, pg_locale_t locale)
+{
+	char			 sbuf[TEXTBUFLEN];
+	char			*buf   = sbuf;
+	UCharIterator	 iter;
+	uint32_t		 state[2];
+	UErrorCode		 status;
+	int32_t			 ulen  = -1;
+	UChar			*uchar = NULL;
+	size_t			 uchar_bsize;
+	Size			 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	uiter_setString(&iter, uchar, ulen);
+	state[0] = state[1] = 0;	/* won't need that again */
+	status = U_ZERO_ERROR;
+	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
+										&iter,
+										state,
+										(uint8_t *) dest,
+										destsize,
+										&status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("sort key generation failed: %s",
+						u_errorName(status))));
+
+	return result_bsize;
+}
+
+static size_t
+pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
+					   size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UCharIterator iter;
+		uint32_t	state[2];
+		UErrorCode	status;
+
+		uiter_setUTF8(&iter, src, srclen);
+		state[0] = state[1] = 0;	/* won't need that again */
+		status = U_ZERO_ERROR;
+		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
+									  &iter,
+									  state,
+									  (uint8_t *) dest,
+									  destsize,
+									  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("sort key generation failed: %s",
+							u_errorName(status))));
+	}
+	else
+		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
+												locale);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
+					  pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_prefix_icu(dest, src, -1, destsize, locale);
+}
+
+#endif
+
+/*
+ * Return true if the collation provider supports pg_strxfrm() and
+ * pg_strnxfrm(); otherwise false.
+ *
+ * Unfortunately, it seems that strxfrm() for non-C collations is broken on
+ * many common platforms; testing of multiple versions of glibc reveals that,
+ * for many locales, strcoll() and strxfrm() do not return consistent
+ * results. While no other libc other than Cygwin has so far been shown to
+ * have a problem, we take the conservative course of action for right now and
+ * disable this categorically.  (Users who are certain this isn't a problem on
+ * their system can set the developer GUC "trust_strxfrm".)
+ *
+ * No similar problem is known for the ICU provider.
+ */
+bool
+pg_strxfrm_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return trust_strxfrm;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ */
+size_t
+pg_strxfrm(char *dest, const char *src, size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm() instead.
+ */
+size_t
+pg_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen,
+			pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * Return true if the collation provider supports pg_strxfrm_prefix() and
+ * pg_strnxfrm_prefix(); otherwise false.
+ */
+bool
+pg_strxfrm_prefix_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return false;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ */
+size_t
+pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+				  pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_prefix_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_prefix_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' need not be nul-terminated; 'srclen' gives its length.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm_prefix() instead.
+ */
+size_t
+pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+				   size_t srclen, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_prefix_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_prefix_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
 #ifdef USE_ICU
 static void
 init_icu_converter(void)
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 8ddbae8f51..5bbba94687 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1024,21 +1024,17 @@ hashbpchar(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -1086,21 +1082,18 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 8f64a3c905..26333c1659 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1889,20 +1889,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
@@ -1918,13 +1904,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 
 	/*
 	 * Unfortunately, it seems that abbreviation for non-C collations is
-	 * broken on many common platforms; testing of multiple versions of glibc
-	 * reveals that, for many locales, strcoll() and strxfrm() do not return
-	 * consistent results, which is fatal to this optimization.  While no
-	 * other libc other than Cygwin has so far been shown to have a problem,
-	 * we take the conservative course of action for right now and disable
-	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * broken on many common platforms; see pg_strxfrm_enabled().
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -1937,10 +1917,8 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!collate_c && !pg_strxfrm_enabled(locale))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
@@ -2229,6 +2207,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 static Datum
 varstr_abbrev_convert(Datum original, SortSupport ssup)
 {
+	const size_t max_prefix_bytes = sizeof(Datum);
 	VarStringSortSupport *sss = (VarStringSortSupport *) ssup->ssup_extra;
 	VarString  *authoritative = DatumGetVarStringPP(original);
 	char	   *authoritative_data = VARDATA_ANY(authoritative);
@@ -2241,7 +2220,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
-	memset(pres, 0, sizeof(Datum));
+	memset(pres, 0, max_prefix_bytes);
 	len = VARSIZE_ANY_EXHDR(authoritative);
 
 	/* Get number of bytes, ignoring trailing spaces */
@@ -2276,14 +2255,10 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 	 * thing: explicitly consider string length.
 	 */
 	if (sss->collate_c)
-		memcpy(pres, authoritative_data, Min(len, sizeof(Datum)));
+		memcpy(pres, authoritative_data, Min(len, max_prefix_bytes));
 	else
 	{
 		Size		bsize;
-#ifdef USE_ICU
-		int32_t		ulen = -1;
-		UChar	   *uchar = NULL;
-#endif
 
 		/*
 		 * We're not using the C collation, so fall back on strxfrm or ICU
@@ -2301,7 +2276,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		if (sss->last_len1 == len && sss->cache_blob &&
 			memcmp(sss->buf1, authoritative_data, len) == 0)
 		{
-			memcpy(pres, sss->buf2, Min(sizeof(Datum), sss->last_len2));
+			memcpy(pres, sss->buf2, Min(max_prefix_bytes, sss->last_len2));
 			/* No change affecting cardinality, so no hashing required */
 			goto done;
 		}
@@ -2309,81 +2284,49 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		memcpy(sss->buf1, authoritative_data, len);
 
 		/*
-		 * Just like strcoll(), strxfrm() expects a NUL-terminated string. Not
-		 * necessary for ICU, but doesn't hurt.
+		 * pg_strxfrm() and pg_strxfrm_prefix expect NUL-terminated
+		 * strings.
 		 */
 		sss->buf1[len] = '\0';
 		sss->last_len1 = len;
 
-#ifdef USE_ICU
-		/* When using ICU and not UTF8, convert string to UChar. */
-		if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU &&
-			GetDatabaseEncoding() != PG_UTF8)
-			ulen = icu_to_uchar(&uchar, sss->buf1, len);
-#endif
-
-		/*
-		 * Loop: Call strxfrm() or ucol_getSortKey(), possibly enlarge buffer,
-		 * and try again.  Both of these functions have the result buffer
-		 * content undefined if the result did not fit, so we need to retry
-		 * until everything fits, even though we only need the first few bytes
-		 * in the end.  When using ucol_nextSortKeyPart(), however, we only
-		 * ask for as many bytes as we actually need.
-		 */
-		for (;;)
+		if (pg_strxfrm_prefix_enabled(sss->locale))
 		{
-#ifdef USE_ICU
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+			if (sss->buflen2 < max_prefix_bytes)
 			{
-				/*
-				 * When using UTF8, use the iteration interface so we only
-				 * need to produce as many bytes as we actually need.
-				 */
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UCharIterator iter;
-					uint32_t	state[2];
-					UErrorCode	status;
-
-					uiter_setUTF8(&iter, sss->buf1, len);
-					state[0] = state[1] = 0;	/* won't need that again */
-					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
-				}
-				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+				sss->buflen2 = Max(max_prefix_bytes,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
 			}
-			else
-#endif
-#ifdef HAVE_LOCALE_T
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_LIBC)
-				bsize = strxfrm_l(sss->buf2, sss->buf1,
-								  sss->buflen2, sss->locale->info.lt);
-			else
-#endif
-				bsize = strxfrm(sss->buf2, sss->buf1, sss->buflen2);
-
-			sss->last_len2 = bsize;
-			if (bsize < sss->buflen2)
-				break;
 
+			bsize = pg_strxfrm_prefix(sss->buf2, sss->buf1,
+									  max_prefix_bytes, sss->locale);
+		}
+		else
+		{
 			/*
-			 * Grow buffer and retry.
+			 * Loop: Call pg_strxfrm(), possibly enlarge buffer, and try
+			 * again.  The pg_strxfrm() function leaves the result buffer
+			 * content undefined if the result did not fit, so we need to
+			 * retry until everything fits, even though we only need the first
+			 * few bytes in the end.
 			 */
-			sss->buflen2 = Max(bsize + 1,
-							   Min(sss->buflen2 * 2, MaxAllocSize));
-			sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			for (;;)
+			{
+				bsize = pg_strxfrm(sss->buf2, sss->buf1, sss->buflen2,
+								   sss->locale);
+
+				sss->last_len2 = bsize;
+				if (bsize < sss->buflen2)
+					break;
+
+				/*
+				 * Grow buffer and retry.
+				 */
+				sss->buflen2 = Max(bsize + 1,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			}
 		}
 
 		/*
@@ -2395,12 +2338,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		 * (Actually, even if there were NUL bytes in the blob it would be
 		 * okay.  See remarks on bytea case above.)
 		 */
-		memcpy(pres, sss->buf2, Min(sizeof(Datum), bsize));
-
-#ifdef USE_ICU
-		if (uchar)
-			pfree(uchar);
-#endif
+		memcpy(pres, sss->buf2, Min(max_prefix_bytes, bsize));
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..19dc95ea96 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1954,6 +1954,17 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"trust_strxfrm", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Allow use of strxfrm() for abbreviated keys optimization for libc provider."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&trust_strxfrm,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
 			gettext_noop("Whether to continue running after a failure to sync data files."),
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 40c0d1e3a0..c08e6ce07f 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -40,6 +40,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT bool trust_strxfrm;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -103,6 +104,16 @@ extern char *get_collation_actual_version(char collprovider, const char *collcol
 extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
 extern int pg_strncoll(const char *arg1, size_t len1,
 					   const char *arg2, size_t len2, pg_locale_t locale);
+extern bool pg_strxfrm_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
+						 pg_locale_t locale);
+extern size_t pg_strnxfrm(char *dest, size_t destsize, const char *src,
+						  size_t srclen, pg_locale_t locale);
+extern bool pg_strxfrm_prefix_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+								pg_locale_t locale);
+extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+								 size_t srclen, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1

#11

Jeff Davis

pgsql@j-davis.com

about 3 years ago

In reply to: Peter Eisentraut (#8)

Re: Rework of collation code, extensibility

On Wed, 2023-01-11 at 15:08 +0100, Peter Eisentraut wrote:

I think the refactoring that you proposed in the thread "Refactor to
introduce pg_strcoll()." was on a sensible track. Maybe we should try
to get that done.

Those should be patches 0001-0003 in this thread (now at v6), which are
all pure refactoring.

Let's consider those patches the topic of this thread and I'll move
0004-0007 back to the multi-lib ICU thread on the next revision.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#12

Peter Geoghegan

pg@bowt.ie

almost 3 years ago

In reply to: Jeff Davis (#10)

Re: Rework of collation code, extensibility

On Wed, Jan 11, 2023 at 3:44 PM Jeff Davis <pgsql@j-davis.com> wrote:

Attached trivial rebase as v6.

Some review comments for this v6.

Comments on 0001-*:

* I think that 0002-* can be squashed with 0001-*, since there isn't
any functional reason why you'd want to commit the strcoll() and
strxfrm() changes separately.

Sometimes it can be useful to break things up, despite the fact that
it couldn't possibly make sense to commit just one of the resulting
patches on its own. However, I don't think that that's appropriate
here. There is no apparent conceptual boundary that you're
highlighting by splitting things up like this. strxfrm() and strcoll()
are defined in terms of each other -- they're siblings, joined at the
hip -- so this seems kinda jarring.

* Your commit messages for 0001 (and other patches) don't set things up
by describing what the point is, and what the work anticipates. I
think that they should do that.

You're adding a layer of indirection that's going to set things up for
later patches that add a layer of indirection for version ICU
libraries (and even libc itself), and some of the details only make
sense in that context. This isn't just refactoring work that could
just as easily have happened in some quite different context.

* I'm not sure that pg_strcoll() should be breaking ties itself. We
break ties using strcmp() for historical reasons, but must not do that
for nondeterministic ICU collations, which may be obscured.

That means that pg_strcoll()'s relationship to pg_strxfrm() isn't
the same as the well known strcoll()/strxfrm() relationship. That kind
of makes pg_strcoll() somewhat more than a strcoll() shim, which is
inconsistent. Another concern is that the deterministic collation
handling isn't confined to any one layer, which would have been nice.

Do we need to do things this way? What's it adding?

* varstrfastcmp_locale() is no longer capable of calling
ucol_strcollUTF8() through the shim interface, meaning that it has to
determine string length based on NUL-termination, when in principle it
could just use the known length of the string.

Presumably this might have performance implications. Have you thought
about that?

Some comments on 0002-*:

* I don't see much point in this new varstr_abbrev_convert() variable:

+ const size_t max_prefix_bytes = sizeof(Datum);

varstr_abbrev_convert() is concerned with packing abbreviated key
bytes into Datums, so it's perfectly reasonable to deal with
Datums/sizeof(Datum) directly.

* Having a separate pg_strxfrm_prefix_libc() function just to throw an
error doesn't really add much IMV.

Comments on 0003-*:

I suggest that some of the very trivial functions you have here (such
as pg_locale_deterministic()) be made inline functions.

Comments on 0006-*:

* get_builtin_libc_library() could be indented in a way that would
make it easier to understand.

--
Peter Geoghegan

#13

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Geoghegan (#12)

2 attachment(s)

Re: Rework of collation code, extensibility

On Fri, 2023-01-13 at 11:57 -0800, Peter Geoghegan wrote:

You're adding a layer of indirection that's going to set things up for
later patches that add a layer of indirection for version ICU
libraries (and even libc itself), and some of the details only make
sense in that context. This isn't just refactoring work that could
just as easily have happened in some quite different context.

Right, well put. I have two goals and felt that they merged into one
patchset, but I think that caused more confusion.

The first goal I had was simply that the code was really hard to
understand and work on, and refactoring was justified to improve the
situation.

The second goal, which is somewhat dependent on the first goal, is that
we really need an ability to support multiple ICU libraries, and I
wanted to do some common groundwork that would be needed for any
approach we choose there, and provide some hooks to get us there. You
are right that this goal influenced the first goal.

I attached new patches:

v7-0001: pg_strcoll and pg_strxfrm patches combined, your comments addressed
v7-0002: add pg_locale_internal.h (and other refactoring)

I will post the other patches in the other thread.

That means that pg_strcoll()'s relationship to pg_strxfrm() isn't
the same as the well known strcoll()/strxfrm() relationship.

That's a really good point. I changed tiebreaking to be the caller's
responsibility.
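
A minimal sketch of what caller-side tiebreaking could look like, for
illustration only. It assumes a hypothetical collation routine toy_coll()
that ignores ASCII case (standing in for pg_strcoll()) and a hypothetical
caller compare_text(); neither name comes from the patch. The caller asks
the collation first and falls back to strcmp() only when the collation
reports equality and is deterministic.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a collation that ignores ASCII case. */
static int
toy_coll(const char *a, const char *b)
{
	for (;; a++, b++)
	{
		int			ca = tolower((unsigned char) *a);
		int			cb = tolower((unsigned char) *b);

		if (ca != cb)
			return (ca < cb) ? -1 : 1;
		if (ca == '\0')
			return 0;
	}
}

/*
 * Caller-side comparison: consult the collation first; if it reports
 * equality and the collation is deterministic, break the tie bytewise.
 * A nondeterministic collation must keep the "equal" answer.
 */
static int
compare_text(const char *a, const char *b, bool deterministic)
{
	int			result;

	assert(a != NULL && b != NULL);

	result = toy_coll(a, b);
	if (result == 0 && deterministic)
		result = strcmp(a, b);
	return result;
}
```

With this split, the collation routine stays a plain strcoll()-style shim,
and the decision about determinism lives in one place in the caller.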

* varstrfastcmp_locale() is no longer capable of calling
ucol_strcollUTF8() through the shim interface, meaning that it has to
determine string length based on NUL-termination, when in principle it
could just use the known length of the string.

I think you misread: it still calls ucol_strcollUTF8() when applicable,
which is important because otherwise it would require a conversion to
a UChar string.

ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
basic testing and it doesn't seem like it's slower than using the
length. If passing the length is faster for some reason, it would
complicate the API because we'd need an entry point that's expecting
nul-termination and lengths, which is awkward (as Peter E. pointed
out).
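
For the libc side, the cost of a length-based entry point can be sketched
roughly as below. This is an illustration under stated assumptions, not
code from the patch: strncoll_via_copy() and STACKBUF_LEN are hypothetical
names, and it uses plain strcoll() with malloc() instead of the backend's
palloc(). It follows the same copy-and-terminate pattern that
pg_strnxfrm_libc() uses in the v6 patch, because strcoll() only accepts
NUL-terminated input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define STACKBUF_LEN 1024		/* arbitrary size chosen for this sketch */

/*
 * Compare two (pointer, length) strings with strcoll().  Both arguments
 * are copied into one buffer and NUL-terminated there; the heap is used
 * only when the stack buffer is too small.
 */
static int
strncoll_via_copy(const char *arg1, size_t len1,
				  const char *arg2, size_t len2)
{
	char		sbuf[STACKBUF_LEN];
	char	   *buf = sbuf;
	size_t		bufsize = len1 + 1 + len2 + 1;
	char	   *s1;
	char	   *s2;
	int			result;

	assert(arg1 != NULL && arg2 != NULL);

	if (bufsize > STACKBUF_LEN)
	{
		buf = malloc(bufsize);
		if (buf == NULL)
			abort();			/* a real caller would report the error */
	}

	s1 = buf;
	s2 = buf + len1 + 1;

	/* NUL-terminate private copies of both arguments */
	memcpy(s1, arg1, len1);
	s1[len1] = '\0';
	memcpy(s2, arg2, len2);
	s2[len2] = '\0';

	result = strcoll(s1, s2);

	if (buf != sbuf)
		free(buf);

	return result;
}
```

A caller that already holds NUL-terminated strings can skip the copy
entirely by calling strcoll() directly, which is the reason for keeping
separate NUL-terminated and length-based entry points.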

* I don't see much point in this new varstr_abbrev_convert() variable:

+ const size_t max_prefix_bytes = sizeof(Datum);

varstr_abbrev_convert() is concerned with packing abbreviated key
bytes into Datums, so it's perfectly reasonable to deal with
Datums/sizeof(Datum) directly.

I felt it was a little clearer amongst the other code, to a casual
reader, but I suppose it's a style thing. I will change it if you
insist.
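
For readers unfamiliar with the abbreviated-key code under discussion, here
is a rough standalone sketch of the idea, assuming plain strxfrm() and a
pointer-sized integer in place of a Datum. The function name abbrev_key()
and the initial buffer size are hypothetical. It shows the retry loop
(strxfrm() leaves the buffer contents indeterminate when the result does
not fit) followed by packing the leading bytes into a zero-filled key.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a pointer-sized abbreviated key from the strxfrm() blob of a
 * NUL-terminated string.  Bytes not overwritten by the blob stay zero.
 */
static uintptr_t
abbrev_key(const char *src)
{
	uintptr_t	res = 0;
	size_t		buflen = 32;	/* arbitrary initial size for this sketch */
	char	   *buf;
	size_t		bsize;

	assert(src != NULL);

	buf = malloc(buflen);
	if (buf == NULL)
		abort();

	for (;;)
	{
		bsize = strxfrm(buf, src, buflen);
		if (bsize < buflen)
			break;

		/* Result did not fit: grow the buffer and transform again. */
		buflen = bsize + 1;
		free(buf);
		buf = malloc(buflen);
		if (buf == NULL)
			abort();
	}

	/* Keep only as many leading bytes as fit in the key. */
	memcpy(&res, buf, bsize < sizeof(res) ? bsize : sizeof(res));
	free(buf);

	return res;
}
```

Two keys built this way are meant to be compared with memcmp(); equal keys
are inconclusive, so a real sort still needs the full comparison as a
fallback.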

* Having a separate pg_strxfrm_prefix_libc() function just to throw an
error doesn't really add much IMV.

Removed.

Comments on 0003-*:

I suggest that some of the very trivial functions you have here (such
as pg_locale_deterministic()) be made inline functions.

I'd have to expose the pg_locale_t struct, which didn't seem desirable
to me. Do you think it's enough of a performance concern to be worth
some ugliness there?
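
The tradeoff can be sketched as follows, under the assumption that the header exposes only an opaque pointer type. All demo_* names are hypothetical, and the NULL handling simply mirrors the "(!mylocale || mylocale->deterministic)" test in the patch's varstr_cmp() hunk.

```c
#include <stdbool.h>
#include <stddef.h>

/* What a public header could expose: an opaque type and an accessor. */
typedef struct demo_locale_struct *demo_locale_t;
extern bool demo_locale_deterministic(demo_locale_t locale);

/*
 * Private to the implementation file: the struct layout. A static inline
 * accessor in the header would need this definition to be public as well.
 */
struct demo_locale_struct
{
    char        provider;
    bool        deterministic;
};

/* Out-of-line accessor; costs a function call but keeps the layout hidden. */
bool
demo_locale_deterministic(demo_locale_t locale)
{
    /* NULL stands for the default locale, treated here as deterministic. */
    return locale == NULL || locale->deterministic;
}
```

Making the accessor inline would save the call, at the price of moving the struct definition into the header.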

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v7-0001-Add-pg_strcoll-pg_strxfrm-and-variants.patch (text/x-patch; charset=UTF-8)

From 965212ed108d0bc8884b38a277b23713788876ea Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v7 1/2] Add pg_strcoll(), pg_strxfrm(), and variants.

In preparation for multi-lib ICU, which should be based on a clean
separation of the routines required for collation providers. Also
offers a generally better separation of responsibilities.

Callers with NUL-terminated strings should call pg_strcoll() or
pg_strxfrm(); callers with strings and their length should call the
variants pg_strncoll() or pg_strnxfrm().

Also remove the TRUST_STRXFRM define, and replace with a developer GUC
for easier testing.
---
 src/backend/access/hash/hashfunc.c  |  45 +-
 src/backend/utils/adt/pg_locale.c   | 783 +++++++++++++++++++++++++++-
 src/backend/utils/adt/varchar.c     |  41 +-
 src/backend/utils/adt/varlena.c     | 368 ++-----------
 src/backend/utils/misc/guc_tables.c |  11 +
 src/include/utils/pg_locale.h       |  14 +
 6 files changed, 874 insertions(+), 388 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index e3e40d6c21..c0ed995919 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -292,21 +292,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -350,21 +348,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 059e4fd79f..509ef0784e 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -88,6 +94,9 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool trust_strxfrm = false;
+
 /*
  * lc_time localization cache.
  *
@@ -123,6 +132,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1731,15 +1753,716 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * pg_strncoll_libc_win32_utf8
+ *
+ * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
+ * invoke wcscoll() or wcscoll_l().
+ */
+#ifdef WIN32
+static int
+pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
+							size_t len2, pg_locale_t locale)
+{
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
+	char	   *a1p,
+			   *a2p;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
+	int			r;
+	int			result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+#ifndef WIN32
+	Assert(false);
+#endif
+
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * pg_strcoll_libc
+ *
+ * Call strcoll(), strcoll_l(), wcscoll(), or wcscoll_l() as appropriate for
+ * the given locale, platform, and database encoding. If the locale is NULL,
+ * use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ */
+static int
+pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+#ifdef WIN32
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_libc
+ *
+ * Null-terminate the arguments and call pg_strcoll_libc().
+ */
+static int
+pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	  = sbuf;
+	size_t	 bufsize1 = len1 + 1;
+	size_t	 bufsize2 = len2 + 1;
+	char	*arg1n;
+	char	*arg2n;
+	int		 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef WIN32
+	/* check for this case before doing the work for nul-termination */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+#endif							/* WIN32 */
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	arg1n = buf;
+	arg2n = buf + bufsize1;
+
+	/* nul-terminate arguments */
+	memcpy(arg1n, arg1, len1);
+	arg1n[len1] = '\0';
+	memcpy(arg2n, arg2, len2);
+	arg2n[len2] = '\0';
+
+	result = pg_strcoll_libc(arg1n, arg2n, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
 
 #ifdef USE_ICU
+
 /*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
+ * pg_strncoll_icu_no_utf8
+ *
+ * Convert the arguments from the database encoding to UChar strings, then
+ * call ucol_strcoll().
+ *
+ * When the database encoding is UTF-8, and ICU supports ucol_strcollUTF8(),
+ * caller should call that instead.
  */
-static UConverter *icu_converter = NULL;
+static int
+pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
+						const char *arg2, size_t len2, pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+#endif
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
+
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_icu
+ *
+ * Call ucol_strcollUTF8() or ucol_strcoll() as appropriate for the given
+ * database encoding.
+ *
+ * Arguments must be encoded in the database encoding.
+ */
+static int
+pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				pg_locale_t locale)
+{
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		result = pg_strncoll_icu_no_utf8(arg1, len1, arg2, len2, locale);
+	}
+
+	return result;
+}
+
+/*
+ * pg_strcoll_icu
+ *
+ * Calculate the string lengths and call pg_strncoll_icu().
+ */
+static int
+pg_strcoll_icu(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strncoll_icu(arg1, -1, arg2, -1, locale);
+}
+
+#endif							/* USE_ICU */
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * The caller is responsible for breaking ties if the collation is
+ * deterministic; this maintains consistency with pg_strxfrm(), which cannot
+ * easily account for deterministic collations.
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strcoll_libc(arg1, arg2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strcoll_icu(arg1, arg2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * This function may need to nul-terminate the arguments for libc functions;
+ * so if the caller already has nul-terminated strings, it should call
+ * pg_strcoll() instead.
+ *
+ * The caller is responsible for breaking ties if the collation is
+ * deterministic; this maintains consistency with pg_strnxfrm(), which cannot
+ * easily account for deterministic collations.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strncoll_libc(arg1, len1, arg2, len2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strncoll_icu(arg1, len1, arg2, len2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+
+static size_t
+pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
+				pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef TRUST_STXFRM
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		return strxfrm_l(dest, src, destsize, locale->info.lt);
+	else
+#endif
+		return strxfrm(dest, src, destsize);
+#else
+	/* shouldn't happen */
+	elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+}
+
+static size_t
+pg_strnxfrm_libc(char *dest, const char *src, size_t srclen, size_t destsize,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	 = sbuf;
+	size_t	 bufsize = srclen + 1;
+	size_t	 result;
 
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+	if (bufsize > TEXTBUFLEN)
+		buf = palloc(bufsize);
+
+	/* nul-terminate arguments */
+	memcpy(buf, src, srclen);
+	buf[srclen] = '\0';
+
+	result = pg_strxfrm_libc(dest, buf, destsize, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+static size_t
+pg_strnxfrm_prefix_libc(char *dest, const char *src, size_t srclen,
+						size_t destsize, pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strnxfrm_prefix()",
+		 locale->provider);
+}
+
+#ifdef USE_ICU
+
+static size_t
+pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
+				pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	= sbuf;
+	UChar	*uchar;
+	int32_t	 ulen;
+	size_t   uchar_bsize;
+	Size	 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
+								   uchar, ulen,
+								   (uint8_t *) dest, destsize);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result_bsize;
+}
+
+static size_t
+pg_strxfrm_icu(char *dest, const char *src, size_t destsize,
+			   pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_icu(dest, src, -1, destsize, locale);
+}
+
+static size_t
+pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
+							   size_t destsize, pg_locale_t locale)
+{
+	char			 sbuf[TEXTBUFLEN];
+	char			*buf   = sbuf;
+	UCharIterator	 iter;
+	uint32_t		 state[2];
+	UErrorCode		 status;
+	int32_t			 ulen  = -1;
+	UChar			*uchar = NULL;
+	size_t			 uchar_bsize;
+	Size			 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	uiter_setString(&iter, uchar, ulen);
+	state[0] = state[1] = 0;	/* won't need that again */
+	status = U_ZERO_ERROR;
+	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
+										&iter,
+										state,
+										(uint8_t *) dest,
+										destsize,
+										&status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("sort key generation failed: %s",
+						u_errorName(status))));
+
+	return result_bsize;
+}
+
+static size_t
+pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
+					   size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UCharIterator iter;
+		uint32_t	state[2];
+		UErrorCode	status;
+
+		uiter_setUTF8(&iter, src, srclen);
+		state[0] = state[1] = 0;	/* won't need that again */
+		status = U_ZERO_ERROR;
+		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
+									  &iter,
+									  state,
+									  (uint8_t *) dest,
+									  destsize,
+									  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("sort key generation failed: %s",
+							u_errorName(status))));
+	}
+	else
+		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
+												locale);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
+					  pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_prefix_icu(dest, src, -1, destsize, locale);
+}
+
+#endif
+
+/*
+ * Return true if the collation provider supports pg_strxfrm() and
+ * pg_strnxfrm(); otherwise false.
+ *
+ * Unfortunately, it seems that strxfrm() for non-C collations is broken on
+ * many common platforms; testing of multiple versions of glibc reveals that,
+ * for many locales, strcoll() and strxfrm() do not return consistent
+ * results. While no other libc other than Cygwin has so far been shown to
+ * have a problem, we take the conservative course of action for right now and
+ * disable this categorically.  (Users who are certain this isn't a problem on
+ * their system can set the developer GUC "trust_strxfrm".)
+ *
+ * No similar problem is known for the ICU provider.
+ */
+bool
+pg_strxfrm_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return trust_strxfrm;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ */
+size_t
+pg_strxfrm(char *dest, const char *src, size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm() instead.
+ */
+size_t
+pg_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen,
+			pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * Return true if the collation provider supports pg_strxfrm_prefix() and
+ * pg_strnxfrm_prefix(); otherwise false.
+ */
+bool
+pg_strxfrm_prefix_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return false;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ */
+size_t
+pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+				  pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		elog(ERROR, "collprovider '%c' does not support pg_strxfrm_prefix()",
+			 locale->provider);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_prefix_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' need not be nul-terminated; 'srclen' gives its length.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm_prefix() instead.
+ */
+size_t
+pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+				   size_t srclen, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_prefix_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_prefix_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+#ifdef USE_ICU
 static void
 init_icu_converter(void)
 {
@@ -1767,6 +2490,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	status = U_ZERO_ERROR;
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -1782,26 +2538,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 8ddbae8f51..5bbba94687 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1024,21 +1024,17 @@ hashbpchar(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -1086,21 +1082,18 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 33ffdb013a..5030328f31 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1537,10 +1537,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1557,171 +1553,16 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 
 		/* Break tie if necessary. */
 		if (result == 0 &&
 			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		{
+			result = memcmp(arg1, arg2, Min(len1, len2));
+			if ((result == 0) && (len1 != len2))
+				result = (len1 < len2) ? -1 : 1;
+		}
 	}
 
 	return result;
@@ -2057,20 +1898,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
@@ -2086,13 +1913,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 
 	/*
 	 * Unfortunately, it seems that abbreviation for non-C collations is
-	 * broken on many common platforms; testing of multiple versions of glibc
-	 * reveals that, for many locales, strcoll() and strxfrm() do not return
-	 * consistent results, which is fatal to this optimization.  While no
-	 * other libc other than Cygwin has so far been shown to have a problem,
-	 * we take the conservative course of action for right now and disable
-	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * broken on many common platforms; see pg_strxfrm_enabled().
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -2105,10 +1926,8 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!collate_c && !pg_strxfrm_enabled(locale))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
@@ -2379,60 +2198,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Break tie if necessary. */
 	if (result == 0 &&
@@ -2455,6 +2221,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 static Datum
 varstr_abbrev_convert(Datum original, SortSupport ssup)
 {
+	const size_t max_prefix_bytes = sizeof(Datum);
 	VarStringSortSupport *sss = (VarStringSortSupport *) ssup->ssup_extra;
 	VarString  *authoritative = DatumGetVarStringPP(original);
 	char	   *authoritative_data = VARDATA_ANY(authoritative);
@@ -2467,7 +2234,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
-	memset(pres, 0, sizeof(Datum));
+	memset(pres, 0, max_prefix_bytes);
 	len = VARSIZE_ANY_EXHDR(authoritative);
 
 	/* Get number of bytes, ignoring trailing spaces */
@@ -2502,14 +2269,10 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 	 * thing: explicitly consider string length.
 	 */
 	if (sss->collate_c)
-		memcpy(pres, authoritative_data, Min(len, sizeof(Datum)));
+		memcpy(pres, authoritative_data, Min(len, max_prefix_bytes));
 	else
 	{
 		Size		bsize;
-#ifdef USE_ICU
-		int32_t		ulen = -1;
-		UChar	   *uchar = NULL;
-#endif
 
 		/*
 		 * We're not using the C collation, so fall back on strxfrm or ICU
@@ -2527,7 +2290,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		if (sss->last_len1 == len && sss->cache_blob &&
 			memcmp(sss->buf1, authoritative_data, len) == 0)
 		{
-			memcpy(pres, sss->buf2, Min(sizeof(Datum), sss->last_len2));
+			memcpy(pres, sss->buf2, Min(max_prefix_bytes, sss->last_len2));
 			/* No change affecting cardinality, so no hashing required */
 			goto done;
 		}
@@ -2535,81 +2298,49 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		memcpy(sss->buf1, authoritative_data, len);
 
 		/*
-		 * Just like strcoll(), strxfrm() expects a NUL-terminated string. Not
-		 * necessary for ICU, but doesn't hurt.
+		 * pg_strxfrm() and pg_strxfrm_prefix() expect NUL-terminated
+		 * strings.
 		 */
 		sss->buf1[len] = '\0';
 		sss->last_len1 = len;
 
-#ifdef USE_ICU
-		/* When using ICU and not UTF8, convert string to UChar. */
-		if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU &&
-			GetDatabaseEncoding() != PG_UTF8)
-			ulen = icu_to_uchar(&uchar, sss->buf1, len);
-#endif
-
-		/*
-		 * Loop: Call strxfrm() or ucol_getSortKey(), possibly enlarge buffer,
-		 * and try again.  Both of these functions have the result buffer
-		 * content undefined if the result did not fit, so we need to retry
-		 * until everything fits, even though we only need the first few bytes
-		 * in the end.  When using ucol_nextSortKeyPart(), however, we only
-		 * ask for as many bytes as we actually need.
-		 */
-		for (;;)
+		if (pg_strxfrm_prefix_enabled(sss->locale))
 		{
-#ifdef USE_ICU
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+			if (sss->buflen2 < max_prefix_bytes)
 			{
-				/*
-				 * When using UTF8, use the iteration interface so we only
-				 * need to produce as many bytes as we actually need.
-				 */
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UCharIterator iter;
-					uint32_t	state[2];
-					UErrorCode	status;
-
-					uiter_setUTF8(&iter, sss->buf1, len);
-					state[0] = state[1] = 0;	/* won't need that again */
-					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
-				}
-				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+				sss->buflen2 = Max(max_prefix_bytes,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
 			}
-			else
-#endif
-#ifdef HAVE_LOCALE_T
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_LIBC)
-				bsize = strxfrm_l(sss->buf2, sss->buf1,
-								  sss->buflen2, sss->locale->info.lt);
-			else
-#endif
-				bsize = strxfrm(sss->buf2, sss->buf1, sss->buflen2);
-
-			sss->last_len2 = bsize;
-			if (bsize < sss->buflen2)
-				break;
 
+			bsize = pg_strxfrm_prefix(sss->buf2, sss->buf1,
+									  max_prefix_bytes, sss->locale);
+		}
+		else
+		{
 			/*
-			 * Grow buffer and retry.
+			 * Loop: Call pg_strxfrm(), possibly enlarge buffer, and try
+			 * again.  The pg_strxfrm() function leaves the result buffer
+			 * content undefined if the result did not fit, so we need to
+			 * retry until everything fits, even though we only need the first
+			 * few bytes in the end.
 			 */
-			sss->buflen2 = Max(bsize + 1,
-							   Min(sss->buflen2 * 2, MaxAllocSize));
-			sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			for (;;)
+			{
+				bsize = pg_strxfrm(sss->buf2, sss->buf1, sss->buflen2,
+								   sss->locale);
+
+				sss->last_len2 = bsize;
+				if (bsize < sss->buflen2)
+					break;
+
+				/*
+				 * Grow buffer and retry.
+				 */
+				sss->buflen2 = Max(bsize + 1,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			}
 		}
 
 		/*
@@ -2621,12 +2352,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		 * (Actually, even if there were NUL bytes in the blob it would be
 		 * okay.  See remarks on bytea case above.)
 		 */
-		memcpy(pres, sss->buf2, Min(sizeof(Datum), bsize));
-
-#ifdef USE_ICU
-		if (uchar)
-			pfree(uchar);
-#endif
+		memcpy(pres, sss->buf2, Min(max_prefix_bytes, bsize));
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..19dc95ea96 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1954,6 +1954,17 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"trust_strxfrm", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Allow use of strxfrm() for abbreviated keys optimization for libc provider."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&trust_strxfrm,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
 			gettext_noop("Whether to continue running after a failure to sync data files."),
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index cede43440b..c08e6ce07f 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -40,6 +40,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT bool trust_strxfrm;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -100,6 +101,19 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
+extern bool pg_strxfrm_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
+						 pg_locale_t locale);
+extern size_t pg_strnxfrm(char *dest, size_t destsize, const char *src,
+						  size_t srclen, pg_locale_t locale);
+extern bool pg_strxfrm_prefix_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+								pg_locale_t locale);
+extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+								 size_t srclen, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
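
The grow-and-retry loop that the patch above keeps in varstr_abbrev_convert() for the non-prefix case can be tried outside the backend. The following is only a minimal sketch under stated assumptions: it calls plain libc strxfrm() as a stand-in for pg_strxfrm(), uses malloc()/realloc() where the backend would use palloc()/repalloc(), and the helper name xfrm_prefix_retry() is hypothetical, not something the patch defines.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): transform src with libc
 * strxfrm(), enlarging the work buffer until the whole result fits, then
 * copy at most prefix_len bytes of the result into prefix.  Unused bytes
 * of prefix are zeroed.  Returns the number of bytes copied.
 */
static size_t
xfrm_prefix_retry(const char *src, char *prefix, size_t prefix_len)
{
	size_t		buflen = 8;		/* deliberately small to exercise the retry */
	char	   *buf = malloc(buflen);
	size_t		bsize;
	size_t		ncopy;

	if (buf == NULL)
		abort();

	for (;;)
	{
		bsize = strxfrm(buf, src, buflen);
		if (bsize < buflen)
			break;				/* result plus its NUL terminator fit */

		/*
		 * The buffer content is unspecified when the result did not fit,
		 * so grow the buffer and transform again from scratch.
		 */
		buflen = (bsize + 1 > buflen * 2) ? bsize + 1 : buflen * 2;
		buf = realloc(buf, buflen);
		if (buf == NULL)
			abort();
	}

	/* Only the leading bytes are needed, as with an abbreviated key. */
	ncopy = (bsize < prefix_len) ? bsize : prefix_len;
	memset(prefix, 0, prefix_len);
	memcpy(prefix, buf, ncopy);

	free(buf);
	return ncopy;
}
```

A caller would pass a NUL-terminated string and a prefix buffer of, say, sizeof(void *) bytes, then compare two such prefixes with memcmp(). The sketch transforms the entire string even though only a few bytes are kept, which is the cost the pg_strxfrm_prefix() path is meant to avoid when the provider supports it.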

Attachment: v7-0002-Refactor-pg_locale_t-routines.patch (text/x-patch; charset=UTF-8)

From a3060fd4d181162eab4f99dc956fbe1a0fc6a268 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 5 Dec 2022 10:43:52 -0800
Subject: [PATCH v7 2/2] Refactor pg_locale_t routines.

  * add pg_locale_internal.h to hide pg_locale_struct
  * move info.lt into info.libc.lt to match icu
  * introduce init_default_locale()
  * introduce pg_locale_deterministic() accessor
  * make default_locale a static global in pg_locale.c
  * refactor pg_newlocale_from_collation()
---
 src/backend/access/hash/hashfunc.c     |  82 +++---
 src/backend/commands/collationcmds.c   |   1 +
 src/backend/regex/regc_pg_locale.c     |  45 ++--
 src/backend/utils/adt/formatting.c     |  25 +-
 src/backend/utils/adt/like.c           |   3 +-
 src/backend/utils/adt/like_support.c   |   3 +-
 src/backend/utils/adt/pg_locale.c      | 342 +++++++++++++++----------
 src/backend/utils/adt/varchar.c        |  62 ++---
 src/backend/utils/adt/varlena.c        |  14 +-
 src/backend/utils/init/postinit.c      |  29 ++-
 src/include/utils/pg_locale.h          |  55 +---
 src/include/utils/pg_locale_internal.h |  68 +++++
 12 files changed, 402 insertions(+), 327 deletions(-)
 create mode 100644 src/include/utils/pg_locale_internal.h

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index c0ed995919..7cbd39f466 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -282,36 +282,28 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 						  VARSIZE_ANY_EXHDR(key));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -337,7 +329,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
 								   VARSIZE_ANY_EXHDR(key),
@@ -345,30 +337,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
+
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 6a4311cc63..d878be2fb8 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -36,6 +36,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 323f00516d..3dc89b0336 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -17,6 +17,7 @@
 
 #include "catalog/pg_collation.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 /*
  * To provide as much functionality as possible on a variety of platforms,
@@ -306,13 +307,13 @@ pg_wc_isdigit(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswdigit_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isdigit_l((unsigned char) c, pg_regex_locale->info.lt));
+					isdigit_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -342,13 +343,13 @@ pg_wc_isalpha(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalpha_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalpha_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalpha_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalpha_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -378,13 +379,13 @@ pg_wc_isalnum(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalnum_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalnum_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalnum_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -423,13 +424,13 @@ pg_wc_isupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isupper_l((unsigned char) c, pg_regex_locale->info.lt));
+					isupper_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -459,13 +460,13 @@ pg_wc_islower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					islower_l((unsigned char) c, pg_regex_locale->info.lt));
+					islower_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -495,13 +496,13 @@ pg_wc_isgraph(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswgraph_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswgraph_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isgraph_l((unsigned char) c, pg_regex_locale->info.lt));
+					isgraph_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -531,13 +532,13 @@ pg_wc_isprint(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswprint_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswprint_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isprint_l((unsigned char) c, pg_regex_locale->info.lt));
+					isprint_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -567,13 +568,13 @@ pg_wc_ispunct(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswpunct_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					ispunct_l((unsigned char) c, pg_regex_locale->info.lt));
+					ispunct_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -603,13 +604,13 @@ pg_wc_isspace(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswspace_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswspace_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isspace_l((unsigned char) c, pg_regex_locale->info.lt));
+					isspace_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -647,13 +648,13 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return towupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
+				return toupper_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
@@ -691,13 +692,13 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return towlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
+				return tolower_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index a4b524ea3a..f2fed8a2d8 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -88,6 +88,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "varatt.h"
 
 
@@ -1577,7 +1578,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 	status = U_ZERO_ERROR;
 	len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-					mylocale->info.icu.locale, &status);
+					mylocale->ctype, &status);
 	if (status == U_BUFFER_OVERFLOW_ERROR)
 	{
 		/* try again with adjusted length */
@@ -1585,7 +1586,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 		status = U_ZERO_ERROR;
 		len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-						mylocale->info.icu.locale, &status);
+						mylocale->ctype, &status);
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
@@ -1698,7 +1699,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towlower(workspace[curr_char]);
@@ -1731,7 +1732,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1820,7 +1821,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towupper(workspace[curr_char]);
@@ -1853,7 +1854,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.lt);
+						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1945,10 +1946,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
@@ -1990,10 +1991,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.lt);
-						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.lt);
+							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
 #endif
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index fc6cb7f5b7..6fdfa2cebd 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -24,6 +24,7 @@
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "varatt.h"
 
 
@@ -97,7 +98,7 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 		return pg_ascii_tolower(c);
 #ifdef HAVE_LOCALE_T
 	else if (locale)
-		return tolower_l(c, locale->info.lt);
+		return tolower_l(c, locale->info.libc.lt);
 #endif
 	else
 		return pg_tolower(c);
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 9b603d42f3..43150741c8 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -52,6 +52,7 @@
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/selfuncs.h"
 #include "utils/varlena.h"
 
@@ -1511,7 +1512,7 @@ pattern_char_isalpha(char c, bool is_multibyte,
 			(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
 #ifdef HAVE_LOCALE_T
 	else if (locale && locale->provider == COLLPROVIDER_LIBC)
-		return isalpha_l((unsigned char) c, locale->info.lt);
+		return isalpha_l((unsigned char) c, locale->info.libc.lt);
 #endif
 	else
 		return isalpha((unsigned char) c);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 509ef0784e..d3ca3c5011 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -65,6 +65,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/syscache.h"
 
 #ifdef USE_ICU
@@ -131,6 +132,11 @@ static HTAB *collation_cache = NULL;
 static char *IsoLocaleName(const char *);
 #endif
 
+/*
+ * Database default locale.
+ */
+static pg_locale_t default_locale = NULL;
+
 #ifdef USE_ICU
 /*
  * Converter object for converting between ICU's UChar strings and C strings
@@ -1336,7 +1342,7 @@ lc_collate_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1389,7 +1395,7 @@ lc_ctype_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1420,38 +1426,6 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
-struct pg_locale_struct default_locale;
-
-void
-make_icu_collator(const char *iculocstr,
-				  struct pg_locale_struct *resultp)
-{
-#ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
-
-	/* We will leak this string if the caller errors later :-( */
-	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
-	resultp->info.icu.ucol = collator;
-#else							/* not USE_ICU */
-	/* could get here if a collation was created by a build with ICU */
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif							/* not USE_ICU */
-}
-
-
 /* simple subroutine for reporting errors from newlocale() */
 #ifdef HAVE_LOCALE_T
 static void
@@ -1485,6 +1459,163 @@ report_newlocale_failure(const char *localename)
 #endif							/* HAVE_LOCALE_T */
 
 
+/*
+ * Construct a new pg_locale_t object.
+ *
+ * Passing NULL for the version is allowed; and even if it is specified, the
+ * result may or may not have an exactly matching version. Other parameters
+ * are required.
+ *
+ * Ordinarily, collate and ctype should be the same. If the provider is ICU,
+ * this is a requirement, and the 'collate' and 'ctype' arguments should both
+ * come from colliculocale (or daticulocale). If the provider is libc, the
+ * arguments should come from collcollate and collctype (or datcollate and
+ * datctype), respectively.
+ *
+ * Structures are allocated in TopMemoryContext, and the libc locale_t or
+ * UCollator is not allocated in any memory context, so the result is
+ * effectively permanent.
+ */
+static pg_locale_t
+pg_newlocale(char provider, bool deterministic, const char *collate,
+			 const char *ctype, const char *version)
+{
+	pg_locale_t result = MemoryContextAlloc(TopMemoryContext,
+											sizeof(struct pg_locale_struct));
+
+	/*
+	 * If COLLPROVIDER_DEFAULT, caller should use default_locale or NULL
+	 * instead.
+	 */
+	Assert(provider != COLLPROVIDER_DEFAULT);
+
+	result->provider = provider;
+	result->deterministic = deterministic;
+	result->collate = MemoryContextStrdup(TopMemoryContext, collate);
+	result->ctype = MemoryContextStrdup(TopMemoryContext, ctype);
+
+	if (provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		locale_t        loc;
+
+		/* newlocale's result may be leaked if we encounter an error */
+
+		if (strcmp(collate, ctype) == 0)
+		{
+			/* Normal case where they're the same */
+			errno = 0;
+#ifndef WIN32
+			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+							NULL);
+#else
+			loc = _create_locale(LC_ALL, collate);
+#endif
+			if (!loc)
+				report_newlocale_failure(collate);
+		}
+		else
+		{
+#ifndef WIN32
+			/* We need two newlocale() steps */
+			locale_t	loc1;
+
+			errno = 0;
+			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			if (!loc1)
+				report_newlocale_failure(collate);
+			errno = 0;
+			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			if (!loc)
+				report_newlocale_failure(ctype);
+#else
+
+			/*
+			 * XXX The _create_locale() API doesn't appear to support
+			 * this. Could perhaps be worked around by changing
+			 * pg_locale_t to contain two separate fields.
+			 */
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("collations with different collate and ctype values are not supported on this platform")));
+#endif
+		}
+
+		result->info.libc.lt = loc;
+#else							/* not HAVE_LOCALE_T */
+		/* platform that doesn't support locale_t */
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("collation provider LIBC is not supported on this platform")));
+#endif							/* not HAVE_LOCALE_T */
+	}
+#ifdef USE_ICU
+	else if (provider == COLLPROVIDER_ICU)
+	{
+		UCollator  *collator;
+		UErrorCode	status;
+
+		/* collator may be leaked if we encounter an error */
+
+		status = U_ZERO_ERROR;
+		collator = ucol_open(collate, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							collate, u_errorName(status))));
+
+		if (U_ICU_VERSION_MAJOR_NUM < 54)
+			icu_set_collation_attributes(collator, collate);
+
+		result->info.icu.ucol = collator;
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", provider);
+
+	return result;
+}
+
+/*
+ * Accessor so that callers don't need to include pg_locale_internal.h.
+ */
+bool
+pg_locale_deterministic(pg_locale_t locale)
+{
+	if (locale == NULL)
+		return true;
+	else
+		return locale->deterministic;
+}
+
+/*
+ * Initialize default database locale.
+ */
+void
+init_default_locale(char provider, const char *collate, const char *ctype,
+					const char *iculocale, const char *version)
+{
+	/*
+	 * For the purposes of pg_locale_t, if the provider is ICU, we use
+	 * iculocale for both collate and ctype.
+	 */
+	if (provider == COLLPROVIDER_ICU)
+	{
+		collate = iculocale;
+		ctype = iculocale;
+	}
+	else
+		Assert(iculocale == NULL);
+
+	/*
+	 * Default locale is currently always deterministic.  Nondeterministic
+	 * locales currently don't support pattern matching, which would break a
+	 * lot of things if applied globally.
+	 */
+	default_locale = pg_newlocale(provider, true, collate, ctype, version);
+}
+
 /*
  * Create a locale_t from a collation OID.  Results are cached for the
  * lifetime of the backend.  Thus, do not free the result with freelocale().
@@ -1509,8 +1640,8 @@ pg_newlocale_from_collation(Oid collid)
 
 	if (collid == DEFAULT_COLLATION_OID)
 	{
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return &default_locale;
+		if (default_locale->provider == COLLPROVIDER_ICU)
+			return default_locale;
 		else
 			return (pg_locale_t) 0;
 	}
@@ -1522,107 +1653,64 @@ pg_newlocale_from_collation(Oid collid)
 		/* We haven't computed this yet in this session, so do it */
 		HeapTuple	tp;
 		Form_pg_collation collform;
-		struct pg_locale_struct result;
-		pg_locale_t resultp;
+		pg_locale_t locale;
 		Datum		datum;
 		bool		isnull;
+		char	   *collate;
+		char	   *ctype;
+		char	   *collversionstr;
 
 		tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collid));
 		if (!HeapTupleIsValid(tp))
 			elog(ERROR, "cache lookup failed for collation %u", collid);
 		collform = (Form_pg_collation) GETSTRUCT(tp);
 
-		/* We'll fill in the result struct locally before allocating memory */
-		memset(&result, 0, sizeof(result));
-		result.provider = collform->collprovider;
-		result.deterministic = collform->collisdeterministic;
+		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
+								&isnull);
+		if (!isnull)
+			collversionstr = TextDatumGetCString(datum);
+		else
+			collversionstr = NULL;
 
 		if (collform->collprovider == COLLPROVIDER_LIBC)
 		{
-#ifdef HAVE_LOCALE_T
-			const char *collcollate;
-			const char *collctype pg_attribute_unused();
-			locale_t	loc;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate,
+									&isnull);
 			Assert(!isnull);
-			collcollate = TextDatumGetCString(datum);
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype, &isnull);
+			collate = TextDatumGetCString(datum);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype,
+									&isnull);
 			Assert(!isnull);
-			collctype = TextDatumGetCString(datum);
-
-			if (strcmp(collcollate, collctype) == 0)
-			{
-				/* Normal case where they're the same */
-				errno = 0;
-#ifndef WIN32
-				loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collcollate,
-								NULL);
-#else
-				loc = _create_locale(LC_ALL, collcollate);
-#endif
-				if (!loc)
-					report_newlocale_failure(collcollate);
-			}
-			else
-			{
-#ifndef WIN32
-				/* We need two newlocale() steps */
-				locale_t	loc1;
-
-				errno = 0;
-				loc1 = newlocale(LC_COLLATE_MASK, collcollate, NULL);
-				if (!loc1)
-					report_newlocale_failure(collcollate);
-				errno = 0;
-				loc = newlocale(LC_CTYPE_MASK, collctype, loc1);
-				if (!loc)
-					report_newlocale_failure(collctype);
-#else
-
-				/*
-				 * XXX The _create_locale() API doesn't appear to support
-				 * this. Could perhaps be worked around by changing
-				 * pg_locale_t to contain two separate fields.
-				 */
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("collations with different collate and ctype values are not supported on this platform")));
-#endif
-			}
-
-			result.info.lt = loc;
-#else							/* not HAVE_LOCALE_T */
-			/* platform that doesn't support locale_t */
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("collation provider LIBC is not supported on this platform")));
-#endif							/* not HAVE_LOCALE_T */
+			ctype = TextDatumGetCString(datum);
 		}
+#ifdef USE_ICU
 		else if (collform->collprovider == COLLPROVIDER_ICU)
 		{
-			const char *iculocstr;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale,
+									&isnull);
 			Assert(!isnull);
-			iculocstr = TextDatumGetCString(datum);
-			make_icu_collator(iculocstr, &result);
+			collate = TextDatumGetCString(datum);
+
+			/* for ICU, collate and ctype are both set from iculocale */
+			ctype = collate;
 		}
+#endif
+		else
+			/* shouldn't happen */
+			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
-								&isnull);
-		if (!isnull)
+		locale = pg_newlocale(collform->collprovider,
+							  collform->collisdeterministic,
+							  collate, ctype, collversionstr);
+
+		ReleaseSysCache(tp);
+
+		if (collversionstr != NULL)
 		{
 			char	   *actual_versionstr;
-			char	   *collversionstr;
 
-			collversionstr = TextDatumGetCString(datum);
-
-			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
-			Assert(!isnull);
+			actual_versionstr = get_collation_actual_version(collform->collprovider, collate);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
 			if (!actual_versionstr)
 			{
 				/*
@@ -1649,13 +1737,7 @@ pg_newlocale_from_collation(Oid collid)
 															NameStr(collform->collname)))));
 		}
 
-		ReleaseSysCache(tp);
-
-		/* We'll keep the pg_locale_t structures in TopMemoryContext */
-		resultp = MemoryContextAlloc(TopMemoryContext, sizeof(*resultp));
-		*resultp = result;
-
-		cache_entry->locale = resultp;
+		cache_entry->locale = locale;
 	}
 
 	return cache_entry->locale;
@@ -1815,7 +1897,7 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
 	else
 #endif
 		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
@@ -1858,7 +1940,7 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.lt);
+		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
@@ -2102,7 +2184,7 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 #ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.lt);
+		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
 	else
 #endif
 		return strxfrm(dest, src, destsize);
@@ -2699,8 +2781,8 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
+	UCollator	*collator;
+	UErrorCode   status;
 
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
@@ -2774,10 +2856,10 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.lt);
+		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.lt);
+		locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 		result = wcstombs(to, from, tolen);
 
@@ -2851,10 +2933,10 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.lt);
+			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.lt);
+			locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 			result = mbstowcs(to, str, tolen);
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 5bbba94687..8dc1faaaf2 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -762,7 +762,7 @@ bpchareq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -807,7 +807,7 @@ bpcharne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -1015,33 +1015,25 @@ hashbpchar(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) keydata, keylen);
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any((uint8_t *) buf, bsize);
+		result = hash_any((uint8_t *) buf, bsize);
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -1072,35 +1064,27 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) keydata, keylen,
 								   PG_GETARG_INT64(1));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 5030328f31..3c4e7be687 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1205,7 +1205,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -1556,8 +1556,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 
 		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
+		if (result == 0 && pg_locale_deterministic(mylocale))
 		{
 			result = memcmp(arg1, arg2, Min(len1, len2));
 			if ((result == 0) && (len1 != len2))
@@ -1612,7 +1611,7 @@ texteq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1671,7 +1670,7 @@ textne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1785,7 +1784,7 @@ text_starts_with(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -2201,8 +2200,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
+	if (result == 0 && pg_locale_deterministic(sss->locale))
 		result = strcmp(sss->buf1, sss->buf2);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..c26dfb6104 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -317,6 +317,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	char	   *collate;
 	char	   *ctype;
 	char	   *iculocale;
+	char	   *collversionstr;
 
 	/* Fetch our pg_database row normally, via syscache */
 	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
@@ -424,35 +425,33 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_daticulocale, &isnull);
 		Assert(!isnull);
 		iculocale = TextDatumGetCString(datum);
-		make_icu_collator(iculocale, &default_locale);
 	}
 	else
 		iculocale = NULL;
 
-	default_locale.provider = dbform->datlocprovider;
+	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
+							&isnull);
+	if (!isnull)
+		collversionstr = TextDatumGetCString(datum);
+	else
+		collversionstr = NULL;
 
-	/*
-	 * Default locale is currently always deterministic.  Nondeterministic
-	 * locales currently don't support pattern matching, which would break a
-	 * lot of things if applied globally.
-	 */
-	default_locale.deterministic = true;
+	init_default_locale(dbform->datlocprovider, collate, ctype, iculocale,
+						collversionstr);
 
 	/*
 	 * Check collation version.  See similar code in
 	 * pg_newlocale_from_collation().  Note that here we warn instead of error
 	 * in any case, so that we don't prevent connecting.
 	 */
-	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
-							&isnull);
-	if (!isnull)
+	if (collversionstr != NULL)
 	{
 		char	   *actual_versionstr;
-		char	   *collversionstr;
 
-		collversionstr = TextDatumGetCString(datum);
+		actual_versionstr = get_collation_actual_version(
+			dbform->datlocprovider,
+			dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
@@ -470,6 +469,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 							 "or build PostgreSQL with the right library version.",
 							 quote_identifier(name))));
 	}
+	else
+		collversionstr = NULL;
 
 	/* Make the locale settings visible as GUC variables, too */
 	SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index c08e6ce07f..7be42fc21e 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,22 +15,6 @@
 #if defined(LOCALE_T_IN_XLOCALE) || defined(WCSTOMBS_L_IN_XLOCALE)
 #include <xlocale.h>
 #endif
-#ifdef USE_ICU
-#include <unicode/ucol.h>
-#endif
-
-#ifdef USE_ICU
-/*
- * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
- * (see
- * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
- */
-#if U_ICU_VERSION_MAJOR_NUM >= 53
-#define HAVE_UCOL_STRCOLLUTF8 1
-#else
-#undef HAVE_UCOL_STRCOLLUTF8
-#endif
-#endif
 
 /* use for libc locale names */
 #define LOCALE_NAME_BUFLEN 128
@@ -65,39 +49,12 @@ extern struct lconv *PGLC_localeconv(void);
 extern void cache_locale_time(void);
 
 
-/*
- * We define our own wrapper around locale_t so we can keep the same
- * function signatures for all builds, while not having to create a
- * fake version of the standard type locale_t in the global namespace.
- * pg_locale_t is occasionally checked for truth, so make it a pointer.
- */
-struct pg_locale_struct
-{
-	char		provider;
-	bool		deterministic;
-	union
-	{
-#ifdef HAVE_LOCALE_T
-		locale_t	lt;
-#endif
-#ifdef USE_ICU
-		struct
-		{
-			const char *locale;
-			UCollator  *ucol;
-		}			icu;
-#endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
-	}			info;
-};
-
 typedef struct pg_locale_struct *pg_locale_t;
 
-extern PGDLLIMPORT struct pg_locale_struct default_locale;
-
-extern void make_icu_collator(const char *iculocstr,
-							  struct pg_locale_struct *resultp);
-
+extern void init_default_locale(char provider, const char *collate,
+								const char *ctype, const char *iculocale,
+								const char *version);
+extern bool pg_locale_deterministic(pg_locale_t locale);
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
@@ -115,10 +72,6 @@ extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
 extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
-#ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
 extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
new file mode 100644
index 0000000000..33465ad92d
--- /dev/null
+++ b/src/include/utils/pg_locale_internal.h
@@ -0,0 +1,68 @@
+/*-----------------------------------------------------------------------
+ *
+ * PostgreSQL locale utilities
+ *
+ * src/include/utils/pg_locale_internal.h
+ *
+ * Copyright (c) 2002-2022, PostgreSQL Global Development Group
+ *
+ *-----------------------------------------------------------------------
+ */
+
+
+#ifndef _PG_LOCALE_INTERNAL_
+#define _PG_LOCALE_INTERNAL_
+
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
+
+#ifdef USE_ICU
+/*
+ * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
+ * (see
+ * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
+ */
+#if U_ICU_VERSION_MAJOR_NUM >= 53
+#define HAVE_UCOL_STRCOLLUTF8 1
+#else
+#undef HAVE_UCOL_STRCOLLUTF8
+#endif
+#endif
+
+/*
+ * We define our own wrapper around locale_t so we can keep the same
+ * function signatures for all builds, while not having to create a
+ * fake version of the standard type locale_t in the global namespace.
+ * pg_locale_t is occasionally checked for truth, so make it a pointer.
+ */
+struct pg_locale_struct
+{
+	char		provider;
+	bool		deterministic;
+	char	   *collate;
+	char	   *ctype;
+	union
+	{
+#ifdef HAVE_LOCALE_T
+		struct
+		{
+			locale_t	lt;
+		}			libc;
+#endif
+#ifdef USE_ICU
+		struct
+		{
+			UCollator	*ucol;
+		}			icu;
+#endif
+		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
+	}			info;
+};
+
+#ifdef USE_ICU
+extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+#endif
+
+#endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1

#14

Peter Geoghegan

pg@bowt.ie

almost 3 years ago

In reply to: Jeff Davis (#13)

Re: Rework of collation code, extensibility

On Sat, Jan 14, 2023 at 12:03 PM Jeff Davis <pgsql@j-davis.com> wrote:

The first goal I had was simply that the code was really hard to
understand and work on, and refactoring was justified to improve the
situation.

The second goal, which is somewhat dependent on the first goal, is that
we really need an ability to support multiple ICU libraries, and I
wanted to do some common groundwork that would be needed for any
approach we choose there, and provide some hooks to get us there. You
are right that this goal influenced the first goal.

I don't disagree that it was somewhat independent of the first goal. I
just think that it makes sense to "round up to fully dependent".
Basically it's not independent enough to be worth talking about as an
independent thing, just as a practical matter - it's confusing at the
level of things like the commit message. There is a clear direction
that you're going in here from the start, and your intentions in 0001
do matter to somebody that's just looking at 0001 in isolation. That
is my opinion, at least.

The second goal is a perfectly good enough goal on its own, and one
that I am totally supportive of. Making the code clearer is icing on
the cake.

ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
basic testing and it doesn't seem like it's slower than using the
length. If passing the length is faster for some reason, it would
complicate the API because we'd need an entry point that's expecting
nul-termination and lengths, which is awkward (as Peter E. pointed
out).

That's good. I'm happy to leave it at that. I was only enquiring.
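
To make the API point concrete, here is a minimal sketch of a single entry point where a length of -1 means "nul-terminated", which is the convention described above for ucol_strcollUTF8(). This is not code from the patch: the name sketch_strncmp is hypothetical, and it uses plain memcmp() rather than ICU so that it stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): compare two byte strings.
 * A negative length means "this argument is nul-terminated", so one
 * entry point serves callers with and without known lengths.
 * Returns -1, 0 or 1.
 */
static int
sketch_strncmp(const char *a, ptrdiff_t alen, const char *b, ptrdiff_t blen)
{
	size_t		la;
	size_t		lb;
	size_t		minlen;
	int			r;

	assert(a != NULL && b != NULL);

	/* Resolve the "-1 means nul-terminated" convention up front. */
	la = (alen < 0) ? strlen(a) : (size_t) alen;
	lb = (blen < 0) ? strlen(b) : (size_t) blen;
	minlen = (la < lb) ? la : lb;

	r = memcmp(a, b, minlen);
	if (r != 0)
		return (r < 0) ? -1 : 1;

	/* Equal prefix: the shorter string sorts first. */
	if (la != lb)
		return (la < lb) ? -1 : 1;
	return 0;
}
```

With this shape a caller can mix both styles in one call, for example passing -1 for a C string and an explicit length for a buffer that is not nul-terminated.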

I felt it was a little clearer amongst the other code, to a casual
reader, but I suppose it's a style thing. I will change it if you
insist.

I certainly won't insist.

I'd have to expose the pg_locale_t struct, which didn't seem desirable
to me. Do you think it's enough of a performance concern to be worth
some ugliness there?

I don't know. Quite possibly not. It would be nice to have some data
on that, though.

--
Peter Geoghegan

#15

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Geoghegan (#14)

2 attachment(s)

Re: Rework of collation code, extensibility

On Tue, 2023-01-17 at 14:18 -0800, Peter Geoghegan wrote:

The second goal is a perfectly good enough goal on its own, and one
that I am totally supportive of. Making the code clearer is icing on
the cake.

Attached v8, which is just a rebase.

To reiterate: commitfest entry
https://commitfest.postgresql.org/41/3956/ is dependent on these
patches and is a big part of the motivation for refactoring.

I don't know. Quite possibly not. It would be nice to have some data
on that, though.

I tested with hash aggregation, which might be more dependent on
pg_locale_deterministic() than sorting. I didn't see any significant
difference between master and the refactoring branch, so I don't see a
need to make that function "inline".
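
For readers following along, the trade-off can be sketched as below. This is a simplified illustration, not the patch itself: the sketch_ names are hypothetical, and the struct is cut down to two fields. The public side sees only an opaque pointer and an out-of-line accessor; the struct layout stays in what would be the internal header, which is why inlining the accessor would mean exposing the struct.

```c
#include <stdbool.h>
#include <stddef.h>

/* What a public header would expose: an opaque pointer and an accessor. */
typedef struct sketch_locale_struct *sketch_locale_t;
extern bool sketch_locale_deterministic(sketch_locale_t locale);

/* What an internal header would define (simplified, hypothetical layout). */
struct sketch_locale_struct
{
	char		provider;
	bool		deterministic;
};

/*
 * Out-of-line accessor.  A NULL locale means no locale object was
 * created, and is treated as deterministic, as in the patch's
 * pg_locale_deterministic().
 */
bool
sketch_locale_deterministic(sketch_locale_t locale)
{
	if (locale == NULL)
		return true;
	return locale->deterministic;
}
```

Callers that previously wrote "!mylocale || mylocale->deterministic" would call the accessor instead, at the cost of one function call per check.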

I also re-tested sorting and found some interesting results for
en-US-x-icu on a UTF-8 database (which I suspect is one of the most
common configurations for ICU):

* the refactoring branch is now more than 5% faster, whether using
abbreviated keys or not
* disabling abbreviated keys makes sorting 8-10% faster on both
master and the refactoring branch

Both of these are surprising, and I haven't investigated deeply yet.
Maybe something about LTO, some intervening patch, or I just made some
mistakes somewhere (I did this fairly quickly). But as of now, it
doesn't look like the refactoring patch hurts anything.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v8-0002-Refactor-pg_locale_t-routines.patch (text/x-patch; charset=UTF-8)

From d1e2e1757b043c876695b8fa8c304b5126efb3aa Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 5 Dec 2022 10:43:52 -0800
Subject: [PATCH v8 2/2] Refactor pg_locale_t routines.

  * add pg_locale_internal.h to hide pg_locale_struct
  * move info.lt into info.libc.lt to match icu
  * introduce init_default_locale()
  * introduce pg_locale_deterministic() accessor
  * make default_locale a static global in pg_locale.c
  * refactor pg_newlocale_from_collation()
---
 src/backend/access/hash/hashfunc.c     |  82 +++---
 src/backend/commands/collationcmds.c   |   1 +
 src/backend/regex/regc_pg_locale.c     |  45 ++--
 src/backend/utils/adt/formatting.c     |  25 +-
 src/backend/utils/adt/like.c           |   3 +-
 src/backend/utils/adt/like_support.c   |   3 +-
 src/backend/utils/adt/pg_locale.c      | 342 +++++++++++++++----------
 src/backend/utils/adt/varchar.c        |  62 ++---
 src/backend/utils/adt/varlena.c        |  14 +-
 src/backend/utils/init/postinit.c      |  29 ++-
 src/include/utils/pg_locale.h          |  55 +---
 src/include/utils/pg_locale_internal.h |  68 +++++
 12 files changed, 402 insertions(+), 327 deletions(-)
 create mode 100644 src/include/utils/pg_locale_internal.h

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index c0ed995919..7cbd39f466 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -282,36 +282,28 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 						  VARSIZE_ANY_EXHDR(key));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -337,7 +329,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
 								   VARSIZE_ANY_EXHDR(key),
@@ -345,30 +337,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
+
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
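
Editorial sketch, not part of the patch: the nondeterministic branch in the hash functions above sizes the sort key with a zero-length call, allocates, transforms again, and hashes the transformed bytes. The standalone example below shows the same two-pass pattern using only standard C. It assumes strxfrm() as a stand-in for pg_strnxfrm() (so it needs NUL-terminated input and follows the process-wide LC_COLLATE setting), and fnv1a() and hash_by_sort_key() are hypothetical names standing in for hash_any() and the in-tree callers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for hash_any(): 32-bit FNV-1a over a byte buffer. */
static uint32_t
fnv1a(const unsigned char *buf, size_t len)
{
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Hash a NUL-terminated string by its collation sort key.  strxfrm() is an
 * assumption standing in for pg_strnxfrm(); it uses the current LC_COLLATE.
 */
static uint32_t
hash_by_sort_key(const char *key)
{
	size_t		bsize;
	size_t		rsize;
	char	   *buf;
	uint32_t	result;

	/* Pass 1: a zero-size call reports the length the sort key needs. */
	bsize = strxfrm(NULL, key, 0);
	buf = malloc(bsize + 1);	/* +1 for the terminating NUL */
	if (buf == NULL)
		abort();

	/* Pass 2: fill the buffer; the length must match the first pass. */
	rsize = strxfrm(buf, key, bsize + 1);
	assert(rsize == bsize);

	result = fnv1a((const unsigned char *) buf, bsize);
	free(buf);
	return result;
}
```

Hashing the sort key rather than the raw bytes is what lets strings that compare equal under a nondeterministic collation land in the same hash bucket, provided the transform maps them to identical keys.
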
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 6a4311cc63..d878be2fb8 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -36,6 +36,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 323f00516d..3dc89b0336 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -17,6 +17,7 @@
 
 #include "catalog/pg_collation.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 /*
  * To provide as much functionality as possible on a variety of platforms,
@@ -306,13 +307,13 @@ pg_wc_isdigit(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswdigit_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isdigit_l((unsigned char) c, pg_regex_locale->info.lt));
+					isdigit_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -342,13 +343,13 @@ pg_wc_isalpha(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalpha_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalpha_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalpha_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalpha_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -378,13 +379,13 @@ pg_wc_isalnum(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalnum_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalnum_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalnum_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -423,13 +424,13 @@ pg_wc_isupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isupper_l((unsigned char) c, pg_regex_locale->info.lt));
+					isupper_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -459,13 +460,13 @@ pg_wc_islower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					islower_l((unsigned char) c, pg_regex_locale->info.lt));
+					islower_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -495,13 +496,13 @@ pg_wc_isgraph(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswgraph_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswgraph_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isgraph_l((unsigned char) c, pg_regex_locale->info.lt));
+					isgraph_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -531,13 +532,13 @@ pg_wc_isprint(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswprint_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswprint_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isprint_l((unsigned char) c, pg_regex_locale->info.lt));
+					isprint_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -567,13 +568,13 @@ pg_wc_ispunct(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswpunct_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					ispunct_l((unsigned char) c, pg_regex_locale->info.lt));
+					ispunct_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -603,13 +604,13 @@ pg_wc_isspace(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswspace_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswspace_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isspace_l((unsigned char) c, pg_regex_locale->info.lt));
+					isspace_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -647,13 +648,13 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return towupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
+				return toupper_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
@@ -691,13 +692,13 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return towlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
+				return tolower_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f3f4db5ef6..f45e2043a1 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -88,6 +88,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "varatt.h"
 
 
@@ -1577,7 +1578,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 	status = U_ZERO_ERROR;
 	len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-					mylocale->info.icu.locale, &status);
+					mylocale->ctype, &status);
 	if (status == U_BUFFER_OVERFLOW_ERROR)
 	{
 		/* try again with adjusted length */
@@ -1585,7 +1586,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 		status = U_ZERO_ERROR;
 		len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-						mylocale->info.icu.locale, &status);
+						mylocale->ctype, &status);
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
@@ -1698,7 +1699,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towlower(workspace[curr_char]);
@@ -1731,7 +1732,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1820,7 +1821,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towupper(workspace[curr_char]);
@@ -1853,7 +1854,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.lt);
+						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1945,10 +1946,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
@@ -1990,10 +1991,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.lt);
-						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.lt);
+							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
 #endif
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index fc6cb7f5b7..6fdfa2cebd 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -24,6 +24,7 @@
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "varatt.h"
 
 
@@ -97,7 +98,7 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 		return pg_ascii_tolower(c);
 #ifdef HAVE_LOCALE_T
 	else if (locale)
-		return tolower_l(c, locale->info.lt);
+		return tolower_l(c, locale->info.libc.lt);
 #endif
 	else
 		return pg_tolower(c);
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 9b603d42f3..43150741c8 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -52,6 +52,7 @@
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/selfuncs.h"
 #include "utils/varlena.h"
 
@@ -1511,7 +1512,7 @@ pattern_char_isalpha(char c, bool is_multibyte,
 			(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
 #ifdef HAVE_LOCALE_T
 	else if (locale && locale->provider == COLLPROVIDER_LIBC)
-		return isalpha_l((unsigned char) c, locale->info.lt);
+		return isalpha_l((unsigned char) c, locale->info.libc.lt);
 #endif
 	else
 		return isalpha((unsigned char) c);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 509ef0784e..d3ca3c5011 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -65,6 +65,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/syscache.h"
 
 #ifdef USE_ICU
@@ -131,6 +132,11 @@ static HTAB *collation_cache = NULL;
 static char *IsoLocaleName(const char *);
 #endif
 
+/*
+ * Database default locale.
+ */
+static pg_locale_t default_locale = NULL;
+
 #ifdef USE_ICU
 /*
  * Converter object for converting between ICU's UChar strings and C strings
@@ -1336,7 +1342,7 @@ lc_collate_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1389,7 +1395,7 @@ lc_ctype_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1420,38 +1426,6 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
-struct pg_locale_struct default_locale;
-
-void
-make_icu_collator(const char *iculocstr,
-				  struct pg_locale_struct *resultp)
-{
-#ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
-
-	/* We will leak this string if the caller errors later :-( */
-	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
-	resultp->info.icu.ucol = collator;
-#else							/* not USE_ICU */
-	/* could get here if a collation was created by a build with ICU */
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif							/* not USE_ICU */
-}
-
-
 /* simple subroutine for reporting errors from newlocale() */
 #ifdef HAVE_LOCALE_T
 static void
@@ -1485,6 +1459,163 @@ report_newlocale_failure(const char *localename)
 #endif							/* HAVE_LOCALE_T */
 
 
+/*
+ * Construct a new pg_locale_t object.
+ *
+ * Passing NULL for the version is allowed; even if a version is specified,
+ * the result may or may not have an exactly matching version. All other
+ * parameters are required.
+ *
+ * Ordinarily, collate and ctype should be the same. If the provider is ICU,
+ * this is a requirement, and the 'collate' and 'ctype' arguments should both
+ * come from colliculocale (or daticulocale). If the provider is libc, the
+ * arguments should come from collcollate and collctype (or datcollate and
+ * datctype), respectively.
+ *
+ * Structures are allocated in TopMemoryContext, and the libc locale_t or
+ * UCollator is not allocated in any memory context, so the result is
+ * effectively permanent.
+ */
+static pg_locale_t
+pg_newlocale(char provider, bool deterministic, const char *collate,
+			 const char *ctype, const char *version)
+{
+	pg_locale_t result = MemoryContextAlloc(TopMemoryContext,
+											sizeof(struct pg_locale_struct));
+
+	/*
+	 * If COLLPROVIDER_DEFAULT, caller should use default_locale or NULL
+	 * instead.
+	 */
+	Assert(provider != COLLPROVIDER_DEFAULT);
+
+	result->provider = provider;
+	result->deterministic = deterministic;
+	result->collate = MemoryContextStrdup(TopMemoryContext, collate);
+	result->ctype = MemoryContextStrdup(TopMemoryContext, ctype);
+
+	if (provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		locale_t	loc;
+
+		/* newlocale's result may be leaked if we encounter an error */
+
+		if (strcmp(collate, ctype) == 0)
+		{
+			/* Normal case where they're the same */
+			errno = 0;
+#ifndef WIN32
+			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+							NULL);
+#else
+			loc = _create_locale(LC_ALL, collate);
+#endif
+			if (!loc)
+				report_newlocale_failure(collate);
+		}
+		else
+		{
+#ifndef WIN32
+			/* We need two newlocale() steps */
+			locale_t	loc1;
+
+			errno = 0;
+			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			if (!loc1)
+				report_newlocale_failure(collate);
+			errno = 0;
+			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			if (!loc)
+				report_newlocale_failure(ctype);
+#else
+
+			/*
+			 * XXX The _create_locale() API doesn't appear to support
+			 * this. Could perhaps be worked around by changing
+			 * pg_locale_t to contain two separate fields.
+			 */
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("collations with different collate and ctype values are not supported on this platform")));
+#endif
+		}
+
+		result->info.libc.lt = loc;
+#else							/* not HAVE_LOCALE_T */
+		/* platform that doesn't support locale_t */
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("collation provider LIBC is not supported on this platform")));
+#endif							/* not HAVE_LOCALE_T */
+	}
+#ifdef USE_ICU
+	else if (provider == COLLPROVIDER_ICU)
+	{
+		UCollator  *collator;
+		UErrorCode	status;
+
+		/* collator may be leaked if we encounter an error */
+
+		status = U_ZERO_ERROR;
+		collator = ucol_open(collate, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							collate, u_errorName(status))));
+
+		if (U_ICU_VERSION_MAJOR_NUM < 54)
+			icu_set_collation_attributes(collator, collate);
+
+		result->info.icu.ucol = collator;
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", provider);
+
+	return result;
+}
+
+/*
+ * Accessor so that callers don't need to include pg_locale_internal.h.
+ */
+bool
+pg_locale_deterministic(pg_locale_t locale)
+{
+	if (locale == NULL)
+		return true;
+	else
+		return locale->deterministic;
+}
+
+/*
+ * Initialize default database locale.
+ */
+void
+init_default_locale(char provider, const char *collate, const char *ctype,
+					const char *iculocale, const char *version)
+{
+	/*
+	 * For the purposes of pg_locale_t, if the provider is ICU, we use
+	 * iculocale for both collate and ctype.
+	 */
+	if (provider == COLLPROVIDER_ICU)
+	{
+		collate = iculocale;
+		ctype = iculocale;
+	}
+	else
+		Assert(iculocale == NULL);
+
+	/*
+	 * Default locale is currently always deterministic.  Nondeterministic
+	 * locales currently don't support pattern matching, which would break a
+	 * lot of things if applied globally.
+	 */
+	default_locale = pg_newlocale(provider, true, collate, ctype, version);
+}
+
 /*
  * Create a locale_t from a collation OID.  Results are cached for the
  * lifetime of the backend.  Thus, do not free the result with freelocale().
@@ -1509,8 +1640,8 @@ pg_newlocale_from_collation(Oid collid)
 
 	if (collid == DEFAULT_COLLATION_OID)
 	{
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return &default_locale;
+		if (default_locale->provider == COLLPROVIDER_ICU)
+			return default_locale;
 		else
 			return (pg_locale_t) 0;
 	}
@@ -1522,107 +1653,64 @@ pg_newlocale_from_collation(Oid collid)
 		/* We haven't computed this yet in this session, so do it */
 		HeapTuple	tp;
 		Form_pg_collation collform;
-		struct pg_locale_struct result;
-		pg_locale_t resultp;
+		pg_locale_t locale;
 		Datum		datum;
 		bool		isnull;
+		char	   *collate;
+		char	   *ctype;
+		char	   *collversionstr;
 
 		tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collid));
 		if (!HeapTupleIsValid(tp))
 			elog(ERROR, "cache lookup failed for collation %u", collid);
 		collform = (Form_pg_collation) GETSTRUCT(tp);
 
-		/* We'll fill in the result struct locally before allocating memory */
-		memset(&result, 0, sizeof(result));
-		result.provider = collform->collprovider;
-		result.deterministic = collform->collisdeterministic;
+		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
+								&isnull);
+		if (!isnull)
+			collversionstr = TextDatumGetCString(datum);
+		else
+			collversionstr = NULL;
 
 		if (collform->collprovider == COLLPROVIDER_LIBC)
 		{
-#ifdef HAVE_LOCALE_T
-			const char *collcollate;
-			const char *collctype pg_attribute_unused();
-			locale_t	loc;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate,
+									&isnull);
 			Assert(!isnull);
-			collcollate = TextDatumGetCString(datum);
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype, &isnull);
+			collate = TextDatumGetCString(datum);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype,
+									&isnull);
 			Assert(!isnull);
-			collctype = TextDatumGetCString(datum);
-
-			if (strcmp(collcollate, collctype) == 0)
-			{
-				/* Normal case where they're the same */
-				errno = 0;
-#ifndef WIN32
-				loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collcollate,
-								NULL);
-#else
-				loc = _create_locale(LC_ALL, collcollate);
-#endif
-				if (!loc)
-					report_newlocale_failure(collcollate);
-			}
-			else
-			{
-#ifndef WIN32
-				/* We need two newlocale() steps */
-				locale_t	loc1;
-
-				errno = 0;
-				loc1 = newlocale(LC_COLLATE_MASK, collcollate, NULL);
-				if (!loc1)
-					report_newlocale_failure(collcollate);
-				errno = 0;
-				loc = newlocale(LC_CTYPE_MASK, collctype, loc1);
-				if (!loc)
-					report_newlocale_failure(collctype);
-#else
-
-				/*
-				 * XXX The _create_locale() API doesn't appear to support
-				 * this. Could perhaps be worked around by changing
-				 * pg_locale_t to contain two separate fields.
-				 */
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("collations with different collate and ctype values are not supported on this platform")));
-#endif
-			}
-
-			result.info.lt = loc;
-#else							/* not HAVE_LOCALE_T */
-			/* platform that doesn't support locale_t */
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("collation provider LIBC is not supported on this platform")));
-#endif							/* not HAVE_LOCALE_T */
+			ctype = TextDatumGetCString(datum);
 		}
+#ifdef USE_ICU
 		else if (collform->collprovider == COLLPROVIDER_ICU)
 		{
-			const char *iculocstr;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale,
+									&isnull);
 			Assert(!isnull);
-			iculocstr = TextDatumGetCString(datum);
-			make_icu_collator(iculocstr, &result);
+			collate = TextDatumGetCString(datum);
+
+			/* for ICU, collate and ctype are both set from iculocale */
+			ctype = collate;
 		}
+#endif
+		else
+			/* shouldn't happen */
+			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
-								&isnull);
-		if (!isnull)
+		locale = pg_newlocale(collform->collprovider,
+							  collform->collisdeterministic,
+							  collate, ctype, collversionstr);
+
+		ReleaseSysCache(tp);
+
+		if (collversionstr != NULL)
 		{
 			char	   *actual_versionstr;
-			char	   *collversionstr;
 
-			collversionstr = TextDatumGetCString(datum);
-
-			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
-			Assert(!isnull);
+			actual_versionstr = get_collation_actual_version(collform->collprovider, collate);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
 			if (!actual_versionstr)
 			{
 				/*
@@ -1649,13 +1737,7 @@ pg_newlocale_from_collation(Oid collid)
 															NameStr(collform->collname)))));
 		}
 
-		ReleaseSysCache(tp);
-
-		/* We'll keep the pg_locale_t structures in TopMemoryContext */
-		resultp = MemoryContextAlloc(TopMemoryContext, sizeof(*resultp));
-		*resultp = result;
-
-		cache_entry->locale = resultp;
+		cache_entry->locale = locale;
 	}
 
 	return cache_entry->locale;
@@ -1815,7 +1897,7 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
 	else
 #endif
 		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
@@ -1858,7 +1940,7 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.lt);
+		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
@@ -2102,7 +2184,7 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 #ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.lt);
+		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
 	else
 #endif
 		return strxfrm(dest, src, destsize);
@@ -2699,8 +2781,8 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
+	UCollator	*collator;
+	UErrorCode   status;
 
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
@@ -2774,10 +2856,10 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.lt);
+		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.lt);
+		locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 		result = wcstombs(to, from, tolen);
 
@@ -2851,10 +2933,10 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.lt);
+			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.lt);
+			locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 			result = mbstowcs(to, str, tolen);
 
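
Editorial sketch, not part of the patch: the pg_locale.c changes above move the struct definition into pg_locale_internal.h and give ordinary callers an accessor, pg_locale_deterministic(), that also accepts NULL. The example below illustrates that opaque-handle pattern in isolation. All demo_* names are hypothetical, and the struct carries only the two fields needed for the illustration; it is not the layout used by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* "Internal" header: full definition, meant only for provider code. */
struct demo_locale_struct
{
	char		provider;		/* hypothetical provider code */
	bool		deterministic;
};

/* "Public" header: callers only ever handle the pointer type. */
typedef struct demo_locale_struct *demo_locale_t;

/*
 * Accessor for callers that do not include the internal header.  A NULL
 * handle (no locale object was loaded) is treated as deterministic.
 */
static bool
demo_locale_deterministic(demo_locale_t locale)
{
	if (locale == NULL)
		return true;
	return locale->deterministic;
}

/* Caller-side check: the explicit NULL test collapses into one call. */
static bool
demo_can_compare_bytewise(bool locale_is_c, demo_locale_t locale)
{
	return locale_is_c || demo_locale_deterministic(locale);
}
```

Folding the NULL case into the accessor is what allows call sites such as texteq() and bpchareq() to drop the repeated "!mylocale || mylocale->deterministic" test without including the internal header.
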
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 5bbba94687..8dc1faaaf2 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -762,7 +762,7 @@ bpchareq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -807,7 +807,7 @@ bpcharne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -1015,33 +1015,25 @@ hashbpchar(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) keydata, keylen);
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any((uint8_t *) buf, bsize);
+		result = hash_any((uint8_t *) buf, bsize);
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -1072,35 +1064,27 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) keydata, keylen,
 								   PG_GETARG_INT64(1));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 5030328f31..3c4e7be687 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1205,7 +1205,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -1556,8 +1556,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 
 		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
+		if (result == 0 && pg_locale_deterministic(mylocale))
 		{
 			result = memcmp(arg1, arg2, Min(len1, len2));
 			if ((result == 0) && (len1 != len2))
@@ -1612,7 +1611,7 @@ texteq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1671,7 +1670,7 @@ textne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1785,7 +1784,7 @@ text_starts_with(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -2201,8 +2200,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
+	if (result == 0 && pg_locale_deterministic(sss->locale))
 		result = strcmp(sss->buf1, sss->buf2);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..c26dfb6104 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -317,6 +317,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	char	   *collate;
 	char	   *ctype;
 	char	   *iculocale;
+	char	   *collversionstr;
 
 	/* Fetch our pg_database row normally, via syscache */
 	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
@@ -424,35 +425,33 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_daticulocale, &isnull);
 		Assert(!isnull);
 		iculocale = TextDatumGetCString(datum);
-		make_icu_collator(iculocale, &default_locale);
 	}
 	else
 		iculocale = NULL;
 
-	default_locale.provider = dbform->datlocprovider;
+	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
+							&isnull);
+	if (!isnull)
+		collversionstr = TextDatumGetCString(datum);
+	else
+		collversionstr = NULL;
 
-	/*
-	 * Default locale is currently always deterministic.  Nondeterministic
-	 * locales currently don't support pattern matching, which would break a
-	 * lot of things if applied globally.
-	 */
-	default_locale.deterministic = true;
+	init_default_locale(dbform->datlocprovider, collate, ctype, iculocale,
+						collversionstr);
 
 	/*
 	 * Check collation version.  See similar code in
 	 * pg_newlocale_from_collation().  Note that here we warn instead of error
 	 * in any case, so that we don't prevent connecting.
 	 */
-	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
-							&isnull);
-	if (!isnull)
+	if (collversionstr != NULL)
 	{
 		char	   *actual_versionstr;
-		char	   *collversionstr;
 
-		collversionstr = TextDatumGetCString(datum);
+		actual_versionstr = get_collation_actual_version(
+			dbform->datlocprovider,
+			dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
@@ -470,6 +469,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 							 "or build PostgreSQL with the right library version.",
 							 quote_identifier(name))));
 	}
+	else
+		collversionstr = NULL;
 
 	/* Make the locale settings visible as GUC variables, too */
 	SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index c08e6ce07f..7be42fc21e 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,22 +15,6 @@
 #if defined(LOCALE_T_IN_XLOCALE) || defined(WCSTOMBS_L_IN_XLOCALE)
 #include <xlocale.h>
 #endif
-#ifdef USE_ICU
-#include <unicode/ucol.h>
-#endif
-
-#ifdef USE_ICU
-/*
- * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
- * (see
- * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
- */
-#if U_ICU_VERSION_MAJOR_NUM >= 53
-#define HAVE_UCOL_STRCOLLUTF8 1
-#else
-#undef HAVE_UCOL_STRCOLLUTF8
-#endif
-#endif
 
 /* use for libc locale names */
 #define LOCALE_NAME_BUFLEN 128
@@ -65,39 +49,12 @@ extern struct lconv *PGLC_localeconv(void);
 extern void cache_locale_time(void);
 
 
-/*
- * We define our own wrapper around locale_t so we can keep the same
- * function signatures for all builds, while not having to create a
- * fake version of the standard type locale_t in the global namespace.
- * pg_locale_t is occasionally checked for truth, so make it a pointer.
- */
-struct pg_locale_struct
-{
-	char		provider;
-	bool		deterministic;
-	union
-	{
-#ifdef HAVE_LOCALE_T
-		locale_t	lt;
-#endif
-#ifdef USE_ICU
-		struct
-		{
-			const char *locale;
-			UCollator  *ucol;
-		}			icu;
-#endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
-	}			info;
-};
-
 typedef struct pg_locale_struct *pg_locale_t;
 
-extern PGDLLIMPORT struct pg_locale_struct default_locale;
-
-extern void make_icu_collator(const char *iculocstr,
-							  struct pg_locale_struct *resultp);
-
+extern void init_default_locale(char provider, const char *collate,
+								const char *ctype, const char *iculocale,
+								const char *version);
+extern bool pg_locale_deterministic(pg_locale_t locale);
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
@@ -115,10 +72,6 @@ extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
 extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
-#ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
 extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
new file mode 100644
index 0000000000..33465ad92d
--- /dev/null
+++ b/src/include/utils/pg_locale_internal.h
@@ -0,0 +1,68 @@
+/*-----------------------------------------------------------------------
+ *
+ * PostgreSQL locale utilities
+ *
+ * src/include/utils/pg_locale_internal.h
+ *
+ * Copyright (c) 2002-2022, PostgreSQL Global Development Group
+ *
+ *-----------------------------------------------------------------------
+ */
+
+
+#ifndef _PG_LOCALE_INTERNAL_
+#define _PG_LOCALE_INTERNAL_
+
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
+
+#ifdef USE_ICU
+/*
+ * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
+ * (see
+ * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
+ */
+#if U_ICU_VERSION_MAJOR_NUM >= 53
+#define HAVE_UCOL_STRCOLLUTF8 1
+#else
+#undef HAVE_UCOL_STRCOLLUTF8
+#endif
+#endif
+
+/*
+ * We define our own wrapper around locale_t so we can keep the same
+ * function signatures for all builds, while not having to create a
+ * fake version of the standard type locale_t in the global namespace.
+ * pg_locale_t is occasionally checked for truth, so make it a pointer.
+ */
+struct pg_locale_struct
+{
+	char		provider;
+	bool		deterministic;
+	char	   *collate;
+	char	   *ctype;
+	union
+	{
+#ifdef HAVE_LOCALE_T
+		struct
+		{
+			locale_t	lt;
+		}			libc;
+#endif
+#ifdef USE_ICU
+		struct
+		{
+			UCollator	*ucol;
+		}			icu;
+#endif
+		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
+	}			info;
+};
+
+#ifdef USE_ICU
+extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+#endif
+
+#endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1
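
For readers skimming the call-site changes above: every `!mylocale || mylocale->deterministic` test is replaced by `pg_locale_deterministic(mylocale)`. The body of that function is not shown in this part of the patch, so the following is only a minimal sketch of what the replacements imply, assuming a NULL locale (the database default) is treated as deterministic. The `sketch_` names and the cut-down struct are hypothetical stand-ins, not the patch's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct pg_locale_struct; only the fields used here. */
struct sketch_locale
{
	char		provider;		/* 'c' for libc, 'i' for ICU */
	bool		deterministic;
};

typedef struct sketch_locale *sketch_locale_t;

/*
 * Assumed behavior, inferred from the call sites: a NULL locale means the
 * database default, which is always deterministic.
 */
bool
sketch_locale_deterministic(sketch_locale_t locale)
{
	return locale == NULL || locale->deterministic;
}

/*
 * Tie-break in the style of varstrfastcmp_locale(): when the collation
 * reports equality and the locale is deterministic, fall back to a bytewise
 * comparison of the nul-terminated inputs.
 */
int
sketch_break_tie(int coll_result, const char *a, const char *b,
				 sketch_locale_t locale)
{
	assert(a != NULL && b != NULL);

	if (coll_result == 0 && sketch_locale_deterministic(locale))
		return strcmp(a, b);
	return coll_result;
}
```

Folding the NULL check into one helper means callers no longer need to know how the default locale is represented, which appears to be the point of the change.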

v8-0001-Add-pg_strcoll-pg_strxfrm-and-variants.patch (text/x-patch; charset=UTF-8)

From 954ddd0f8b3adb3d66e9b12adc6fc021393e2039 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v8 1/2] Add pg_strcoll(), pg_strxfrm(), and variants.

In preparation for multi-lib ICU, which should be based on a clean
separation of the routines required for collation providers. Also
offers a generally better separation of responsibilities.

Callers with NUL-terminated strings should call pg_strcoll() or
pg_strxfrm(); callers with strings and their length should call the
variants pg_strncoll() or pg_strnxfrm().

Also remove the TRUST_STRXFRM define, and replace with a developer GUC
for easier testing.
---
 src/backend/access/hash/hashfunc.c  |  45 +-
 src/backend/utils/adt/pg_locale.c   | 783 +++++++++++++++++++++++++++-
 src/backend/utils/adt/varchar.c     |  41 +-
 src/backend/utils/adt/varlena.c     | 368 ++-----------
 src/backend/utils/misc/guc_tables.c |  11 +
 src/include/utils/pg_locale.h       |  14 +
 6 files changed, 874 insertions(+), 388 deletions(-)
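
The hash functions in this patch call pg_strnxfrm() twice: once with a NULL destination and size 0 to learn the required size, then again to fill an allocated buffer, and they raise an error if the two sizes disagree. A minimal sketch of that size-then-fill idiom follows, using plain libc strxfrm() as a stand-in because pg_strnxfrm() is only available inside the backend. The `sketch_` function names are hypothetical. One difference from the patch: strxfrm()'s return value excludes the terminating NUL, so the stand-in allocates one extra byte.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a sort key for a nul-terminated string with the size-then-fill
 * idiom.  Returns a malloc'd buffer the caller must free, or NULL on
 * allocation failure.
 */
char *
sketch_make_sort_key(const char *src)
{
	size_t		needed;
	size_t		written;
	char	   *buf;

	/* First call: destination size 0, so the destination may be NULL. */
	needed = strxfrm(NULL, src, 0);

	buf = malloc(needed + 1);
	if (buf == NULL)
		return NULL;

	/* Second call fills the buffer and must report the same length. */
	written = strxfrm(buf, src, needed + 1);
	assert(written == needed);

	return buf;
}

/* Compare two strings by comparing their sort keys with strcmp(). */
int
sketch_compare_keys(const char *a, const char *b)
{
	char	   *ka = sketch_make_sort_key(a);
	char	   *kb = sketch_make_sort_key(b);
	int			result;

	assert(ka != NULL && kb != NULL);
	result = strcmp(ka, kb);

	free(ka);
	free(kb);
	return result;
}
```

A program that has not called setlocale() runs in the "C" locale, so the keys here order the same way as strcmp() on the inputs. The size check on the second call mirrors the "returned unexpected result" error in the patch.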

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index e3e40d6c21..c0ed995919 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -292,21 +292,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -350,21 +348,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 059e4fd79f..509ef0784e 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -88,6 +94,9 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool trust_strxfrm = false;
+
 /*
  * lc_time localization cache.
  *
@@ -123,6 +132,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1731,15 +1753,716 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * pg_strncoll_libc_win32_utf8
+ *
+ * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
+ * invoke wcscoll() or wcscoll_l().
+ */
+#ifdef WIN32
+static int
+pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
+							size_t len2, pg_locale_t locale)
+{
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
+	char	   *a1p,
+			   *a2p;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
+	int			r;
+	int			result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+#ifndef WIN32
+	Assert(false);
+#endif
+
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * pg_strcoll_libc
+ *
+ * Call strcoll(), strcoll_l(), wcscoll(), or wcscoll_l() as appropriate for
+ * the given locale, platform, and database encoding. If the locale is NULL,
+ * use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ */
+static int
+pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+#ifdef WIN32
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_libc
+ *
+ * Null-terminate the arguments and call pg_strcoll_libc().
+ */
+static int
+pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	  = sbuf;
+	size_t	 bufsize1 = len1 + 1;
+	size_t	 bufsize2 = len2 + 1;
+	char	*arg1n;
+	char	*arg2n;
+	int		 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef WIN32
+	/* check for this case before doing the work for nul-termination */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+#endif							/* WIN32 */
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	arg1n = buf;
+	arg2n = buf + bufsize1;
+
+	/* nul-terminate arguments */
+	memcpy(arg1n, arg1, len1);
+	arg1n[len1] = '\0';
+	memcpy(arg2n, arg2, len2);
+	arg2n[len2] = '\0';
+
+	result = pg_strcoll_libc(arg1n, arg2n, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
 
 #ifdef USE_ICU
+
 /*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
+ * pg_strncoll_icu_no_utf8
+ *
+ * Convert the arguments from the database encoding to UChar strings, then
+ * call ucol_strcoll().
+ *
+ * When the database encoding is UTF-8, and ICU supports ucol_strcollUTF8(),
+ * caller should call that instead.
  */
-static UConverter *icu_converter = NULL;
+static int
+pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
+						const char *arg2, size_t len2, pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+#endif
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
+
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_icu
+ *
+ * Call ucol_strcollUTF8() or ucol_strcoll() as appropriate for the given
+ * database encoding.
+ *
+ * Arguments must be encoded in the database encoding.
+ */
+static int
+pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				pg_locale_t locale)
+{
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		result = pg_strncoll_icu_no_utf8(arg1, len1, arg2, len2, locale);
+	}
+
+	return result;
+}
+
+/*
+ * pg_strcoll_icu
+ *
+ * Pass -1 for both lengths (arguments are nul-terminated) to pg_strncoll_icu().
+ */
+static int
+pg_strcoll_icu(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strncoll_icu(arg1, -1, arg2, -1, locale);
+}
+
+#endif							/* USE_ICU */
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * The caller is responsible for breaking ties if the collation is
+ * deterministic; this maintains consistency with pg_strxfrm(), which cannot
+ * easily account for deterministic collations.
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strcoll_libc(arg1, arg2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strcoll_icu(arg1, arg2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * This function may need to nul-terminate the arguments for libc functions;
+ * so if the caller already has nul-terminated strings, it should call
+ * pg_strcoll() instead.
+ *
+ * The caller is responsible for breaking ties if the collation is
+ * deterministic; this maintains consistency with pg_strnxfrm(), which cannot
+ * easily account for deterministic collations.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strncoll_libc(arg1, len1, arg2, len2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strncoll_icu(arg1, len1, arg2, len2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+
+static size_t
+pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
+				pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef TRUST_STRXFRM
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		return strxfrm_l(dest, src, destsize, locale->info.lt);
+	else
+#endif
+		return strxfrm(dest, src, destsize);
+#else
+	/* shouldn't happen */
+	elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+}
+
+static size_t
+pg_strnxfrm_libc(char *dest, const char *src, size_t srclen, size_t destsize,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	 = sbuf;
+	size_t	 bufsize = srclen + 1;
+	size_t	 result;
 
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+	if (bufsize > TEXTBUFLEN)
+		buf = palloc(bufsize);
+
+	/* nul-terminate arguments */
+	memcpy(buf, src, srclen);
+	buf[srclen] = '\0';
+
+	result = pg_strxfrm_libc(dest, buf, destsize, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+static size_t
+pg_strnxfrm_prefix_libc(char *dest, const char *src, size_t srclen,
+						size_t destsize, pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strnxfrm_prefix()",
+		 COLLPROVIDER_LIBC);
+}
+
+#ifdef USE_ICU
+
+static size_t
+pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
+				pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	= sbuf;
+	UChar	*uchar;
+	int32_t	 ulen;
+	size_t   uchar_bsize;
+	Size	 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
+								   uchar, ulen,
+								   (uint8_t *) dest, destsize);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result_bsize;
+}
+
+static size_t
+pg_strxfrm_icu(char *dest, const char *src, size_t destsize,
+			   pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_icu(dest, src, -1, destsize, locale);
+}
+
+static size_t
+pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
+							   size_t destsize, pg_locale_t locale)
+{
+	char			 sbuf[TEXTBUFLEN];
+	char			*buf   = sbuf;
+	UCharIterator	 iter;
+	uint32_t		 state[2];
+	UErrorCode		 status;
+	int32_t			 ulen  = -1;
+	UChar			*uchar = NULL;
+	size_t			 uchar_bsize;
+	Size			 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	uiter_setString(&iter, uchar, ulen);
+	state[0] = state[1] = 0;	/* won't need that again */
+	status = U_ZERO_ERROR;
+	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
+										&iter,
+										state,
+										(uint8_t *) dest,
+										destsize,
+										&status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("sort key generation failed: %s",
+						u_errorName(status))));
+
+	return result_bsize;
+}
+
+static size_t
+pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
+					   size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UCharIterator iter;
+		uint32_t	state[2];
+		UErrorCode	status;
+
+		uiter_setUTF8(&iter, src, srclen);
+		state[0] = state[1] = 0;	/* won't need that again */
+		status = U_ZERO_ERROR;
+		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
+									  &iter,
+									  state,
+									  (uint8_t *) dest,
+									  destsize,
+									  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("sort key generation failed: %s",
+							u_errorName(status))));
+	}
+	else
+		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
+												locale);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
+					  pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_prefix_icu(dest, src, -1, destsize, locale);
+}
+
+#endif
+
+/*
+ * Return true if the collation provider supports pg_strxfrm() and
+ * pg_strnxfrm(); otherwise false.
+ *
+ * Unfortunately, it seems that strxfrm() for non-C collations is broken on
+ * many common platforms; testing of multiple versions of glibc reveals that,
+ * for many locales, strcoll() and strxfrm() do not return consistent
+ * results. While no libc other than Cygwin has so far been shown to
+ * have a problem, we take the conservative course of action for right now and
+ * disable this categorically.  (Users who are certain this isn't a problem on
+ * their system can set the developer GUC "trust_strxfrm".)
+ *
+ * No similar problem is known for the ICU provider.
+ */
+bool
+pg_strxfrm_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return trust_strxfrm;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ */
+size_t
+pg_strxfrm(char *dest, const char *src, size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm() instead.
+ */
+size_t
+pg_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen,
+			pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * Return true if the collation provider supports pg_strxfrm_prefix() and
+ * pg_strnxfrm_prefix(); otherwise false.
+ */
+bool
+pg_strxfrm_prefix_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return false;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ */
+size_t
+pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+				  pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		elog(ERROR, "collprovider '%c' does not support pg_strxfrm_prefix()",
+			 COLLPROVIDER_LIBC);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_prefix_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' need not be nul-terminated ('srclen' gives its length).
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm_prefix() instead.
+ */
+size_t
+pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+				   size_t srclen, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_prefix_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_prefix_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+#ifdef USE_ICU
 static void
 init_icu_converter(void)
 {
@@ -1767,6 +2490,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	status = U_ZERO_ERROR;
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -1782,26 +2538,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 8ddbae8f51..5bbba94687 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1024,21 +1024,17 @@ hashbpchar(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -1086,21 +1082,18 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 33ffdb013a..5030328f31 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1537,10 +1537,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1557,171 +1553,16 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 
 		/* Break tie if necessary. */
 		if (result == 0 &&
 			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		{
+			result = memcmp(arg1, arg2, Min(len1, len2));
+			if ((result == 0) && (len1 != len2))
+				result = (len1 < len2) ? -1 : 1;
+		}
 	}
 
 	return result;
@@ -2057,20 +1898,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
@@ -2086,13 +1913,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 
 	/*
 	 * Unfortunately, it seems that abbreviation for non-C collations is
-	 * broken on many common platforms; testing of multiple versions of glibc
-	 * reveals that, for many locales, strcoll() and strxfrm() do not return
-	 * consistent results, which is fatal to this optimization.  While no
-	 * other libc other than Cygwin has so far been shown to have a problem,
-	 * we take the conservative course of action for right now and disable
-	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * broken on many common platforms; see pg_strxfrm_enabled().
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -2105,10 +1926,8 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!collate_c && !pg_strxfrm_enabled(locale))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
@@ -2379,60 +2198,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Break tie if necessary. */
 	if (result == 0 &&
@@ -2455,6 +2221,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 static Datum
 varstr_abbrev_convert(Datum original, SortSupport ssup)
 {
+	const size_t max_prefix_bytes = sizeof(Datum);
 	VarStringSortSupport *sss = (VarStringSortSupport *) ssup->ssup_extra;
 	VarString  *authoritative = DatumGetVarStringPP(original);
 	char	   *authoritative_data = VARDATA_ANY(authoritative);
@@ -2467,7 +2234,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
-	memset(pres, 0, sizeof(Datum));
+	memset(pres, 0, max_prefix_bytes);
 	len = VARSIZE_ANY_EXHDR(authoritative);
 
 	/* Get number of bytes, ignoring trailing spaces */
@@ -2502,14 +2269,10 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 	 * thing: explicitly consider string length.
 	 */
 	if (sss->collate_c)
-		memcpy(pres, authoritative_data, Min(len, sizeof(Datum)));
+		memcpy(pres, authoritative_data, Min(len, max_prefix_bytes));
 	else
 	{
 		Size		bsize;
-#ifdef USE_ICU
-		int32_t		ulen = -1;
-		UChar	   *uchar = NULL;
-#endif
 
 		/*
 		 * We're not using the C collation, so fall back on strxfrm or ICU
@@ -2527,7 +2290,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		if (sss->last_len1 == len && sss->cache_blob &&
 			memcmp(sss->buf1, authoritative_data, len) == 0)
 		{
-			memcpy(pres, sss->buf2, Min(sizeof(Datum), sss->last_len2));
+			memcpy(pres, sss->buf2, Min(max_prefix_bytes, sss->last_len2));
 			/* No change affecting cardinality, so no hashing required */
 			goto done;
 		}
@@ -2535,81 +2298,49 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		memcpy(sss->buf1, authoritative_data, len);
 
 		/*
-		 * Just like strcoll(), strxfrm() expects a NUL-terminated string. Not
-		 * necessary for ICU, but doesn't hurt.
+		 * pg_strxfrm() and pg_strxfrm_prefix expect NUL-terminated
+		 * strings.
 		 */
 		sss->buf1[len] = '\0';
 		sss->last_len1 = len;
 
-#ifdef USE_ICU
-		/* When using ICU and not UTF8, convert string to UChar. */
-		if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU &&
-			GetDatabaseEncoding() != PG_UTF8)
-			ulen = icu_to_uchar(&uchar, sss->buf1, len);
-#endif
-
-		/*
-		 * Loop: Call strxfrm() or ucol_getSortKey(), possibly enlarge buffer,
-		 * and try again.  Both of these functions have the result buffer
-		 * content undefined if the result did not fit, so we need to retry
-		 * until everything fits, even though we only need the first few bytes
-		 * in the end.  When using ucol_nextSortKeyPart(), however, we only
-		 * ask for as many bytes as we actually need.
-		 */
-		for (;;)
+		if (pg_strxfrm_prefix_enabled(sss->locale))
 		{
-#ifdef USE_ICU
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+			if (sss->buflen2 < max_prefix_bytes)
 			{
-				/*
-				 * When using UTF8, use the iteration interface so we only
-				 * need to produce as many bytes as we actually need.
-				 */
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UCharIterator iter;
-					uint32_t	state[2];
-					UErrorCode	status;
-
-					uiter_setUTF8(&iter, sss->buf1, len);
-					state[0] = state[1] = 0;	/* won't need that again */
-					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
-				}
-				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+				sss->buflen2 = Max(max_prefix_bytes,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
 			}
-			else
-#endif
-#ifdef HAVE_LOCALE_T
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_LIBC)
-				bsize = strxfrm_l(sss->buf2, sss->buf1,
-								  sss->buflen2, sss->locale->info.lt);
-			else
-#endif
-				bsize = strxfrm(sss->buf2, sss->buf1, sss->buflen2);
-
-			sss->last_len2 = bsize;
-			if (bsize < sss->buflen2)
-				break;
 
+			bsize = pg_strxfrm_prefix(sss->buf2, sss->buf1,
+									  max_prefix_bytes, sss->locale);
+		}
+		else
+		{
 			/*
-			 * Grow buffer and retry.
+			 * Loop: Call pg_strxfrm(), possibly enlarge buffer, and try
+			 * again.  The pg_strxfrm() function leaves the result buffer
+			 * content undefined if the result did not fit, so we need to
+			 * retry until everything fits, even though we only need the first
+			 * few bytes in the end.
 			 */
-			sss->buflen2 = Max(bsize + 1,
-							   Min(sss->buflen2 * 2, MaxAllocSize));
-			sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			for (;;)
+			{
+				bsize = pg_strxfrm(sss->buf2, sss->buf1, sss->buflen2,
+								   sss->locale);
+
+				sss->last_len2 = bsize;
+				if (bsize < sss->buflen2)
+					break;
+
+				/*
+				 * Grow buffer and retry.
+				 */
+				sss->buflen2 = Max(bsize + 1,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			}
 		}
 
 		/*
@@ -2621,12 +2352,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		 * (Actually, even if there were NUL bytes in the blob it would be
 		 * okay.  See remarks on bytea case above.)
 		 */
-		memcpy(pres, sss->buf2, Min(sizeof(Datum), bsize));
-
-#ifdef USE_ICU
-		if (uchar)
-			pfree(uchar);
-#endif
+		memcpy(pres, sss->buf2, Min(max_prefix_bytes, bsize));
 	}
 
 	/*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cd0fc2cb8f..4562b028c7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -1954,6 +1954,17 @@ struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"trust_strxfrm", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Allow use of strxfrm() for abbreviated keys optimization for libc provider."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&trust_strxfrm,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
 			gettext_noop("Whether to continue running after a failure to sync data files."),
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index cede43440b..c08e6ce07f 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -40,6 +40,7 @@ extern PGDLLIMPORT char *locale_messages;
 extern PGDLLIMPORT char *locale_monetary;
 extern PGDLLIMPORT char *locale_numeric;
 extern PGDLLIMPORT char *locale_time;
+extern PGDLLIMPORT bool trust_strxfrm;
 
 /* lc_time localization cache */
 extern PGDLLIMPORT char *localized_abbrev_days[];
@@ -100,6 +101,19 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
+extern bool pg_strxfrm_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
+						 pg_locale_t locale);
+extern size_t pg_strnxfrm(char *dest, size_t destsize, const char *src,
+						  size_t srclen, pg_locale_t locale);
+extern bool pg_strxfrm_prefix_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+								pg_locale_t locale);
+extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+								 size_t srclen, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
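
The hashbpchar() and hashbpcharextended() hunks above size the sort key by calling pg_strnxfrm() once with a NULL destination, allocating, and then calling it again. A minimal sketch of the same two-pass pattern, using plain strxfrm() from the C library rather than the patch's functions; make_sort_key() is a hypothetical helper name used only for illustration:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): build a malloc'd sort key
 * for the nul-terminated string 'src' under the current LC_COLLATE locale.
 * On success, stores the key length (excluding the terminator) in *keylen.
 * Returns NULL on allocation failure or if the two passes disagree.
 */
char *
make_sort_key(const char *src, size_t *keylen)
{
	size_t		needed;
	size_t		written;
	char	   *key;

	/* Pass 1: with size 0 the destination may be NULL; only the length is reported. */
	needed = strxfrm(NULL, src, 0);

	key = malloc(needed + 1);
	if (key == NULL)
		return NULL;

	/* Pass 2: produce the key; the length should match the first pass. */
	written = strxfrm(key, src, needed + 1);
	if (written != needed)
	{
		free(key);
		return NULL;
	}

	*keylen = written;
	return key;
}
```

Two keys built this way can be compared with strcmp() in place of calling strcoll() on the original strings. The caller is responsible for free()ing the result.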

#16

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#15)

Re: Rework of collation code, extensibility

On Fri, 2023-01-20 at 12:54 -0800, Jeff Davis wrote:

Both of these are surprising, and I haven't investigated deeply yet.

It's just because autoconf defaults to -O2 and meson to -O3, at least
on my machine. It turns out that, at -O2, master and the refactoring
branch are even; but at -O3, both get faster, and the refactoring pulls
ahead by a few percentage points.

At least that's what's happening for en-US-x-icu on UTF-8 with my test
data set. I didn't see much of a difference in other situations, but I
didn't retest those other situations this time around.

We should still look into why disabling abbreviated keys improves
performance in some cases. Maybe we need a GUC for that?
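
For readers less familiar with the optimization: an abbreviated key is a fixed-size, zero-padded prefix of the transformed string that can be compared cheaply, with a full comparison needed only when the prefixes tie. The following is a rough sketch of that idea only, not PostgreSQL's implementation; AbbrevKey, ABBREV_BYTES, abbrev_from_blob() and abbrev_cmp() are hypothetical names, and 8 bytes is assumed as a stand-in for sizeof(Datum) on a 64-bit build.

```c
#include <string.h>

#define ABBREV_BYTES 8			/* assumption: stand-in for sizeof(Datum) */

/* Hypothetical fixed-size abbreviated key. */
typedef struct
{
	unsigned char bytes[ABBREV_BYTES];
} AbbrevKey;

/*
 * Build an abbreviated key from the first bytes of a transformed blob.
 * Bytes not overwritten stay zero, so short blobs are zero-padded.
 */
AbbrevKey
abbrev_from_blob(const char *blob, size_t bloblen)
{
	AbbrevKey	key;
	size_t		n = (bloblen < ABBREV_BYTES) ? bloblen : ABBREV_BYTES;

	memset(key.bytes, 0, ABBREV_BYTES);
	memcpy(key.bytes, blob, n);
	return key;
}

/*
 * Compare two abbreviated keys. A nonzero result decides the order; zero
 * means "inconclusive" and the caller must compare the full strings.
 */
int
abbrev_cmp(const AbbrevKey *a, const AbbrevKey *b)
{
	return memcmp(a->bytes, b->bytes, ABBREV_BYTES);
}
```

Inputs that share their first ABBREV_BYTES bytes always tie here, so every such pair pays for both the abbreviated and the full comparison.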

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#17

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#15)

4 attachment(s)

Re: Rework of collation code, extensibility

Attached v9 and added perf numbers below.

I'm hoping to commit 0002 and 0003 soon-ish, maybe in a week or two;
please let me know if you want me to hold off. (I won't commit the GUCs
unless others find them generally useful; they are included here to
make my performance tests easier to reproduce.)

My primary motivation is still related to
https://commitfest.postgresql.org/41/3956/ but the combination of
cleaner code and a performance boost seems like reasonable
justification for this patch set independently.

There aren't any clear open items on this patch. Peter Eisentraut asked
me to focus this thread on the refactoring, which I've done by reducing
it to 2 patches, and I left multilib ICU up to the other thread. He
also questioned the increased line count, but I think the currently-low
line count is due to bad style. PeterG provided some review comments,
in particular when to do the tiebreaking, which I addressed.
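
As an illustration of the tie-break as it appears in the varstr_cmp() hunk earlier in this thread (raw bytes compared over the shorter length, then the lengths themselves), here is a stand-alone sketch. tiebreak_bytes() is a hypothetical name; in the patch this logic is inline and runs only when the collation comparison returns 0 and the collation is deterministic.

```c
#include <string.h>

/*
 * Hypothetical stand-alone version of the bytewise tie-break: compare the
 * common prefix with memcmp(), and if that ties, the shorter input sorts
 * first. Neither input needs to be nul-terminated.
 */
int
tiebreak_bytes(const char *arg1, size_t len1, const char *arg2, size_t len2)
{
	size_t		minlen = (len1 < len2) ? len1 : len2;
	int			result = memcmp(arg1, arg2, minlen);

	if (result == 0 && len1 != len2)
		result = (len1 < len2) ? -1 : 1;

	return result;
}
```

Because it works on explicit lengths, this form does not require the nul-terminated copies that the old strcmp()-based tie-break needed.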

This patch has been around for a while, so I'll take a fresh look and
see if I see risk areas, and re-run a few sanity checks. Of course more
feedback would also be welcome.

PERFORMANCE:

======
Setup:
======

base: master with v9-0001 applied (GUCs only)
refactor: master with v9-0001, v9-0002, v9-0003 applied

Note that I wasn't able to see any performance difference between
base and master; v9-0001 just adds some GUCs to make testing easier.

glibc 2.35      ICU 70.1
gcc 11.3.0      LLVM 14.0.0

built with meson (uses -O3)

$ perl text_generator.pl 10000000 10 > /tmp/strings.utf8.txt

CREATE TABLE s (t TEXT);
COPY s FROM '/tmp/strings.utf8.txt';
VACUUM FREEZE s;
CHECKPOINT;
SET work_mem='10GB';
SET max_parallel_workers = 0;
SET max_parallel_workers_per_gather = 0;

=============
Test queries:
=============

EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "C";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en_US";
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";

Timings are measured as the milliseconds to return the first tuple from
the Sort operator (as reported in EXPLAIN ANALYZE). Median of three
runs.

========
Results:
========

                 base    refactor    speedup

sort_abbreviated_keys=false:
  C              7377        7273       1.4%
  en_US         35081       35090       0.0%
  en-US-x-icu   20520       19465       5.4%

sort_abbreviated_keys=true:
  C              8105        8008       1.2%
  en_US         35067       34850       0.6%
  en-US-x-icu   22626       21507       5.2%
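
The speedup column appears to be consistent with base/refactor - 1, expressed as a percentage; that formula is inferred from the numbers and is not stated in the message. A small sketch, with hypothetical helper names, of the median-of-three timing and that speedup calculation:

```c
/*
 * Hypothetical helper: median of three timings (for example, three
 * EXPLAIN ANALYZE runs, in milliseconds).
 */
double
median3(double a, double b, double c)
{
	if ((a <= b && b <= c) || (c <= b && b <= a))
		return b;
	if ((b <= a && a <= c) || (c <= a && a <= b))
		return a;
	return c;
}

/*
 * Hypothetical helper: speedup of 'refactor' over 'base' in percent,
 * computed as base/refactor - 1 (an inferred formula, see above).
 */
double
speedup_percent(double base_ms, double refactor_ms)
{
	return (base_ms / refactor_ms - 1.0) * 100.0;
}
```

For example, speedup_percent(20520, 19465) is about 5.42, which rounds to the 5.4% shown for ICU with abbreviated keys disabled.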

===========
Conclusion:
===========

These numbers can move +/-1 percentage point, so I'd interpret anything
less than that as noise. This happens to be the first run where all the
numbers favored the refactoring patch, but it is generally consistent
with what I had seen before.

The important part is that, for ICU, it appears to be a substantial
speedup when using meson (-O3).

Also, when/if the multilib ICU support goes in, that may lose some of
these gains due to an extra indirect function call.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

text_generator.pl (application/x-perl)

v9-0001-Introduce-GUCs-to-control-abbreviated-keys-sort-o.patch (text/x-patch; charset=UTF-8)

From 39ed011cc51ba3a4af5e3b559a7b8de25fb895a5 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Sat, 21 Jan 2023 12:44:07 -0800
Subject: [PATCH v9 1/3] Introduce GUCs to control abbreviated keys sort
 optimization.

The setting sort_abbreviated_keys turns the optimization on or off
overall. The optimization relies on collation providers, which are
complex dependencies, and the performance of the optimization may rely
on many factors. Introducing a GUC allows easier diagnosis when this
optimization results in worse performance.

The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing
users to experiment with the abbreviated keys optimization when using
the libc provider. Previously, the optimization only applied to
collations using the ICU provider unless specially compiled. By
default, allowed only for superusers (because an incorrect setting
could lead to wrong results), but can be granted to others.
---
 doc/src/sgml/config.sgml                   | 40 ++++++++++++++++++++++
 src/backend/utils/adt/varlena.c            | 10 +++---
 src/backend/utils/misc/guc_tables.c        | 24 +++++++++++++
 src/backend/utils/sort/tuplesortvariants.c | 17 ++++++---
 4 files changed, 82 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f985afc009..8f55b89f35 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11252,6 +11252,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-sort-abbreviated-keys" xreflabel="sort_abbreviated_keys">
+      <term><varname>sort_abbreviated_keys</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>sort_abbreviated_keys</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the use of abbreviated sort keys, a sort optimization,
+        if applicable. The default is <literal>true</literal>. Disabling may
+        be useful to diagnose problems or measure performance.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-trust-strxfrm" xreflabel="trust_strxfrm">
+      <term><varname>trust_strxfrm</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>trust_strxfrm</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Abbreviated keys, a sort optimization, depends on correct behavior of
+        the operating system function <function>strxfrm()</function> when
+        using a collation with the <literal>libc</literal> provider. On some
+        platforms <function>strxfrm()</function> does not return results
+        consistent with <function>strcoll()</function>, which means the
+        optimization could return wrong results. Set to
+        <literal>true</literal> if it is certain that
+        <function>strxfrm()</function> can be trusted.
+       </para>
+       <para>
+        The default value is <literal>false</literal>. This setting has no
+        effect if <xref linkend="guc-sort-abbreviated-keys"/> is set to
+        <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-trace-locks" xreflabel="trace_locks">
       <term><varname>trace_locks</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index fd81c47474..c270022483 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -45,6 +45,9 @@
 /* GUC variable */
 int			bytea_output = BYTEA_OUTPUT_HEX;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool		trust_strxfrm = false;
+
 typedef struct varlena unknown;
 typedef struct varlena VarString;
 
@@ -2115,7 +2118,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * other libc other than Cygwin has so far been shown to have a problem,
 	 * we take the conservative course of action for right now and disable
 	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * their system can set the trust_strxfrm GUC to true.)
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -2128,10 +2131,9 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!trust_strxfrm && !collate_c &&
+		!(locale && locale->provider == COLLPROVIDER_ICU))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4ac808ed22..fd4a02fbf5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -102,6 +102,8 @@ extern bool trace_syncscan;
 #ifdef DEBUG_BOUNDED_SORT
 extern bool optimize_bounded_sort;
 #endif
+extern bool sort_abbreviated_keys;
+extern bool trust_strxfrm;
 
 /*
  * Options for enum values defined in this module.
@@ -1673,6 +1675,28 @@ struct config_bool ConfigureNamesBool[] =
 	},
 #endif
 
+	{
+		{"sort_abbreviated_keys", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Enables the use of abbreviated sort keys."),
+			NULL,
+			GUC_NOT_IN_SAMPLE | GUC_EXPLAIN
+		},
+		&sort_abbreviated_keys,
+		true,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"trust_strxfrm", PGC_SUSET, DEVELOPER_OPTIONS,
+		 gettext_noop("Allow use of strxfrm() for abbreviated keys optimization for libc provider."),
+		 NULL,
+		 GUC_NOT_IN_SAMPLE
+		},
+		&trust_strxfrm,
+		false,
+		NULL, NULL, NULL
+	},
+
 #ifdef WAL_DEBUG
 	{
 		{"wal_debug", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..ba16779f97 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -37,6 +37,8 @@
 #define DATUM_SORT		2
 #define CLUSTER_SORT	3
 
+bool sort_abbreviated_keys = true;
+
 static void removeabbrev_heap(Tuplesortstate *state, SortTuple *stups,
 							  int count);
 static void removeabbrev_cluster(Tuplesortstate *state, SortTuple *stups,
@@ -185,7 +187,8 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0 && base->haveDatum1);
+		if (sort_abbreviated_keys)
+			sortKey->abbreviate = (i == 0 && base->haveDatum1);
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -295,7 +298,8 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
 		sortKey->ssup_attno = scanKey->sk_attno;
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0 && base->haveDatum1);
+		if (sort_abbreviated_keys)
+			sortKey->abbreviate = (i == 0 && base->haveDatum1);
 
 		Assert(sortKey->ssup_attno != 0);
 
@@ -379,7 +383,8 @@ tuplesort_begin_index_btree(Relation heapRel,
 			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
 		sortKey->ssup_attno = scanKey->sk_attno;
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0 && base->haveDatum1);
+		if (sort_abbreviated_keys)
+			sortKey->abbreviate = (i == 0 && base->haveDatum1);
 
 		Assert(sortKey->ssup_attno != 0);
 
@@ -499,7 +504,8 @@ tuplesort_begin_index_gist(Relation heapRel,
 		sortKey->ssup_nulls_first = false;
 		sortKey->ssup_attno = i + 1;
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0 && base->haveDatum1);
+		if (sort_abbreviated_keys)
+			sortKey->abbreviate = (i == 0 && base->haveDatum1);
 
 		Assert(sortKey->ssup_attno != 0);
 
@@ -573,7 +579,8 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	 * can't, because a datum sort only stores a single copy of the datum; the
 	 * "tuple" field of each SortTuple is NULL.
 	 */
-	base->sortKeys->abbreviate = !typbyval;
+	if (sort_abbreviated_keys)
+		base->sortKeys->abbreviate = !typbyval;
 
 	PrepareSortSupportFromOrderingOp(sortOperator, base->sortKeys);
 
-- 
2.34.1

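Note on the 0001 patch: the varstr_sortsupport() hunk turns the compile-time TRUST_STRXFRM check into a run-time test of the trust_strxfrm GUC. The decision it encodes can be sketched as a small standalone function. The enum and the function name below are hypothetical stand-ins for pg_locale_t and the in-line condition; only the boolean logic is taken from the hunk.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the provider field of pg_locale_t. */
typedef enum
{
    SKETCH_PROVIDER_NONE,       /* no locale object (database default) */
    SKETCH_PROVIDER_LIBC,
    SKETCH_PROVIDER_ICU
} SketchProvider;

/*
 * Returns whether abbreviated keys stay enabled. Mirrors the condition in
 * the hunk: abbreviation is switched off only when strxfrm() is not
 * trusted, the collation is not "C", and the provider is not ICU.
 */
static bool
sketch_keep_abbreviation(bool abbreviate, bool trust_strxfrm,
                         bool collate_c, SketchProvider provider)
{
    if (!trust_strxfrm && !collate_c && provider != SKETCH_PROVIDER_ICU)
        abbreviate = false;
    return abbreviate;
}
```

With trust_strxfrm left at its default of false, only "C" collations and ICU collations keep abbreviated keys; turning the GUC on re-enables them for libc collations as well.
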
v9-0002-Add-pg_strcoll-pg_strxfrm-and-variants.patch (text/x-patch; charset=UTF-8)

From ec820c40e7a10df7a261f0056eb62e4429766b70 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v9 2/3] Add pg_strcoll(), pg_strxfrm(), and variants.

This prepares for multi-lib ICU, which should be based on a clean
separation of the routines required for collation providers. It also
offers a generally better separation of responsibilities.

Callers with NUL-terminated strings should call pg_strcoll() or
pg_strxfrm(); callers with strings and their length should call the
variants pg_strncoll() or pg_strnxfrm().

Also remove the TRUST_STRXFRM define, and replace it with a developer
GUC for easier testing.
---
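
Note: the hash functions in this patch call pg_strnxfrm() twice, first with a NULL destination and size 0 to learn the key size, then again to fill an allocated buffer. A minimal standalone sketch of that two-pass pattern follows. It uses the C library's strxfrm() as a stand-in for pg_strnxfrm() and malloc() in place of palloc(); make_sort_key() is a hypothetical helper, not something the patch adds. The stack buffer with a heap fallback follows the TEXTBUFLEN idiom used in pg_locale.c.

```c
#include <stdlib.h>
#include <string.h>

#define TEXTBUFLEN 1024

/*
 * Hypothetical helper: build a sort key for the first 'srclen' bytes of
 * 'src' in the current LC_COLLATE locale. Returns a malloc'd,
 * NUL-terminated key and stores its length in *keylen, or returns NULL
 * on failure.
 */
static char *
make_sort_key(const char *src, size_t srclen, size_t *keylen)
{
    char        sbuf[TEXTBUFLEN];
    char       *buf = sbuf;
    char       *key;
    size_t      needed;

    /* strxfrm() wants NUL-terminated input, so copy and terminate */
    if (srclen + 1 > TEXTBUFLEN)
    {
        buf = malloc(srclen + 1);
        if (buf == NULL)
            return NULL;
    }
    memcpy(buf, src, srclen);
    buf[srclen] = '\0';

    /* pass 1: ask for the required size; dest may be NULL when size is 0 */
    needed = strxfrm(NULL, buf, 0);

    /* pass 2: allocate and transform, then check the sizes agree */
    key = malloc(needed + 1);
    if (key != NULL)
    {
        if (strxfrm(key, buf, needed + 1) != needed)
        {
            free(key);
            key = NULL;
        }
        else
            *keylen = needed;
    }

    if (buf != sbuf)
        free(buf);
    return key;
}
```

The size check after the second call plays the same role as the "rsize != bsize" test in hashtext(). Note that strxfrm() reports the length without the terminating NUL, which may not match how every provider counts, so this is only an illustration of the calling pattern.
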
 src/backend/access/hash/hashfunc.c |  45 +-
 src/backend/utils/adt/pg_locale.c  | 783 ++++++++++++++++++++++++++++-
 src/backend/utils/adt/varchar.c    |  41 +-
 src/backend/utils/adt/varlena.c    | 370 ++------------
 src/include/utils/pg_locale.h      |  13 +
 5 files changed, 862 insertions(+), 390 deletions(-)

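Note: with this patch the tie-break for deterministic collations moves out of the provider-specific code and becomes the caller's job, as the pg_strcoll() comment says. A rough standalone sketch of what a caller does, assuming libc's strcoll() in place of pg_strcoll() and a hypothetical function name, mirroring the memcmp-then-length logic that the patch removes from the Win32 branch of varstr_cmp():

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical caller-side comparison: collate first, then break ties
 * bytewise when the collation is deterministic.
 */
static int
sketch_compare(const char *a, const char *b, bool deterministic)
{
    int         result = strcoll(a, b);

    if (result == 0 && deterministic)
    {
        size_t      alen = strlen(a);
        size_t      blen = strlen(b);
        size_t      minlen = (alen < blen) ? alen : blen;

        /* compare the common prefix, then fall back to length */
        result = memcmp(a, b, minlen);
        if (result == 0 && alen != blen)
            result = (alen < blen) ? -1 : 1;
    }

    return result;
}
```

For a nondeterministic collation the collation result is returned unchanged, so strings that collate as equal stay equal.
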
diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index e3e40d6c21..c0ed995919 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -292,21 +292,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -350,21 +348,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 059e4fd79f..509ef0784e 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -88,6 +94,9 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool trust_strxfrm = false;
+
 /*
  * lc_time localization cache.
  *
@@ -123,6 +132,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1731,15 +1753,716 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * pg_strncoll_libc_win32_utf8
+ *
+ * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
+ * invoke wcscoll() or wcscoll_l().
+ */
+#ifdef WIN32
+static int
+pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
+							size_t len2, pg_locale_t locale)
+{
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
+	char	   *a1p,
+			   *a2p;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
+	int			r;
+	int			result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+#ifndef WIN32
+	Assert(false);
+#endif
+
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * pg_strcoll_libc
+ *
+ * Call strcoll(), strcoll_l(), wcscoll(), or wcscoll_l() as appropriate for
+ * the given locale, platform, and database encoding. If the locale is NULL,
+ * use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ */
+static int
+pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+#ifdef WIN32
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_libc
+ *
+ * Nul-terminate the arguments and call pg_strcoll_libc().
+ */
+static int
+pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	  = sbuf;
+	size_t	 bufsize1 = len1 + 1;
+	size_t	 bufsize2 = len2 + 1;
+	char	*arg1n;
+	char	*arg2n;
+	int		 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef WIN32
+	/* check for this case before doing the work for nul-termination */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+#endif							/* WIN32 */
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	arg1n = buf;
+	arg2n = buf + bufsize1;
+
+	/* nul-terminate arguments */
+	memcpy(arg1n, arg1, len1);
+	arg1n[len1] = '\0';
+	memcpy(arg2n, arg2, len2);
+	arg2n[len2] = '\0';
+
+	result = pg_strcoll_libc(arg1n, arg2n, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
 
 #ifdef USE_ICU
+
 /*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
+ * pg_strncoll_icu_no_utf8
+ *
+ * Convert the arguments from the database encoding to UChar strings, then
+ * call ucol_strcoll().
+ *
+ * When the database encoding is UTF-8, and ICU supports ucol_strcollUTF8(),
+ * caller should call that instead.
  */
-static UConverter *icu_converter = NULL;
+static int
+pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
+						const char *arg2, size_t len2, pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+#endif
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
+
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_icu
+ *
+ * Call ucol_strcollUTF8() or ucol_strcoll() as appropriate for the given
+ * database encoding.
+ *
+ * Arguments must be encoded in the database encoding.
+ */
+static int
+pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				pg_locale_t locale)
+{
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		result = pg_strncoll_icu_no_utf8(arg1, len1, arg2, len2, locale);
+	}
+
+	return result;
+}
+
+/*
+ * pg_strcoll_icu
+ *
+ * Call pg_strncoll_icu() with lengths of -1 (ICU: nul-terminated input).
+ */
+static int
+pg_strcoll_icu(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strncoll_icu(arg1, -1, arg2, -1, locale);
+}
+
+#endif							/* USE_ICU */
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * The caller is responsible for breaking ties if the collation is
+ * deterministic; this maintains consistency with pg_strxfrm(), which cannot
+ * easily account for deterministic collations.
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strcoll_libc(arg1, arg2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strcoll_icu(arg1, arg2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * This function may need to nul-terminate the arguments for libc functions;
+ * so if the caller already has nul-terminated strings, it should call
+ * pg_strcoll() instead.
+ *
+ * The caller is responsible for breaking ties if the collation is
+ * deterministic; this maintains consistency with pg_strnxfrm(), which cannot
+ * easily account for deterministic collations.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strncoll_libc(arg1, len1, arg2, len2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strncoll_icu(arg1, len1, arg2, len2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+
+static size_t
+pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
+				pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef TRUST_STRXFRM
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		return strxfrm_l(dest, src, destsize, locale->info.lt);
+	else
+#endif
+		return strxfrm(dest, src, destsize);
+#else
+	elog(ERROR, "unsupported collprovider: %c",	/* shouldn't happen */
+		 locale ? locale->provider : COLLPROVIDER_LIBC);
+#endif
+}
+
+static size_t
+pg_strnxfrm_libc(char *dest, const char *src, size_t srclen, size_t destsize,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	 = sbuf;
+	size_t	 bufsize = srclen + 1;
+	size_t	 result;
 
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+	if (bufsize > TEXTBUFLEN)
+		buf = palloc(bufsize);
+
+	/* nul-terminate arguments */
+	memcpy(buf, src, srclen);
+	buf[srclen] = '\0';
+
+	result = pg_strxfrm_libc(dest, buf, destsize, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+static size_t
+pg_strnxfrm_prefix_libc(char *dest, const char *src, size_t srclen,
+						size_t destsize, pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strnxfrm_prefix()",
+		 locale ? locale->provider : COLLPROVIDER_LIBC);
+}
+
+#ifdef USE_ICU
+
+static size_t
+pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
+				pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	= sbuf;
+	UChar	*uchar;
+	int32_t	 ulen;
+	size_t   uchar_bsize;
+	Size	 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
+								   uchar, ulen,
+								   (uint8_t *) dest, destsize);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result_bsize;
+}
+
+static size_t
+pg_strxfrm_icu(char *dest, const char *src, size_t destsize,
+			   pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_icu(dest, src, -1, destsize, locale);
+}
+
+static size_t
+pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
+							   size_t destsize, pg_locale_t locale)
+{
+	char			 sbuf[TEXTBUFLEN];
+	char			*buf   = sbuf;
+	UCharIterator	 iter;
+	uint32_t		 state[2];
+	UErrorCode		 status;
+	int32_t			 ulen  = -1;
+	UChar			*uchar = NULL;
+	size_t			 uchar_bsize;
+	Size			 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	uiter_setString(&iter, uchar, ulen);
+	state[0] = state[1] = 0;	/* won't need that again */
+	status = U_ZERO_ERROR;
+	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
+										&iter,
+										state,
+										(uint8_t *) dest,
+										destsize,
+										&status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("sort key generation failed: %s",
+						u_errorName(status))));
+
+	return result_bsize;
+}
+
+static size_t
+pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
+					   size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UCharIterator iter;
+		uint32_t	state[2];
+		UErrorCode	status;
+
+		uiter_setUTF8(&iter, src, srclen);
+		state[0] = state[1] = 0;	/* won't need that again */
+		status = U_ZERO_ERROR;
+		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
+									  &iter,
+									  state,
+									  (uint8_t *) dest,
+									  destsize,
+									  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("sort key generation failed: %s",
+							u_errorName(status))));
+	}
+	else
+		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
+												locale);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
+					  pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_prefix_icu(dest, src, -1, destsize, locale);
+}
+
+#endif
+
+/*
+ * Return true if the collation provider supports pg_strxfrm() and
+ * pg_strnxfrm(); otherwise false.
+ *
+ * Unfortunately, it seems that strxfrm() for non-C collations is broken on
+ * many common platforms; testing of multiple versions of glibc reveals that,
+ * for many locales, strcoll() and strxfrm() do not return consistent
+ * results. While no libc other than glibc and Cygwin has so far been shown
+ * to have a problem, we take the conservative course of action for right now
+ * and disable this categorically.  (Users who are certain this isn't a
+ * problem on their system can set the developer GUC "trust_strxfrm".)
+ *
+ * No similar problem is known for the ICU provider.
+ */
+bool
+pg_strxfrm_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return trust_strxfrm;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ */
+size_t
+pg_strxfrm(char *dest, const char *src, size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm() instead.
+ */
+size_t
+pg_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen,
+			pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * Return true if the collation provider supports pg_strxfrm_prefix() and
+ * pg_strnxfrm_prefix(); otherwise false.
+ */
+bool
+pg_strxfrm_prefix_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return false;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ */
+size_t
+pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+				  pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		elog(ERROR, "collprovider '%c' does not support pg_strxfrm_prefix()",
+			 locale ? locale->provider : COLLPROVIDER_LIBC);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_prefix_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' need not be nul-terminated; its length is 'srclen'.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm_prefix() instead.
+ */
+size_t
+pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+				   size_t srclen, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_prefix_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_prefix_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+#ifdef USE_ICU
 static void
 init_icu_converter(void)
 {
@@ -1767,6 +2490,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	status = U_ZERO_ERROR;
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -1782,26 +2538,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 8ddbae8f51..5bbba94687 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1024,21 +1024,17 @@ hashbpchar(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -1086,21 +1082,18 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index c270022483..5d1bf45009 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -45,9 +45,6 @@
 /* GUC variable */
 int			bytea_output = BYTEA_OUTPUT_HEX;
 
-/* GUC to enable use of strxfrm() for abbreviated keys */
-bool		trust_strxfrm = false;
-
 typedef struct varlena unknown;
 typedef struct varlena VarString;
 
@@ -1563,10 +1560,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1583,171 +1576,16 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 
 		/* Break tie if necessary. */
 		if (result == 0 &&
 			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		{
+			result = memcmp(arg1, arg2, Min(len1, len2));
+			if ((result == 0) && (len1 != len2))
+				result = (len1 < len2) ? -1 : 1;
+		}
 	}
 
 	return result;
@@ -2083,20 +1921,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
@@ -2112,13 +1936,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 
 	/*
 	 * Unfortunately, it seems that abbreviation for non-C collations is
-	 * broken on many common platforms; testing of multiple versions of glibc
-	 * reveals that, for many locales, strcoll() and strxfrm() do not return
-	 * consistent results, which is fatal to this optimization.  While no
-	 * other libc other than Cygwin has so far been shown to have a problem,
-	 * we take the conservative course of action for right now and disable
-	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can set the trust_strxfrm GUC to true.)
+	 * broken on many common platforms; see pg_strxfrm_enabled().
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -2131,8 +1949,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-	if (!trust_strxfrm && !collate_c &&
-		!(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!collate_c && !pg_strxfrm_enabled(locale))
 		abbreviate = false;
 
 	/*
@@ -2404,60 +2221,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Break tie if necessary. */
 	if (result == 0 &&
@@ -2480,6 +2244,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 static Datum
 varstr_abbrev_convert(Datum original, SortSupport ssup)
 {
+	const size_t max_prefix_bytes = sizeof(Datum);
 	VarStringSortSupport *sss = (VarStringSortSupport *) ssup->ssup_extra;
 	VarString  *authoritative = DatumGetVarStringPP(original);
 	char	   *authoritative_data = VARDATA_ANY(authoritative);
@@ -2492,7 +2257,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
-	memset(pres, 0, sizeof(Datum));
+	memset(pres, 0, max_prefix_bytes);
 	len = VARSIZE_ANY_EXHDR(authoritative);
 
 	/* Get number of bytes, ignoring trailing spaces */
@@ -2527,14 +2292,10 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 	 * thing: explicitly consider string length.
 	 */
 	if (sss->collate_c)
-		memcpy(pres, authoritative_data, Min(len, sizeof(Datum)));
+		memcpy(pres, authoritative_data, Min(len, max_prefix_bytes));
 	else
 	{
 		Size		bsize;
-#ifdef USE_ICU
-		int32_t		ulen = -1;
-		UChar	   *uchar = NULL;
-#endif
 
 		/*
 		 * We're not using the C collation, so fall back on strxfrm or ICU
@@ -2552,7 +2313,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		if (sss->last_len1 == len && sss->cache_blob &&
 			memcmp(sss->buf1, authoritative_data, len) == 0)
 		{
-			memcpy(pres, sss->buf2, Min(sizeof(Datum), sss->last_len2));
+			memcpy(pres, sss->buf2, Min(max_prefix_bytes, sss->last_len2));
 			/* No change affecting cardinality, so no hashing required */
 			goto done;
 		}
@@ -2560,81 +2321,49 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		memcpy(sss->buf1, authoritative_data, len);
 
 		/*
-		 * Just like strcoll(), strxfrm() expects a NUL-terminated string. Not
-		 * necessary for ICU, but doesn't hurt.
+		 * pg_strxfrm() and pg_strxfrm_prefix() expect NUL-terminated
+		 * strings.
 		 */
 		sss->buf1[len] = '\0';
 		sss->last_len1 = len;
 
-#ifdef USE_ICU
-		/* When using ICU and not UTF8, convert string to UChar. */
-		if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU &&
-			GetDatabaseEncoding() != PG_UTF8)
-			ulen = icu_to_uchar(&uchar, sss->buf1, len);
-#endif
-
-		/*
-		 * Loop: Call strxfrm() or ucol_getSortKey(), possibly enlarge buffer,
-		 * and try again.  Both of these functions have the result buffer
-		 * content undefined if the result did not fit, so we need to retry
-		 * until everything fits, even though we only need the first few bytes
-		 * in the end.  When using ucol_nextSortKeyPart(), however, we only
-		 * ask for as many bytes as we actually need.
-		 */
-		for (;;)
+		if (pg_strxfrm_prefix_enabled(sss->locale))
 		{
-#ifdef USE_ICU
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+			if (sss->buflen2 < max_prefix_bytes)
 			{
-				/*
-				 * When using UTF8, use the iteration interface so we only
-				 * need to produce as many bytes as we actually need.
-				 */
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UCharIterator iter;
-					uint32_t	state[2];
-					UErrorCode	status;
-
-					uiter_setUTF8(&iter, sss->buf1, len);
-					state[0] = state[1] = 0;	/* won't need that again */
-					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
-				}
-				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+				sss->buflen2 = Max(max_prefix_bytes,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
 			}
-			else
-#endif
-#ifdef HAVE_LOCALE_T
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_LIBC)
-				bsize = strxfrm_l(sss->buf2, sss->buf1,
-								  sss->buflen2, sss->locale->info.lt);
-			else
-#endif
-				bsize = strxfrm(sss->buf2, sss->buf1, sss->buflen2);
-
-			sss->last_len2 = bsize;
-			if (bsize < sss->buflen2)
-				break;
 
+			bsize = pg_strxfrm_prefix(sss->buf2, sss->buf1,
+									  max_prefix_bytes, sss->locale);
+		}
+		else
+		{
 			/*
-			 * Grow buffer and retry.
+			 * Loop: Call pg_strxfrm(), possibly enlarge buffer, and try
+			 * again.  The pg_strxfrm() function leaves the result buffer
+			 * content undefined if the result did not fit, so we need to
+			 * retry until everything fits, even though we only need the first
+			 * few bytes in the end.
 			 */
-			sss->buflen2 = Max(bsize + 1,
-							   Min(sss->buflen2 * 2, MaxAllocSize));
-			sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			for (;;)
+			{
+				bsize = pg_strxfrm(sss->buf2, sss->buf1, sss->buflen2,
+								   sss->locale);
+
+				sss->last_len2 = bsize;
+				if (bsize < sss->buflen2)
+					break;
+
+				/*
+				 * Grow buffer and retry.
+				 */
+				sss->buflen2 = Max(bsize + 1,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			}
 		}
 
 		/*
@@ -2646,12 +2375,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		 * (Actually, even if there were NUL bytes in the blob it would be
 		 * okay.  See remarks on bytea case above.)
 		 */
-		memcpy(pres, sss->buf2, Min(sizeof(Datum), bsize));
-
-#ifdef USE_ICU
-		if (uchar)
-			pfree(uchar);
-#endif
+		memcpy(pres, sss->buf2, Min(max_prefix_bytes, bsize));
 	}
 
 	/*
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index cede43440b..def2b55f94 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -100,6 +100,19 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
+extern bool pg_strxfrm_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
+						 pg_locale_t locale);
+extern size_t pg_strnxfrm(char *dest, size_t destsize, const char *src,
+						  size_t srclen, pg_locale_t locale);
+extern bool pg_strxfrm_prefix_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+								pg_locale_t locale);
+extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+								 size_t srclen, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
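
For readers following the varstr_cmp() hunk above: once the comparison goes through pg_strncoll() with explicit lengths, there are no NUL-terminated copies left, so the tie-break has to be bytewise and length-aware instead of using strcmp(). A minimal standalone sketch of that tie-break follows. The function name bytewise_tiebreak() is hypothetical and is not part of the patch.

```c
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: break a collation tie
 * between two length-bounded (not NUL-terminated) strings.
 * Compares the common prefix bytewise, then falls back to length.
 */
static int
bytewise_tiebreak(const char *arg1, size_t len1,
				  const char *arg2, size_t len2)
{
	size_t		minlen = (len1 < len2) ? len1 : len2;
	int			result = memcmp(arg1, arg2, minlen);

	/* Equal common prefix: the shorter string sorts first. */
	if (result == 0 && len1 != len2)
		result = (len1 < len2) ? -1 : 1;

	return result;
}
```

This mirrors the memcmp()/Min() logic in the hunk; the only liberty taken is replacing the Min() macro with an explicit conditional so the sketch compiles outside the PostgreSQL tree.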

v9-0003-Refactor-pg_locale_t-routines.patch (text/x-patch; charset=UTF-8)

From 6dab3a2aa800870566fb7242c085220741f0c1aa Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 5 Dec 2022 10:43:52 -0800
Subject: [PATCH v9 3/3] Refactor pg_locale_t routines.

  * add pg_locale_internal.h to hide pg_locale_struct
  * move info.lt into info.libc.lt to match icu
  * introduce init_default_locale()
  * introduce pg_locale_deterministic() accessor
  * make default_locale a static global in pg_locale.c
  * refactor pg_newlocale_from_collation()
---
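The accessor bullet above can be illustrated with a small standalone sketch. Judging only from the call sites in this patch, where "!mylocale || mylocale->deterministic" becomes pg_locale_deterministic(mylocale), the accessor presumably treats a NULL locale as deterministic. Everything named demo_* below is hypothetical and stands in for the real struct, whose layout is hidden in pg_locale_internal.h.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the hidden locale struct; fields are illustrative only. */
struct demo_locale_struct
{
	char		provider;
	bool		deterministic;
};

typedef struct demo_locale_struct *demo_locale_t;

/*
 * Hypothetical accessor: callers no longer dereference the struct, and
 * a NULL locale (assumed here to mean "no special locale") counts as
 * deterministic.
 */
static bool
demo_locale_deterministic(demo_locale_t locale)
{
	return locale == NULL || locale->deterministic;
}
```

With an accessor like this, files such as hashfunc.c need only the public header, and the NULL check lives in one place rather than at every call site.
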
 src/backend/access/hash/hashfunc.c     |  82 +++---
 src/backend/commands/collationcmds.c   |   1 +
 src/backend/regex/regc_pg_locale.c     |  45 ++--
 src/backend/utils/adt/formatting.c     |  25 +-
 src/backend/utils/adt/like.c           |   3 +-
 src/backend/utils/adt/like_support.c   |   3 +-
 src/backend/utils/adt/pg_locale.c      | 342 +++++++++++++++----------
 src/backend/utils/adt/varchar.c        |  62 ++---
 src/backend/utils/adt/varlena.c        |  14 +-
 src/backend/utils/init/postinit.c      |  29 ++-
 src/include/utils/pg_locale.h          |  55 +---
 src/include/utils/pg_locale_internal.h |  68 +++++
 12 files changed, 402 insertions(+), 327 deletions(-)
 create mode 100644 src/include/utils/pg_locale_internal.h
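
The hashfunc.c hunks below rely on a two-pass pattern: call the transform once with no buffer to learn the size, allocate, then call it again and check that the size matches. Here is a rough standalone sketch of the same idea using plain libc strxfrm(). It is an illustration under assumptions: xfrm_alloc() is a hypothetical name, malloc() stands in for palloc(), and unlike the pg_strnxfrm() calls in the patch, strxfrm() needs room for a trailing NUL.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: return a freshly allocated
 * transformed copy of src, storing its length (without NUL) in *outlen.
 * Returns NULL on allocation failure or if the second pass disagrees
 * with the first about the size.
 */
static char *
xfrm_alloc(const char *src, size_t *outlen)
{
	/* Pass 1: size query; a NULL destination is allowed when n is 0. */
	size_t		need = strxfrm(NULL, src, 0);
	char	   *buf = malloc(need + 1);
	size_t		got;

	if (buf == NULL)
		return NULL;

	/* Pass 2: fill the buffer and verify the size is unchanged. */
	got = strxfrm(buf, src, need + 1);
	if (got != need)
	{
		free(buf);
		return NULL;
	}

	*outlen = got;
	return buf;
}
```

The caller owns the returned buffer and releases it with free(), much as hashtext() calls pfree(buf) after hashing the transformed bytes.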

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index c0ed995919..7cbd39f466 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -282,36 +282,28 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 						  VARSIZE_ANY_EXHDR(key));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -337,7 +329,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
 								   VARSIZE_ANY_EXHDR(key),
@@ -345,30 +337,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
+
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 6a4311cc63..d878be2fb8 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -36,6 +36,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/rel.h"
 #include "utils/syscache.h"
 
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 323f00516d..3dc89b0336 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -17,6 +17,7 @@
 
 #include "catalog/pg_collation.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 
 /*
  * To provide as much functionality as possible on a variety of platforms,
@@ -306,13 +307,13 @@ pg_wc_isdigit(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswdigit_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswdigit_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isdigit_l((unsigned char) c, pg_regex_locale->info.lt));
+					isdigit_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -342,13 +343,13 @@ pg_wc_isalpha(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalpha_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalpha_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalpha_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalpha_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -378,13 +379,13 @@ pg_wc_isalnum(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswalnum_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswalnum_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isalnum_l((unsigned char) c, pg_regex_locale->info.lt));
+					isalnum_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -423,13 +424,13 @@ pg_wc_isupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isupper_l((unsigned char) c, pg_regex_locale->info.lt));
+					isupper_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -459,13 +460,13 @@ pg_wc_islower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					islower_l((unsigned char) c, pg_regex_locale->info.lt));
+					islower_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -495,13 +496,13 @@ pg_wc_isgraph(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswgraph_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswgraph_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isgraph_l((unsigned char) c, pg_regex_locale->info.lt));
+					isgraph_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -531,13 +532,13 @@ pg_wc_isprint(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswprint_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswprint_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isprint_l((unsigned char) c, pg_regex_locale->info.lt));
+					isprint_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -567,13 +568,13 @@ pg_wc_ispunct(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswpunct_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswpunct_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					ispunct_l((unsigned char) c, pg_regex_locale->info.lt));
+					ispunct_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -603,13 +604,13 @@ pg_wc_isspace(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return iswspace_l((wint_t) c, pg_regex_locale->info.lt);
+				return iswspace_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			return (c <= (pg_wchar) UCHAR_MAX &&
-					isspace_l((unsigned char) c, pg_regex_locale->info.lt));
+					isspace_l((unsigned char) c, pg_regex_locale->info.libc.lt));
 #endif
 			break;
 		case PG_REGEX_LOCALE_ICU:
@@ -647,13 +648,13 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
+				return towupper_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
+				return toupper_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
@@ -691,13 +692,13 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_LOCALE_WIDE_L:
 #ifdef HAVE_LOCALE_T
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
-				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
+				return towlower_l((wint_t) c, pg_regex_locale->info.libc.lt);
 #endif
 			/* FALL THRU */
 		case PG_REGEX_LOCALE_1BYTE_L:
 #ifdef HAVE_LOCALE_T
 			if (c <= (pg_wchar) UCHAR_MAX)
-				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
+				return tolower_l((unsigned char) c, pg_regex_locale->info.libc.lt);
 #endif
 			return c;
 		case PG_REGEX_LOCALE_ICU:
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index f3f4db5ef6..f45e2043a1 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -88,6 +88,7 @@
 #include "utils/memutils.h"
 #include "utils/numeric.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "varatt.h"
 
 
@@ -1577,7 +1578,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 	*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 	status = U_ZERO_ERROR;
 	len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-					mylocale->info.icu.locale, &status);
+					mylocale->ctype, &status);
 	if (status == U_BUFFER_OVERFLOW_ERROR)
 	{
 		/* try again with adjusted length */
@@ -1585,7 +1586,7 @@ icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
 		*buff_dest = palloc(len_dest * sizeof(**buff_dest));
 		status = U_ZERO_ERROR;
 		len_dest = func(*buff_dest, len_dest, buff_source, len_source,
-						mylocale->info.icu.locale, &status);
+						mylocale->ctype, &status);
 	}
 	if (U_FAILURE(status))
 		ereport(ERROR,
@@ -1698,7 +1699,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towlower(workspace[curr_char]);
@@ -1731,7 +1732,7 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+						*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_tolower((unsigned char) *p);
@@ -1820,7 +1821,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
+						workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
 					else
 #endif
 						workspace[curr_char] = towupper(workspace[curr_char]);
@@ -1853,7 +1854,7 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
 				{
 #ifdef HAVE_LOCALE_T
 					if (mylocale)
-						*p = toupper_l((unsigned char) *p, mylocale->info.lt);
+						*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
 					else
 #endif
 						*p = pg_toupper((unsigned char) *p);
@@ -1945,10 +1946,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towlower_l(workspace[curr_char], mylocale->info.libc.lt);
 						else
-							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.lt);
-						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.lt);
+							workspace[curr_char] = towupper_l(workspace[curr_char], mylocale->info.libc.lt);
+						wasalnum = iswalnum_l(workspace[curr_char], mylocale->info.libc.lt);
 					}
 					else
 #endif
@@ -1990,10 +1991,10 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
 					if (mylocale)
 					{
 						if (wasalnum)
-							*p = tolower_l((unsigned char) *p, mylocale->info.lt);
+							*p = tolower_l((unsigned char) *p, mylocale->info.libc.lt);
 						else
-							*p = toupper_l((unsigned char) *p, mylocale->info.lt);
-						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.lt);
+							*p = toupper_l((unsigned char) *p, mylocale->info.libc.lt);
+						wasalnum = isalnum_l((unsigned char) *p, mylocale->info.libc.lt);
 					}
 					else
 #endif
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index fc6cb7f5b7..6fdfa2cebd 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -24,6 +24,7 @@
 #include "miscadmin.h"
 #include "utils/builtins.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "varatt.h"
 
 
@@ -97,7 +98,7 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 		return pg_ascii_tolower(c);
 #ifdef HAVE_LOCALE_T
 	else if (locale)
-		return tolower_l(c, locale->info.lt);
+		return tolower_l(c, locale->info.libc.lt);
 #endif
 	else
 		return pg_tolower(c);
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 9b603d42f3..43150741c8 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -52,6 +52,7 @@
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/selfuncs.h"
 #include "utils/varlena.h"
 
@@ -1511,7 +1512,7 @@ pattern_char_isalpha(char c, bool is_multibyte,
 			(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
 #ifdef HAVE_LOCALE_T
 	else if (locale && locale->provider == COLLPROVIDER_LIBC)
-		return isalpha_l((unsigned char) c, locale->info.lt);
+		return isalpha_l((unsigned char) c, locale->info.libc.lt);
 #endif
 	else
 		return isalpha((unsigned char) c);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 509ef0784e..d3ca3c5011 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -65,6 +65,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
+#include "utils/pg_locale_internal.h"
 #include "utils/syscache.h"
 
 #ifdef USE_ICU
@@ -131,6 +132,11 @@ static HTAB *collation_cache = NULL;
 static char *IsoLocaleName(const char *);
 #endif
 
+/*
+ * Database default locale.
+ */
+static pg_locale_t default_locale = NULL;
+
 #ifdef USE_ICU
 /*
  * Converter object for converting between ICU's UChar strings and C strings
@@ -1336,7 +1342,7 @@ lc_collate_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1389,7 +1395,7 @@ lc_ctype_is_c(Oid collation)
 		static int	result = -1;
 		char	   *localeptr;
 
-		if (default_locale.provider == COLLPROVIDER_ICU)
+		if (default_locale->provider == COLLPROVIDER_ICU)
 			return false;
 
 		if (result >= 0)
@@ -1420,38 +1426,6 @@ lc_ctype_is_c(Oid collation)
 	return (lookup_collation_cache(collation, true))->ctype_is_c;
 }
 
-struct pg_locale_struct default_locale;
-
-void
-make_icu_collator(const char *iculocstr,
-				  struct pg_locale_struct *resultp)
-{
-#ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
-
-	status = U_ZERO_ERROR;
-	collator = ucol_open(iculocstr, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("could not open collator for locale \"%s\": %s",
-						iculocstr, u_errorName(status))));
-
-	if (U_ICU_VERSION_MAJOR_NUM < 54)
-		icu_set_collation_attributes(collator, iculocstr);
-
-	/* We will leak this string if the caller errors later :-( */
-	resultp->info.icu.locale = MemoryContextStrdup(TopMemoryContext, iculocstr);
-	resultp->info.icu.ucol = collator;
-#else							/* not USE_ICU */
-	/* could get here if a collation was created by a build with ICU */
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("ICU is not supported in this build")));
-#endif							/* not USE_ICU */
-}
-
-
 /* simple subroutine for reporting errors from newlocale() */
 #ifdef HAVE_LOCALE_T
 static void
@@ -1485,6 +1459,163 @@ report_newlocale_failure(const char *localename)
 #endif							/* HAVE_LOCALE_T */
 
 
+/*
+ * Construct a new pg_locale_t object.
+ *
+ * Passing NULL for the version is allowed; and even if it is specified, the
+ * result may or may not have an exactly matching version. Other parameters
+ * are required.
+ *
+ * Ordinarily, collate and ctype should be the same. If the provider is ICU,
+ * this is a requirement, and the 'collate' and 'ctype' arguments should both
+ * come from colliculocale (or daticulocale). If the provider is libc, the
+ * arguments should come from collcollate and collctype (or datcollate and
+ * datctype), respectively.
+ *
+ * Structures are allocated in TopMemoryContext, and the libc locale_t or
+ * UCollator is not allocated in any memory context, so the result is
+ * effectively permanent.
+ */
+static pg_locale_t
+pg_newlocale(char provider, bool deterministic, const char *collate,
+			 const char *ctype, const char *version)
+{
+	pg_locale_t result = MemoryContextAlloc(TopMemoryContext,
+											sizeof(struct pg_locale_struct));
+
+	/*
+	 * If COLLPROVIDER_DEFAULT, caller should use default_locale or NULL
+	 * instead.
+	 */
+	Assert(provider != COLLPROVIDER_DEFAULT);
+
+	result->provider = provider;
+	result->deterministic = deterministic;
+	result->collate = MemoryContextStrdup(TopMemoryContext, collate);
+	result->ctype = MemoryContextStrdup(TopMemoryContext, ctype);
+
+	if (provider == COLLPROVIDER_LIBC)
+	{
+#ifdef HAVE_LOCALE_T
+		locale_t        loc;
+
+		/* newlocale's result may be leaked if we encounter an error */
+
+		if (strcmp(collate, ctype) == 0)
+		{
+			/* Normal case where they're the same */
+			errno = 0;
+#ifndef WIN32
+			loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collate,
+							NULL);
+#else
+			loc = _create_locale(LC_ALL, collate);
+#endif
+			if (!loc)
+				report_newlocale_failure(collate);
+		}
+		else
+		{
+#ifndef WIN32
+			/* We need two newlocale() steps */
+			locale_t	loc1;
+
+			errno = 0;
+			loc1 = newlocale(LC_COLLATE_MASK, collate, NULL);
+			if (!loc1)
+				report_newlocale_failure(collate);
+			errno = 0;
+			loc = newlocale(LC_CTYPE_MASK, ctype, loc1);
+			if (!loc)
+				report_newlocale_failure(ctype);
+#else
+
+			/*
+			 * XXX The _create_locale() API doesn't appear to support
+			 * this. Could perhaps be worked around by changing
+			 * pg_locale_t to contain two separate fields.
+			 */
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("collations with different collate and ctype values are not supported on this platform")));
+#endif
+		}
+
+		result->info.libc.lt = loc;
+#else							/* not HAVE_LOCALE_T */
+		/* platform that doesn't support locale_t */
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("collation provider LIBC is not supported on this platform")));
+#endif							/* not HAVE_LOCALE_T */
+	}
+#ifdef USE_ICU
+	else if (provider == COLLPROVIDER_ICU)
+	{
+		UCollator  *collator;
+		UErrorCode	status;
+
+		/* collator may be leaked if we encounter an error */
+
+		status = U_ZERO_ERROR;
+		collator = ucol_open(collate, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("could not open collator for locale \"%s\": %s",
+							collate, u_errorName(status))));
+
+		if (U_ICU_VERSION_MAJOR_NUM < 54)
+			icu_set_collation_attributes(collator, collate);
+
+		result->info.icu.ucol = collator;
+	}
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", provider);
+
+	return result;
+}
+
+/*
+ * Accessor so that callers don't need to include pg_locale_internal.h.
+ */
+bool
+pg_locale_deterministic(pg_locale_t locale)
+{
+	if (locale == NULL)
+		return true;
+	else
+		return locale->deterministic;
+}
+
+/*
+ * Initialize default database locale.
+ */
+void
+init_default_locale(char provider, const char *collate, const char *ctype,
+					const char *iculocale, const char *version)
+{
+	/*
+	 * For the purposes of pg_locale_t, if the provider is ICU, we use
+	 * iculocale for both collate and ctype.
+	 */
+	if (provider == COLLPROVIDER_ICU)
+	{
+		collate = iculocale;
+		ctype = iculocale;
+	}
+	else
+		Assert(iculocale == NULL);
+
+	/*
+	 * Default locale is currently always deterministic.  Nondeterministic
+	 * locales currently don't support pattern matching, which would break a
+	 * lot of things if applied globally.
+	 */
+	default_locale = pg_newlocale(provider, true, collate, ctype, version);
+}
+
 /*
  * Create a locale_t from a collation OID.  Results are cached for the
  * lifetime of the backend.  Thus, do not free the result with freelocale().
@@ -1509,8 +1640,8 @@ pg_newlocale_from_collation(Oid collid)
 
 	if (collid == DEFAULT_COLLATION_OID)
 	{
-		if (default_locale.provider == COLLPROVIDER_ICU)
-			return &default_locale;
+		if (default_locale->provider == COLLPROVIDER_ICU)
+			return default_locale;
 		else
 			return (pg_locale_t) 0;
 	}
@@ -1522,107 +1653,64 @@ pg_newlocale_from_collation(Oid collid)
 		/* We haven't computed this yet in this session, so do it */
 		HeapTuple	tp;
 		Form_pg_collation collform;
-		struct pg_locale_struct result;
-		pg_locale_t resultp;
+		pg_locale_t locale;
 		Datum		datum;
 		bool		isnull;
+		char	   *collate;
+		char	   *ctype;
+		char	   *collversionstr;
 
 		tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collid));
 		if (!HeapTupleIsValid(tp))
 			elog(ERROR, "cache lookup failed for collation %u", collid);
 		collform = (Form_pg_collation) GETSTRUCT(tp);
 
-		/* We'll fill in the result struct locally before allocating memory */
-		memset(&result, 0, sizeof(result));
-		result.provider = collform->collprovider;
-		result.deterministic = collform->collisdeterministic;
+		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
+								&isnull);
+		if (!isnull)
+			collversionstr = TextDatumGetCString(datum);
+		else
+			collversionstr = NULL;
 
 		if (collform->collprovider == COLLPROVIDER_LIBC)
 		{
-#ifdef HAVE_LOCALE_T
-			const char *collcollate;
-			const char *collctype pg_attribute_unused();
-			locale_t	loc;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collcollate,
+									&isnull);
 			Assert(!isnull);
-			collcollate = TextDatumGetCString(datum);
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype, &isnull);
+			collate = TextDatumGetCString(datum);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collctype,
+									&isnull);
 			Assert(!isnull);
-			collctype = TextDatumGetCString(datum);
-
-			if (strcmp(collcollate, collctype) == 0)
-			{
-				/* Normal case where they're the same */
-				errno = 0;
-#ifndef WIN32
-				loc = newlocale(LC_COLLATE_MASK | LC_CTYPE_MASK, collcollate,
-								NULL);
-#else
-				loc = _create_locale(LC_ALL, collcollate);
-#endif
-				if (!loc)
-					report_newlocale_failure(collcollate);
-			}
-			else
-			{
-#ifndef WIN32
-				/* We need two newlocale() steps */
-				locale_t	loc1;
-
-				errno = 0;
-				loc1 = newlocale(LC_COLLATE_MASK, collcollate, NULL);
-				if (!loc1)
-					report_newlocale_failure(collcollate);
-				errno = 0;
-				loc = newlocale(LC_CTYPE_MASK, collctype, loc1);
-				if (!loc)
-					report_newlocale_failure(collctype);
-#else
-
-				/*
-				 * XXX The _create_locale() API doesn't appear to support
-				 * this. Could perhaps be worked around by changing
-				 * pg_locale_t to contain two separate fields.
-				 */
-				ereport(ERROR,
-						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-						 errmsg("collations with different collate and ctype values are not supported on this platform")));
-#endif
-			}
-
-			result.info.lt = loc;
-#else							/* not HAVE_LOCALE_T */
-			/* platform that doesn't support locale_t */
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("collation provider LIBC is not supported on this platform")));
-#endif							/* not HAVE_LOCALE_T */
+			ctype = TextDatumGetCString(datum);
 		}
+#ifdef USE_ICU
 		else if (collform->collprovider == COLLPROVIDER_ICU)
 		{
-			const char *iculocstr;
-
-			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale, &isnull);
+			datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_colliculocale,
+									&isnull);
 			Assert(!isnull);
-			iculocstr = TextDatumGetCString(datum);
-			make_icu_collator(iculocstr, &result);
+			collate = TextDatumGetCString(datum);
+
+			/* for ICU, collate and ctype are both set from iculocale */
+			ctype = collate;
 		}
+#endif
+		else
+			/* shouldn't happen */
+			elog(ERROR, "unsupported collprovider: %c", collform->collprovider);
 
-		datum = SysCacheGetAttr(COLLOID, tp, Anum_pg_collation_collversion,
-								&isnull);
-		if (!isnull)
+		locale = pg_newlocale(collform->collprovider,
+							  collform->collisdeterministic,
+							  collate, ctype, collversionstr);
+
+		ReleaseSysCache(tp);
+
+		if (collversionstr != NULL)
 		{
 			char	   *actual_versionstr;
-			char	   *collversionstr;
 
-			collversionstr = TextDatumGetCString(datum);
-
-			datum = SysCacheGetAttr(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate, &isnull);
-			Assert(!isnull);
+			actual_versionstr = get_collation_actual_version(collform->collprovider, collate);
 
-			actual_versionstr = get_collation_actual_version(collform->collprovider,
-															 TextDatumGetCString(datum));
 			if (!actual_versionstr)
 			{
 				/*
@@ -1649,13 +1737,7 @@ pg_newlocale_from_collation(Oid collid)
 															NameStr(collform->collname)))));
 		}
 
-		ReleaseSysCache(tp);
-
-		/* We'll keep the pg_locale_t structures in TopMemoryContext */
-		resultp = MemoryContextAlloc(TopMemoryContext, sizeof(*resultp));
-		*resultp = result;
-
-		cache_entry->locale = resultp;
+		cache_entry->locale = locale;
 	}
 
 	return cache_entry->locale;
@@ -1815,7 +1897,7 @@ pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
 	errno = 0;
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.libc.lt);
 	else
 #endif
 		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
@@ -1858,7 +1940,7 @@ pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
 	if (locale)
 	{
 #ifdef HAVE_LOCALE_T
-		result = strcoll_l(arg1, arg2, locale->info.lt);
+		result = strcoll_l(arg1, arg2, locale->info.libc.lt);
 #else
 		/* shouldn't happen */
 		elog(ERROR, "unsupported collprovider: %c", locale->provider);
@@ -2102,7 +2184,7 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 #ifdef TRUST_STXFRM
 #ifdef HAVE_LOCALE_T
 	if (locale)
-		return strxfrm_l(dest, src, destsize, locale->info.lt);
+		return strxfrm_l(dest, src, destsize, locale->info.libc.lt);
 	else
 #endif
 		return strxfrm(dest, src, destsize);
@@ -2699,8 +2781,8 @@ void
 check_icu_locale(const char *icu_locale)
 {
 #ifdef USE_ICU
-	UCollator  *collator;
-	UErrorCode	status;
+	UCollator	*collator;
+	UErrorCode   status;
 
 	status = U_ZERO_ERROR;
 	collator = ucol_open(icu_locale, &status);
@@ -2774,10 +2856,10 @@ wchar2char(char *to, const wchar_t *from, size_t tolen, pg_locale_t locale)
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_WCSTOMBS_L
 		/* Use wcstombs_l for nondefault locales */
-		result = wcstombs_l(to, from, tolen, locale->info.lt);
+		result = wcstombs_l(to, from, tolen, locale->info.libc.lt);
 #else							/* !HAVE_WCSTOMBS_L */
 		/* We have to temporarily set the locale as current ... ugh */
-		locale_t	save_locale = uselocale(locale->info.lt);
+		locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 		result = wcstombs(to, from, tolen);
 
@@ -2851,10 +2933,10 @@ char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen,
 #ifdef HAVE_LOCALE_T
 #ifdef HAVE_MBSTOWCS_L
 			/* Use mbstowcs_l for nondefault locales */
-			result = mbstowcs_l(to, str, tolen, locale->info.lt);
+			result = mbstowcs_l(to, str, tolen, locale->info.libc.lt);
 #else							/* !HAVE_MBSTOWCS_L */
 			/* We have to temporarily set the locale as current ... ugh */
-			locale_t	save_locale = uselocale(locale->info.lt);
+			locale_t	save_locale = uselocale(locale->info.libc.lt);
 
 			result = mbstowcs(to, str, tolen);
 
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 5bbba94687..8dc1faaaf2 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -762,7 +762,7 @@ bpchareq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -807,7 +807,7 @@ bpcharne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -1015,33 +1015,25 @@ hashbpchar(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) keydata, keylen);
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any((uint8_t *) buf, bsize);
+		result = hash_any((uint8_t *) buf, bsize);
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -1072,35 +1064,27 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) keydata, keylen,
 								   PG_GETARG_INT64(1));
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 5d1bf45009..a6459c83c0 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1228,7 +1228,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -1579,8 +1579,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 
 		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
+		if (result == 0 && pg_locale_deterministic(mylocale))
 		{
 			result = memcmp(arg1, arg2, Min(len1, len2));
 			if ((result == 0) && (len1 != len2))
@@ -1635,7 +1634,7 @@ texteq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1694,7 +1693,7 @@ textne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1808,7 +1807,7 @@ text_starts_with(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -2224,8 +2223,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
+	if (result == 0 && pg_locale_deterministic(sss->locale))
 		result = strcmp(sss->buf1, sss->buf2);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 2f07ca7a0e..994234a521 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -317,6 +317,7 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 	char	   *collate;
 	char	   *ctype;
 	char	   *iculocale;
+	char	   *collversionstr;
 
 	/* Fetch our pg_database row normally, via syscache */
 	tup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
@@ -424,35 +425,33 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 		datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_daticulocale, &isnull);
 		Assert(!isnull);
 		iculocale = TextDatumGetCString(datum);
-		make_icu_collator(iculocale, &default_locale);
 	}
 	else
 		iculocale = NULL;
 
-	default_locale.provider = dbform->datlocprovider;
+	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
+							&isnull);
+	if (!isnull)
+		collversionstr = TextDatumGetCString(datum);
+	else
+		collversionstr = NULL;
 
-	/*
-	 * Default locale is currently always deterministic.  Nondeterministic
-	 * locales currently don't support pattern matching, which would break a
-	 * lot of things if applied globally.
-	 */
-	default_locale.deterministic = true;
+	init_default_locale(dbform->datlocprovider, collate, ctype, iculocale,
+						collversionstr);
 
 	/*
 	 * Check collation version.  See similar code in
 	 * pg_newlocale_from_collation().  Note that here we warn instead of error
 	 * in any case, so that we don't prevent connecting.
 	 */
-	datum = SysCacheGetAttr(DATABASEOID, tup, Anum_pg_database_datcollversion,
-							&isnull);
-	if (!isnull)
+	if (collversionstr != NULL)
 	{
 		char	   *actual_versionstr;
-		char	   *collversionstr;
 
-		collversionstr = TextDatumGetCString(datum);
+		actual_versionstr = get_collation_actual_version(
+			dbform->datlocprovider,
+			dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 
-		actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
 		if (!actual_versionstr)
 			/* should not happen */
 			elog(WARNING,
@@ -470,6 +469,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
 							 "or build PostgreSQL with the right library version.",
 							 quote_identifier(name))));
 	}
+	else
+		collversionstr = NULL;
 
 	/* Make the locale settings visible as GUC variables, too */
 	SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index def2b55f94..56872037de 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -15,22 +15,6 @@
 #if defined(LOCALE_T_IN_XLOCALE) || defined(WCSTOMBS_L_IN_XLOCALE)
 #include <xlocale.h>
 #endif
-#ifdef USE_ICU
-#include <unicode/ucol.h>
-#endif
-
-#ifdef USE_ICU
-/*
- * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
- * (see
- * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
- */
-#if U_ICU_VERSION_MAJOR_NUM >= 53
-#define HAVE_UCOL_STRCOLLUTF8 1
-#else
-#undef HAVE_UCOL_STRCOLLUTF8
-#endif
-#endif
 
 /* use for libc locale names */
 #define LOCALE_NAME_BUFLEN 128
@@ -64,39 +48,12 @@ extern struct lconv *PGLC_localeconv(void);
 extern void cache_locale_time(void);
 
 
-/*
- * We define our own wrapper around locale_t so we can keep the same
- * function signatures for all builds, while not having to create a
- * fake version of the standard type locale_t in the global namespace.
- * pg_locale_t is occasionally checked for truth, so make it a pointer.
- */
-struct pg_locale_struct
-{
-	char		provider;
-	bool		deterministic;
-	union
-	{
-#ifdef HAVE_LOCALE_T
-		locale_t	lt;
-#endif
-#ifdef USE_ICU
-		struct
-		{
-			const char *locale;
-			UCollator  *ucol;
-		}			icu;
-#endif
-		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
-	}			info;
-};
-
 typedef struct pg_locale_struct *pg_locale_t;
 
-extern PGDLLIMPORT struct pg_locale_struct default_locale;
-
-extern void make_icu_collator(const char *iculocstr,
-							  struct pg_locale_struct *resultp);
-
+extern void init_default_locale(char provider, const char *collate,
+								const char *ctype, const char *iculocale,
+								const char *version);
+extern bool pg_locale_deterministic(pg_locale_t locale);
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
@@ -114,10 +71,6 @@ extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
 extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
 								 size_t srclen, pg_locale_t locale);
 
-#ifdef USE_ICU
-extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
-#endif
 extern void check_icu_locale(const char *icu_locale);
 
 /* These functions convert from/to libc's wchar_t, *not* pg_wchar_t */
diff --git a/src/include/utils/pg_locale_internal.h b/src/include/utils/pg_locale_internal.h
new file mode 100644
index 0000000000..33465ad92d
--- /dev/null
+++ b/src/include/utils/pg_locale_internal.h
@@ -0,0 +1,68 @@
+/*-----------------------------------------------------------------------
+ *
+ * PostgreSQL locale utilities
+ *
+ * src/include/utils/pg_locale_internal.h
+ *
+ * Copyright (c) 2002-2022, PostgreSQL Global Development Group
+ *
+ *-----------------------------------------------------------------------
+ */
+
+
+#ifndef _PG_LOCALE_INTERNAL_
+#define _PG_LOCALE_INTERNAL_
+
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
+
+#ifdef USE_ICU
+/*
+ * ucol_strcollUTF8() was introduced in ICU 50, but it is buggy before ICU 53.
+ * (see
+ * <https://www.postgresql.org/message-id/flat/f1438ec6-22aa-4029-9a3b-26f79d330e72%40manitou-mail.org>)
+ */
+#if U_ICU_VERSION_MAJOR_NUM >= 53
+#define HAVE_UCOL_STRCOLLUTF8 1
+#else
+#undef HAVE_UCOL_STRCOLLUTF8
+#endif
+#endif
+
+/*
+ * We define our own wrapper around locale_t so we can keep the same
+ * function signatures for all builds, while not having to create a
+ * fake version of the standard type locale_t in the global namespace.
+ * pg_locale_t is occasionally checked for truth, so make it a pointer.
+ */
+struct pg_locale_struct
+{
+	char		provider;
+	bool		deterministic;
+	char	   *collate;
+	char	   *ctype;
+	union
+	{
+#ifdef HAVE_LOCALE_T
+		struct
+		{
+			locale_t	lt;
+		}			libc;
+#endif
+#ifdef USE_ICU
+		struct
+		{
+			UCollator	*ucol;
+		}			icu;
+#endif
+		int			dummy;		/* in case we have neither LOCALE_T nor ICU */
+	}			info;
+};
+
+#ifdef USE_ICU
+extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
+extern int32_t icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar);
+#endif
+
+#endif							/* _PG_LOCALE_INTERNAL_ */
-- 
2.34.1

#18

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#17)

Re: Rework of collation code, extensibility

On 27.01.23 00:47, Jeff Davis wrote:

I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two,
please let me know if you want me to hold off. (I won't commit the GUCs
unless others find them generally useful; they are included here to
more easily reproduce my performance tests.)

I have looked a bit at 0002 and 0003. I like the direction. I'll spend
a bit more time reviewing it in detail. It moves a lot of code around.

I don't know to what extent this depends on the abbreviated key GUC
discussion. Does the rest of this patch set depend on this?

#19

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#18)

Re: Rework of collation code, extensibility

On Tue, 2023-01-31 at 11:40 +0100, Peter Eisentraut wrote:

I don't know to what extent this depends on the abbreviated key GUC
discussion. Does the rest of this patch set depend on this?

The overall refactoring is not dependent logically on the GUC patch. It
may require some trivial fixup if you eliminate the GUC patch.

I left it there because it makes exploring/testing easier (at least for
me), but the GUC patch doesn't need to be committed if there's no
consensus.

Regards,
Jeff Davis

#20

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#19)

Re: Rework of collation code, extensibility

On 01.02.23 00:33, Jeff Davis wrote:

On Tue, 2023-01-31 at 11:40 +0100, Peter Eisentraut wrote:

I don't know to what extent this depends on the abbreviated key GUC
discussion. Does the rest of this patch set depend on this?

The overall refactoring is not dependent logically on the GUC patch. It
may require some trivial fixup if you eliminate the GUC patch.

I left it there because it makes exploring/testing easier (at least for
me), but the GUC patch doesn't need to be committed if there's no
consensus.

I took another closer look at the 0002 and 0003 patches.

The commit message for 0002 says "Also remove the TRUST_STRXFRM define",
but I think this is incorrect, as that is done in the 0001 patch.

I don't like that the pg_strnxfrm() function requires these kinds of
repetitive error checks:

+           if (rsize != bsize)
+               elog(ERROR, "pg_strnxfrm() returned unexpected result");

This could be checked inside the function itself, so that the callers
don't have to do this themselves every time.

I don't really understand the 0003 patch. It's a lot of churn but I'm
not sure that it achieves more clarity or something.

The new function pg_locale_deterministic() seems sensible. Maybe this
could be proposed as a separate patch.

I don't understand the new header pg_locale_internal.h. What is
"internal" and what is not internal? What are we hiding from whom?
There are no code comments about this AFAICT.

pg_locale_struct has new fields

+ char *collate;
+ char *ctype;

that are not explained anywhere.

I think this patch would need a bit more explanation and commenting.
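
For illustration only, the split being asked about could look roughly like the sketch below: a public part that exposes just an opaque pointer type plus an accessor, and an internal part that carries the struct layout. This is not the patch's code. The demo_ names are hypothetical stand-ins for pg_locale_t, struct pg_locale_struct, and pg_locale_deterministic(), and the struct is trimmed to two fields.

```c
#include <stdbool.h>
#include <stddef.h>

/* --- "public" part: callers see only an opaque pointer type --- */
typedef struct demo_locale_struct *demo_locale_t;

bool		demo_locale_deterministic(demo_locale_t locale);

/* --- "internal" part: only provider code needs the layout --- */
struct demo_locale_struct
{
	char		provider;		/* hypothetical: 'c' = libc, 'i' = ICU */
	bool		deterministic;
};

/*
 * Accessor, so that callers never dereference the struct themselves.
 * NULL stands for the default locale, which is treated as deterministic.
 */
bool
demo_locale_deterministic(demo_locale_t locale)
{
	if (locale == NULL)
		return true;
	return locale->deterministic;
}
```

With an accessor like this, a caller can write `if (demo_locale_deterministic(loc))` rather than `if (!loc || loc->deterministic)`, which is the shape of the substitutions made in varchar.c and varlena.c.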

#21

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#20)

4 attachment(s)

Re: Rework of collation code, extensibility

New version attached. Changes:

* I moved the GUC patch to the end (so you can ignore it if it's not
useful for review)
* I cut out the pg_locale_internal.h rearrangement (at least for now,
it might seem useful after the dust settles on the other changes).
* I added a separate patch for pg_locale_deterministic().
* I added a separate patch for a simple cleanup of a USE_ICU special
case.

Now the patches are:

0001: pg_strcoll/pg_strxfrm
0002: pg_locale_deterministic()
0003: cleanup a USE_ICU special case
0004: GUCs (only for testing, not for commit)

Responses to your review comments inline below:

On Mon, 2023-02-13 at 11:35 +0100, Peter Eisentraut wrote:

The commit message for 0002 says "Also remove the TRUST_STRXFRM define",
but I think this is incorrect, as that is done in the 0001 patch.

Fixed.

I don't like that the pg_strnxfrm() function requires these kinds of
repetitive error checks:

+           if (rsize != bsize)
+               elog(ERROR, "pg_strnxfrm() returned unexpected result");

This could be checked inside the function itself, so that the callers
don't have to do this themselves every time.

The current API allows for a pattern like:

    /* avoids extra work if existing buffer is big enough */
    len = pg_strxfrm(buf, src, bufSize, loc);
    if (len >= bufSize)
    {
        buf = repalloc(buf, len + 1);
        bufSize = len + 1;
        len2 = pg_strxfrm(buf, src, bufSize, loc);
    }

The tests for rsize != bsize are just there to check that the underlying
library calls (strxfrm or getSortKey) behave as documented, and we
expect that they'd never be hit. It's hard to move that kind of check
into pg_strxfrm() without making it also manage the buffers.

Do you have a more specific suggestion? I'd like to keep the API
flexible enough that the caller can manage the buffers, like with
abbreviated keys. Perhaps the check can just be removed if we trust
that the library functions at least get the size calculation right? Or
turned into an Assert?
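
As a rough sketch of one possible middle ground (not code from the patch), a small helper could leave the buffer in the caller's hands while doing the size check in one place, as an assertion. The example uses plain libc strxfrm() as a stand-in for pg_strxfrm(), and realloc() in place of repalloc(). The names xfrm_buf and xfrm_into are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical caller-owned, reusable buffer (not a PostgreSQL type). */
typedef struct xfrm_buf
{
	char	   *data;			/* may be NULL when size is 0 */
	size_t		size;			/* allocated bytes */
} xfrm_buf;

/*
 * Transform src into buf, growing buf only when it is too small.
 *
 * Returns the transformed length, excluding the terminating NUL, or
 * (size_t) -1 if memory could not be allocated.
 */
static size_t
xfrm_into(xfrm_buf *buf, const char *src)
{
	/* First attempt: succeeds outright if the existing buffer fits. */
	size_t		len = strxfrm(buf->data, src, buf->size);

	if (len >= buf->size)
	{
		char	   *newdata = realloc(buf->data, len + 1);
		size_t		len2;

		if (newdata == NULL)
			return (size_t) -1;
		buf->data = newdata;
		buf->size = len + 1;

		/* Second attempt: the library must report the same length. */
		len2 = strxfrm(buf->data, src, buf->size);
		assert(len2 == len);
		(void) len2;			/* silence unused warning under NDEBUG */
	}

	return len;
}
```

A caller such as the abbreviated-key code could keep one xfrm_buf across many calls and pay for a reallocation only when a longer key shows up; the consistency check then lives in the helper instead of at every call site.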

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v10-0001-Add-pg_strcoll-pg_strxfrm-and-variants.patch (text/x-patch; charset=UTF-8)

From 5fedf2efffa4a75d8de974c41af084c5a028551d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 1 Dec 2022 14:45:15 -0800
Subject: [PATCH v10 1/4] Add pg_strcoll(), pg_strxfrm(), and variants.

Towards multi-lib ICU, which should be based on a clean separation of
the routines required for collation providers. Also offers a generally
better separation of responsibilities.

Callers with NUL-terminated strings should call pg_strcoll() or
pg_strxfrm(); callers with strings and their length should call the
variants pg_strncoll() or pg_strnxfrm().
---
 src/backend/access/hash/hashfunc.c |  45 +-
 src/backend/utils/adt/pg_locale.c  | 784 ++++++++++++++++++++++++++++-
 src/backend/utils/adt/varchar.c    |  41 +-
 src/backend/utils/adt/varlena.c    | 368 ++------------
 src/include/utils/pg_locale.h      |  13 +
 5 files changed, 863 insertions(+), 388 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index e3e40d6c21..c0ed995919 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -292,21 +292,19 @@ hashtext(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -350,21 +348,20 @@ hashtextextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
+			const char *keydata = VARDATA_ANY(key);
+			size_t		keylen = VARSIZE_ANY_EXHDR(key);
 
-			ulen = icu_to_uchar(&uchar, VARDATA_ANY(key), VARSIZE_ANY_EXHDR(key));
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 059e4fd79f..0722b1ca3b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -79,6 +79,12 @@
 #include <shlwapi.h>
 #endif
 
+/*
+ * This should be large enough that most strings will fit, but small enough
+ * that we feel comfortable putting it on the stack
+ */
+#define		TEXTBUFLEN			1024
+
 #define		MAX_L10N_DATA		80
 
 
@@ -123,6 +129,19 @@ static char *IsoLocaleName(const char *);
 #endif
 
 #ifdef USE_ICU
+/*
+ * Converter object for converting between ICU's UChar strings and C strings
+ * in database encoding.  Since the database encoding doesn't change, we only
+ * need one of these per session.
+ */
+static UConverter *icu_converter = NULL;
+
+static void init_icu_converter(void);
+static size_t uchar_length(UConverter *converter,
+						   const char *str, size_t len);
+static int32_t uchar_convert(UConverter *converter,
+							 UChar *dest, int32_t destlen,
+							 const char *str, size_t srclen);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc);
 #endif
 
@@ -1731,15 +1750,720 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 	return collversion;
 }
 
+/*
+ * pg_strncoll_libc_win32_utf8
+ *
+ * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
+ * invoke wcscoll() or wcscoll_l().
+ */
+#ifdef WIN32
+static int
+pg_strncoll_libc_win32_utf8(const char *arg1, size_t len1, const char *arg2,
+							size_t len2, pg_locale_t locale)
+{
+	char		sbuf[TEXTBUFLEN];
+	char	   *buf = sbuf;
+	char	   *a1p,
+			   *a2p;
+	int			a1len = len1 * 2 + 2;
+	int			a2len = len2 * 2 + 2;
+	int			r;
+	int			result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	Assert(GetDatabaseEncoding() == PG_UTF8);
+#ifndef WIN32
+	Assert(false);
+#endif
+
+	if (a1len + a2len > TEXTBUFLEN)
+		buf = palloc(a1len + a2len);
+
+	a1p = buf;
+	a2p = buf + a1len;
+
+	/* API does not work for zero-length input */
+	if (len1 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
+								(LPWSTR) a1p, a1len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a1p)[r] = 0;
+
+	if (len2 == 0)
+		r = 0;
+	else
+	{
+		r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
+								(LPWSTR) a2p, a2len / 2);
+		if (!r)
+			ereport(ERROR,
+					(errmsg("could not convert string to UTF-16: error code %lu",
+							GetLastError())));
+	}
+	((LPWSTR) a2p)[r] = 0;
+
+	errno = 0;
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, locale->info.lt);
+	else
+#endif
+		result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
+	if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
+								 * headers */
+		ereport(ERROR,
+				(errmsg("could not compare Unicode strings: %m")));
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+#endif							/* WIN32 */
+
+/*
+ * pg_strcoll_libc
+ *
+ * Call strcoll(), strcoll_l(), wcscoll(), or wcscoll_l() as appropriate for
+ * the given locale, platform, and database encoding. If the locale is NULL,
+ * use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ */
+static int
+pg_strcoll_libc(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+#ifdef WIN32
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		size_t len1 = strlen(arg1);
+		size_t len2 = strlen(arg2);
+		result = pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+	}
+	else
+#endif							/* WIN32 */
+	if (locale)
+	{
+#ifdef HAVE_LOCALE_T
+		result = strcoll_l(arg1, arg2, locale->info.lt);
+#else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+	}
+	else
+		result = strcoll(arg1, arg2);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_libc
+ *
+ * Null-terminate the arguments and call pg_strcoll_libc().
+ */
+static int
+pg_strncoll_libc(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	  = sbuf;
+	size_t	 bufsize1 = len1 + 1;
+	size_t	 bufsize2 = len2 + 1;
+	char	*arg1n;
+	char	*arg2n;
+	int		 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef WIN32
+	/* check for this case before doing the work for nul-termination */
+	if (GetDatabaseEncoding() == PG_UTF8)
+		return pg_strncoll_libc_win32_utf8(arg1, len1, arg2, len2, locale);
+#endif							/* WIN32 */
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	arg1n = buf;
+	arg2n = buf + bufsize1;
+
+	/* nul-terminate arguments */
+	memcpy(arg1n, arg1, len1);
+	arg1n[len1] = '\0';
+	memcpy(arg2n, arg2, len2);
+	arg2n[len2] = '\0';
+
+	result = pg_strcoll_libc(arg1n, arg2n, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
 
 #ifdef USE_ICU
+
 /*
- * Converter object for converting between ICU's UChar strings and C strings
- * in database encoding.  Since the database encoding doesn't change, we only
- * need one of these per session.
+ * pg_strncoll_icu_no_utf8
+ *
+ * Convert the arguments from the database encoding to UChar strings, then
+ * call ucol_strcoll().
+ *
+ * When the database encoding is UTF-8, and ICU supports ucol_strcollUTF8(),
+ * caller should call that instead.
  */
-static UConverter *icu_converter = NULL;
+static int
+pg_strncoll_icu_no_utf8(const char *arg1, size_t len1,
+						const char *arg2, size_t len2, pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf = sbuf;
+	int32_t	 ulen1;
+	int32_t	 ulen2;
+	size_t   bufsize1;
+	size_t   bufsize2;
+	UChar	*uchar1,
+			*uchar2;
+	int		 result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+#endif
+
+	init_icu_converter();
+
+	ulen1 = uchar_length(icu_converter, arg1, len1);
+	ulen2 = uchar_length(icu_converter, arg2, len2);
+
+	bufsize1 = (ulen1 + 1) * sizeof(UChar);
+	bufsize2 = (ulen2 + 1) * sizeof(UChar);
+
+	if (bufsize1 + bufsize2 > TEXTBUFLEN)
+		buf = palloc(bufsize1 + bufsize2);
+
+	uchar1 = (UChar *) buf;
+	uchar2 = (UChar *) (buf + bufsize1);
+
+	ulen1 = uchar_convert(icu_converter, uchar1, ulen1 + 1, arg1, len1);
+	ulen2 = uchar_convert(icu_converter, uchar2, ulen2 + 1, arg2, len2);
+
+	result = ucol_strcoll(locale->info.icu.ucol,
+						  uchar1, ulen1,
+						  uchar2, ulen2);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+/*
+ * pg_strncoll_icu
+ *
+ * Call ucol_strcollUTF8() or ucol_strcoll() as appropriate for the given
+ * database encoding.
+ *
+ * Arguments must be encoded in the database encoding.
+ */
+static int
+pg_strncoll_icu(const char *arg1, size_t len1, const char *arg2, size_t len2,
+				pg_locale_t locale)
+{
+	int result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+#ifdef HAVE_UCOL_STRCOLLUTF8
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status;
+
+		status = U_ZERO_ERROR;
+		result = ucol_strcollUTF8(locale->info.icu.ucol,
+								  arg1, len1,
+								  arg2, len2,
+								  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("collation failed: %s", u_errorName(status))));
+	}
+	else
+#endif
+	{
+		result = pg_strncoll_icu_no_utf8(arg1, len1, arg2, len2, locale);
+	}
+
+	return result;
+}
+
+/*
+ * pg_strcoll_icu
+ *
+ * Calculate the string lengths and call pg_strncoll_icu().
+ */
+static int
+pg_strcoll_icu(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strncoll_icu(arg1, -1, arg2, -1, locale);
+}
+
+#endif							/* USE_ICU */
+
+/*
+ * pg_strcoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding and nul-terminated.
+ *
+ * The caller is responsible for breaking ties if the collation is
+ * deterministic; this maintains consistency with pg_strxfrm(), which cannot
+ * easily account for deterministic collations.
+ */
+int
+pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale)
+{
+	int			result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strcoll_libc(arg1, arg2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strcoll_icu(arg1, arg2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strncoll
+ *
+ * Call ucol_strcollUTF8(), ucol_strcoll(), strcoll(), strcoll_l(), wcscoll(),
+ * or wcscoll_l() as appropriate for the given locale, platform, and database
+ * encoding. If the locale is not specified, use the database collation.
+ *
+ * Arguments must be encoded in the database encoding.
+ *
+ * This function may need to nul-terminate the arguments for libc functions;
+ * so if the caller already has nul-terminated strings, it should call
+ * pg_strcoll() instead.
+ *
+ * The caller is responsible for breaking ties if the collation is
+ * deterministic; this maintains consistency with pg_strnxfrm(), which cannot
+ * easily account for deterministic collations.
+ */
+int
+pg_strncoll(const char *arg1, size_t len1, const char *arg2, size_t len2,
+			pg_locale_t locale)
+{
+	int		 result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strncoll_libc(arg1, len1, arg2, len2, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strncoll_icu(arg1, len1, arg2, len2, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+
+static size_t
+pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
+				pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+#ifdef TRUST_STRXFRM
+#ifdef HAVE_LOCALE_T
+	if (locale)
+		return strxfrm_l(dest, src, destsize, locale->info.lt);
+	else
+#endif
+		return strxfrm(dest, src, destsize);
+#else
+	/* shouldn't happen */
+	elog(ERROR, "unsupported collprovider: %c", locale->provider);
+#endif
+}
+
+static size_t
+pg_strnxfrm_libc(char *dest, const char *src, size_t srclen, size_t destsize,
+				 pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	 = sbuf;
+	size_t	 bufsize = srclen + 1;
+	size_t	 result;
+
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+
+	if (bufsize > TEXTBUFLEN)
+		buf = palloc(bufsize);
 
+	/* nul-terminate arguments */
+	memcpy(buf, src, srclen);
+	buf[srclen] = '\0';
+
+	result = pg_strxfrm_libc(dest, buf, destsize, locale);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result;
+}
+
+static size_t
+pg_strnxfrm_prefix_libc(char *dest, const char *src, size_t srclen,
+						size_t destsize, pg_locale_t locale)
+{
+	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
+	/* unsupported; shouldn't happen */
+	elog(ERROR, "collprovider '%c' does not support pg_strnxfrm_prefix()",
+		 locale->provider);
+}
+
+#ifdef USE_ICU
+
+static size_t
+pg_strnxfrm_icu(char *dest, const char *src, size_t srclen, size_t destsize,
+				pg_locale_t locale)
+{
+	char	 sbuf[TEXTBUFLEN];
+	char	*buf	= sbuf;
+	UChar	*uchar;
+	int32_t	 ulen;
+	size_t   uchar_bsize;
+	Size	 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	result_bsize = ucol_getSortKey(locale->info.icu.ucol,
+								   uchar, ulen,
+								   (uint8_t *) dest, destsize);
+
+	if (buf != sbuf)
+		pfree(buf);
+
+	return result_bsize;
+}
+
+static size_t
+pg_strxfrm_icu(char *dest, const char *src, size_t destsize,
+			   pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_icu(dest, src, -1, destsize, locale);
+}
+
+static size_t
+pg_strnxfrm_prefix_icu_no_utf8(char *dest, const char *src, size_t srclen,
+							   size_t destsize, pg_locale_t locale)
+{
+	char			 sbuf[TEXTBUFLEN];
+	char			*buf   = sbuf;
+	UCharIterator	 iter;
+	uint32_t		 state[2];
+	UErrorCode		 status;
+	int32_t			 ulen  = -1;
+	UChar			*uchar = NULL;
+	size_t			 uchar_bsize;
+	Size			 result_bsize;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	Assert(GetDatabaseEncoding() != PG_UTF8);
+
+	init_icu_converter();
+
+	ulen = uchar_length(icu_converter, src, srclen);
+
+	uchar_bsize = (ulen + 1) * sizeof(UChar);
+
+	if (uchar_bsize > TEXTBUFLEN)
+		buf = palloc(uchar_bsize);
+
+	uchar = (UChar *) buf;
+
+	ulen = uchar_convert(icu_converter, uchar, ulen + 1, src, srclen);
+
+	uiter_setString(&iter, uchar, ulen);
+	state[0] = state[1] = 0;	/* won't need that again */
+	status = U_ZERO_ERROR;
+	result_bsize = ucol_nextSortKeyPart(locale->info.icu.ucol,
+										&iter,
+										state,
+										(uint8_t *) dest,
+										destsize,
+										&status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("sort key generation failed: %s",
+						u_errorName(status))));
+
+	return result_bsize;
+}
+
+static size_t
+pg_strnxfrm_prefix_icu(char *dest, const char *src, size_t srclen,
+					   size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	Assert(locale->provider == COLLPROVIDER_ICU);
+
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UCharIterator iter;
+		uint32_t	state[2];
+		UErrorCode	status;
+
+		uiter_setUTF8(&iter, src, srclen);
+		state[0] = state[1] = 0;	/* won't need that again */
+		status = U_ZERO_ERROR;
+		result = ucol_nextSortKeyPart(locale->info.icu.ucol,
+									  &iter,
+									  state,
+									  (uint8_t *) dest,
+									  destsize,
+									  &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+					(errmsg("sort key generation failed: %s",
+							u_errorName(status))));
+	}
+	else
+		result = pg_strnxfrm_prefix_icu_no_utf8(dest, src, srclen, destsize,
+												locale);
+
+	return result;
+}
+
+static size_t
+pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
+					  pg_locale_t locale)
+{
+	Assert(locale->provider == COLLPROVIDER_ICU);
+	return pg_strnxfrm_prefix_icu(dest, src, -1, destsize, locale);
+}
+
+#endif
+
+/*
+ * Return true if the collation provider supports pg_strxfrm() and
+ * pg_strnxfrm(); otherwise false.
+ *
+ * Unfortunately, it seems that strxfrm() for non-C collations is broken on
+ * many common platforms; testing of multiple versions of glibc reveals that,
+ * for many locales, strcoll() and strxfrm() do not return consistent
+ * results. While no other libc other than Cygwin has so far been shown to
+ * have a problem, we take the conservative course of action for right now and
+ * disable this categorically.  (Users who are certain this isn't a problem on
+ * their system can define TRUST_STRXFRM.)
+ *
+ * No similar problem is known for the ICU provider.
+ */
+bool
+pg_strxfrm_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+#ifdef TRUST_STRXFRM
+		return true;
+#else
+		return false;
+#endif
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ */
+size_t
+pg_strxfrm(char *dest, const char *src, size_t destsize, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strxfrm_libc(dest, src, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm
+ *
+ * Transforms 'src' to a nul-terminated string stored in 'dest' such that
+ * ordinary strcmp() on transformed strings is equivalent to pg_strcoll() on
+ * untransformed strings.
+ *
+ * If destsize is large enough to hold the result, returns the number of bytes
+ * copied to 'dest'; otherwise, returns the number of bytes needed to hold the
+ * result and leaves the contents of 'dest' undefined. If destsize is zero,
+ * 'dest' may be NULL.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm() instead.
+ */
+size_t
+pg_strnxfrm(char *dest, size_t destsize, const char *src, size_t srclen,
+			pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * Return true if the collation provider supports pg_strxfrm_prefix() and
+ * pg_strnxfrm_prefix(); otherwise false.
+ */
+bool
+pg_strxfrm_prefix_enabled(pg_locale_t locale)
+{
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		return false;
+	else if (locale->provider == COLLPROVIDER_ICU)
+		return true;
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+}
+
+/*
+ * pg_strxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ */
+size_t
+pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+				  pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		elog(ERROR, "collprovider '%c' does not support pg_strxfrm_prefix()",
+			 locale->provider);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strxfrm_prefix_icu(dest, src, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+/*
+ * pg_strnxfrm_prefix
+ *
+ * Transforms 'src' to a byte sequence stored in 'dest' such that ordinary
+ * memcmp() on the byte sequence is equivalent to pg_strcoll() on
+ * untransformed strings. The result is not nul-terminated.
+ *
+ * The provided 'src' must be nul-terminated.
+ *
+ * If destsize is not large enough to hold the entire result, stores just the
+ * prefix in 'dest'. Returns the number of bytes actually copied to 'dest'.
+ *
+ * This function may need to nul-terminate the argument for libc functions;
+ * so if the caller already has a nul-terminated string, it should call
+ * pg_strxfrm_prefix() instead.
+ */
+size_t
+pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+				   size_t srclen, pg_locale_t locale)
+{
+	size_t result;
+
+	if (!locale || locale->provider == COLLPROVIDER_LIBC)
+		result = pg_strnxfrm_prefix_libc(dest, src, srclen, destsize, locale);
+#ifdef USE_ICU
+	else if (locale->provider == COLLPROVIDER_ICU)
+		result = pg_strnxfrm_prefix_icu(dest, src, srclen, destsize, locale);
+#endif
+	else
+		/* shouldn't happen */
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
+	return result;
+}
+
+#ifdef USE_ICU
 static void
 init_icu_converter(void)
 {
@@ -1767,6 +2491,39 @@ init_icu_converter(void)
 	icu_converter = conv;
 }
 
+/*
+ * Find length, in UChars, of given string if converted to UChar string.
+ */
+static size_t
+uchar_length(UConverter *converter, const char *str, size_t len)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	ulen = ucnv_toUChars(converter, NULL, 0, str, len, &status);
+	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
+/*
+ * Convert the given source string into a UChar string, stored in dest, and
+ * return the length (in UChars).
+ */
+static int32_t
+uchar_convert(UConverter *converter, UChar *dest, int32_t destlen,
+			  const char *src, size_t srclen)
+{
+	UErrorCode	status = U_ZERO_ERROR;
+	int32_t		ulen;
+	status = U_ZERO_ERROR;
+	ulen = ucnv_toUChars(converter, dest, destlen, src, srclen, &status);
+	if (U_FAILURE(status))
+		ereport(ERROR,
+				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	return ulen;
+}
+
 /*
  * Convert a string in the database encoding into a string of UChars.
  *
@@ -1782,26 +2539,15 @@ init_icu_converter(void)
 int32_t
 icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes)
 {
-	UErrorCode	status;
-	int32_t		len_uchar;
+	int32_t len_uchar;
 
 	init_icu_converter();
 
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, NULL, 0,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_length(icu_converter, buff, nbytes);
 
 	*buff_uchar = palloc((len_uchar + 1) * sizeof(**buff_uchar));
-
-	status = U_ZERO_ERROR;
-	len_uchar = ucnv_toUChars(icu_converter, *buff_uchar, len_uchar + 1,
-							  buff, nbytes, &status);
-	if (U_FAILURE(status))
-		ereport(ERROR,
-				(errmsg("%s failed: %s", "ucnv_toUChars", u_errorName(status))));
+	len_uchar = uchar_convert(icu_converter,
+							  *buff_uchar, len_uchar + 1, buff, nbytes);
 
 	return len_uchar;
 }
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 8ddbae8f51..5bbba94687 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1024,21 +1024,17 @@ hashbpchar(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any(buf, bsize);
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any((uint8_t *) buf, bsize);
 
 			pfree(buf);
 		}
@@ -1086,21 +1082,18 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 #ifdef USE_ICU
 		if (mylocale->provider == COLLPROVIDER_ICU)
 		{
-			int32_t		ulen = -1;
-			UChar	   *uchar = NULL;
-			Size		bsize;
-			uint8_t    *buf;
+			Size		bsize, rsize;
+			char	   *buf;
 
-			ulen = icu_to_uchar(&uchar, keydata, keylen);
-
-			bsize = ucol_getSortKey(mylocale->info.icu.ucol,
-									uchar, ulen, NULL, 0);
+			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
 			buf = palloc(bsize);
-			ucol_getSortKey(mylocale->info.icu.ucol,
-							uchar, ulen, buf, bsize);
-			pfree(uchar);
 
-			result = hash_any_extended(buf, bsize, PG_GETARG_INT64(1));
+			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+			if (rsize != bsize)
+				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+			result = hash_any_extended((uint8_t *) buf, bsize,
+									   PG_GETARG_INT64(1));
 
 			pfree(buf);
 		}
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 170b3a3820..4ca823ca7b 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1553,10 +1553,6 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 	}
 	else
 	{
-		char		a1buf[TEXTBUFLEN];
-		char		a2buf[TEXTBUFLEN];
-		char	   *a1p,
-				   *a2p;
 		pg_locale_t mylocale;
 
 		mylocale = pg_newlocale_from_collation(collid);
@@ -1573,171 +1569,16 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		if (len1 == len2 && memcmp(arg1, arg2, len1) == 0)
 			return 0;
 
-#ifdef WIN32
-		/* Win32 does not have UTF-8, so we need to map to UTF-16 */
-		if (GetDatabaseEncoding() == PG_UTF8
-			&& (!mylocale || mylocale->provider == COLLPROVIDER_LIBC))
-		{
-			int			a1len;
-			int			a2len;
-			int			r;
-
-			if (len1 >= TEXTBUFLEN / 2)
-			{
-				a1len = len1 * 2 + 2;
-				a1p = palloc(a1len);
-			}
-			else
-			{
-				a1len = TEXTBUFLEN;
-				a1p = a1buf;
-			}
-			if (len2 >= TEXTBUFLEN / 2)
-			{
-				a2len = len2 * 2 + 2;
-				a2p = palloc(a2len);
-			}
-			else
-			{
-				a2len = TEXTBUFLEN;
-				a2p = a2buf;
-			}
-
-			/* stupid Microsloth API does not work for zero-length input */
-			if (len1 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg1, len1,
-										(LPWSTR) a1p, a1len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a1p)[r] = 0;
-
-			if (len2 == 0)
-				r = 0;
-			else
-			{
-				r = MultiByteToWideChar(CP_UTF8, 0, arg2, len2,
-										(LPWSTR) a2p, a2len / 2);
-				if (!r)
-					ereport(ERROR,
-							(errmsg("could not convert string to UTF-16: error code %lu",
-									GetLastError())));
-			}
-			((LPWSTR) a2p)[r] = 0;
-
-			errno = 0;
-#ifdef HAVE_LOCALE_T
-			if (mylocale)
-				result = wcscoll_l((LPWSTR) a1p, (LPWSTR) a2p, mylocale->info.lt);
-			else
-#endif
-				result = wcscoll((LPWSTR) a1p, (LPWSTR) a2p);
-			if (result == 2147483647)	/* _NLSCMPERROR; missing from mingw
-										 * headers */
-				ereport(ERROR,
-						(errmsg("could not compare Unicode strings: %m")));
-
-			/* Break tie if necessary. */
-			if (result == 0 &&
-				(!mylocale || mylocale->deterministic))
-			{
-				result = memcmp(arg1, arg2, Min(len1, len2));
-				if ((result == 0) && (len1 != len2))
-					result = (len1 < len2) ? -1 : 1;
-			}
-
-			if (a1p != a1buf)
-				pfree(a1p);
-			if (a2p != a2buf)
-				pfree(a2p);
-
-			return result;
-		}
-#endif							/* WIN32 */
-
-		if (len1 >= TEXTBUFLEN)
-			a1p = (char *) palloc(len1 + 1);
-		else
-			a1p = a1buf;
-		if (len2 >= TEXTBUFLEN)
-			a2p = (char *) palloc(len2 + 1);
-		else
-			a2p = a2buf;
-
-		memcpy(a1p, arg1, len1);
-		a1p[len1] = '\0';
-		memcpy(a2p, arg2, len2);
-		a2p[len2] = '\0';
-
-		if (mylocale)
-		{
-			if (mylocale->provider == COLLPROVIDER_ICU)
-			{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UErrorCode	status;
-
-					status = U_ZERO_ERROR;
-					result = ucol_strcollUTF8(mylocale->info.icu.ucol,
-											  arg1, len1,
-											  arg2, len2,
-											  &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("collation failed: %s", u_errorName(status))));
-				}
-				else
-#endif
-				{
-					int32_t		ulen1,
-								ulen2;
-					UChar	   *uchar1,
-							   *uchar2;
-
-					ulen1 = icu_to_uchar(&uchar1, arg1, len1);
-					ulen2 = icu_to_uchar(&uchar2, arg2, len2);
-
-					result = ucol_strcoll(mylocale->info.icu.ucol,
-										  uchar1, ulen1,
-										  uchar2, ulen2);
-
-					pfree(uchar1);
-					pfree(uchar2);
-				}
-#else							/* not USE_ICU */
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif							/* not USE_ICU */
-			}
-			else
-			{
-#ifdef HAVE_LOCALE_T
-				result = strcoll_l(a1p, a2p, mylocale->info.lt);
-#else
-				/* shouldn't happen */
-				elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
-#endif
-			}
-		}
-		else
-			result = strcoll(a1p, a2p);
+		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 
 		/* Break tie if necessary. */
 		if (result == 0 &&
 			(!mylocale || mylocale->deterministic))
-			result = strcmp(a1p, a2p);
-
-		if (a1p != a1buf)
-			pfree(a1p);
-		if (a2p != a2buf)
-			pfree(a2p);
+		{
+			result = memcmp(arg1, arg2, Min(len1, len2));
+			if ((result == 0) && (len1 != len2))
+				result = (len1 < len2) ? -1 : 1;
+		}
 	}
 
 	return result;
@@ -2073,20 +1914,6 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 		 */
 		locale = pg_newlocale_from_collation(collid);
 
-		/*
-		 * There is a further exception on Windows.  When the database
-		 * encoding is UTF-8 and we are not using the C collation, complex
-		 * hacks are required.  We don't currently have a comparator that
-		 * handles that case, so we fall back on the slow method of having the
-		 * sort code invoke bttextcmp() (in the case of text) via the fmgr
-		 * trampoline.  ICU locales work just the same on Windows, however.
-		 */
-#ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8 &&
-			!(locale && locale->provider == COLLPROVIDER_ICU))
-			return;
-#endif
-
 		/*
 		 * We use varlenafastcmp_locale except for type NAME.
 		 */
@@ -2102,13 +1929,7 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 
 	/*
 	 * Unfortunately, it seems that abbreviation for non-C collations is
-	 * broken on many common platforms; testing of multiple versions of glibc
-	 * reveals that, for many locales, strcoll() and strxfrm() do not return
-	 * consistent results, which is fatal to this optimization.  While no
-	 * other libc other than Cygwin has so far been shown to have a problem,
-	 * we take the conservative course of action for right now and disable
-	 * this categorically.  (Users who are certain this isn't a problem on
-	 * their system can define TRUST_STRXFRM.)
+	 * broken on many common platforms; see pg_strxfrm_enabled().
 	 *
 	 * Even apart from the risk of broken locales, it's possible that there
 	 * are platforms where the use of abbreviated keys should be disabled at
@@ -2121,10 +1942,8 @@ varstr_sortsupport(SortSupport ssup, Oid typid, Oid collid)
 	 * categorically, we may still want or need to disable it for particular
 	 * platforms.
 	 */
-#ifndef TRUST_STRXFRM
-	if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
+	if (!collate_c && !pg_strxfrm_enabled(locale))
 		abbreviate = false;
-#endif
 
 	/*
 	 * If we're using abbreviated keys, or if we're using a locale-aware
@@ -2395,60 +2214,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 		return sss->last_returned;
 	}
 
-	if (sss->locale)
-	{
-		if (sss->locale->provider == COLLPROVIDER_ICU)
-		{
-#ifdef USE_ICU
-#ifdef HAVE_UCOL_STRCOLLUTF8
-			if (GetDatabaseEncoding() == PG_UTF8)
-			{
-				UErrorCode	status;
-
-				status = U_ZERO_ERROR;
-				result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
-										  a1p, len1,
-										  a2p, len2,
-										  &status);
-				if (U_FAILURE(status))
-					ereport(ERROR,
-							(errmsg("collation failed: %s", u_errorName(status))));
-			}
-			else
-#endif
-			{
-				int32_t		ulen1,
-							ulen2;
-				UChar	   *uchar1,
-						   *uchar2;
-
-				ulen1 = icu_to_uchar(&uchar1, a1p, len1);
-				ulen2 = icu_to_uchar(&uchar2, a2p, len2);
-
-				result = ucol_strcoll(sss->locale->info.icu.ucol,
-									  uchar1, ulen1,
-									  uchar2, ulen2);
-
-				pfree(uchar1);
-				pfree(uchar2);
-			}
-#else							/* not USE_ICU */
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif							/* not USE_ICU */
-		}
-		else
-		{
-#ifdef HAVE_LOCALE_T
-			result = strcoll_l(sss->buf1, sss->buf2, sss->locale->info.lt);
-#else
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", sss->locale->provider);
-#endif
-		}
-	}
-	else
-		result = strcoll(sss->buf1, sss->buf2);
+	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Break tie if necessary. */
 	if (result == 0 &&
@@ -2471,6 +2237,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 static Datum
 varstr_abbrev_convert(Datum original, SortSupport ssup)
 {
+	const size_t max_prefix_bytes = sizeof(Datum);
 	VarStringSortSupport *sss = (VarStringSortSupport *) ssup->ssup_extra;
 	VarString  *authoritative = DatumGetVarStringPP(original);
 	char	   *authoritative_data = VARDATA_ANY(authoritative);
@@ -2483,7 +2250,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 
 	pres = (char *) &res;
 	/* memset(), so any non-overwritten bytes are NUL */
-	memset(pres, 0, sizeof(Datum));
+	memset(pres, 0, max_prefix_bytes);
 	len = VARSIZE_ANY_EXHDR(authoritative);
 
 	/* Get number of bytes, ignoring trailing spaces */
@@ -2518,14 +2285,10 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 	 * thing: explicitly consider string length.
 	 */
 	if (sss->collate_c)
-		memcpy(pres, authoritative_data, Min(len, sizeof(Datum)));
+		memcpy(pres, authoritative_data, Min(len, max_prefix_bytes));
 	else
 	{
 		Size		bsize;
-#ifdef USE_ICU
-		int32_t		ulen = -1;
-		UChar	   *uchar = NULL;
-#endif
 
 		/*
 		 * We're not using the C collation, so fall back on strxfrm or ICU
@@ -2543,7 +2306,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		if (sss->last_len1 == len && sss->cache_blob &&
 			memcmp(sss->buf1, authoritative_data, len) == 0)
 		{
-			memcpy(pres, sss->buf2, Min(sizeof(Datum), sss->last_len2));
+			memcpy(pres, sss->buf2, Min(max_prefix_bytes, sss->last_len2));
 			/* No change affecting cardinality, so no hashing required */
 			goto done;
 		}
@@ -2551,81 +2314,49 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		memcpy(sss->buf1, authoritative_data, len);
 
 		/*
-		 * Just like strcoll(), strxfrm() expects a NUL-terminated string. Not
-		 * necessary for ICU, but doesn't hurt.
+		 * pg_strxfrm() and pg_strxfrm_prefix() expect NUL-terminated
+		 * strings.
 		 */
 		sss->buf1[len] = '\0';
 		sss->last_len1 = len;
 
-#ifdef USE_ICU
-		/* When using ICU and not UTF8, convert string to UChar. */
-		if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU &&
-			GetDatabaseEncoding() != PG_UTF8)
-			ulen = icu_to_uchar(&uchar, sss->buf1, len);
-#endif
-
-		/*
-		 * Loop: Call strxfrm() or ucol_getSortKey(), possibly enlarge buffer,
-		 * and try again.  Both of these functions have the result buffer
-		 * content undefined if the result did not fit, so we need to retry
-		 * until everything fits, even though we only need the first few bytes
-		 * in the end.  When using ucol_nextSortKeyPart(), however, we only
-		 * ask for as many bytes as we actually need.
-		 */
-		for (;;)
+		if (pg_strxfrm_prefix_enabled(sss->locale))
 		{
-#ifdef USE_ICU
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_ICU)
+			if (sss->buflen2 < max_prefix_bytes)
 			{
-				/*
-				 * When using UTF8, use the iteration interface so we only
-				 * need to produce as many bytes as we actually need.
-				 */
-				if (GetDatabaseEncoding() == PG_UTF8)
-				{
-					UCharIterator iter;
-					uint32_t	state[2];
-					UErrorCode	status;
-
-					uiter_setUTF8(&iter, sss->buf1, len);
-					state[0] = state[1] = 0;	/* won't need that again */
-					status = U_ZERO_ERROR;
-					bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
-												 &iter,
-												 state,
-												 (uint8_t *) sss->buf2,
-												 Min(sizeof(Datum), sss->buflen2),
-												 &status);
-					if (U_FAILURE(status))
-						ereport(ERROR,
-								(errmsg("sort key generation failed: %s",
-										u_errorName(status))));
-				}
-				else
-					bsize = ucol_getSortKey(sss->locale->info.icu.ucol,
-											uchar, ulen,
-											(uint8_t *) sss->buf2, sss->buflen2);
+				sss->buflen2 = Max(max_prefix_bytes,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
 			}
-			else
-#endif
-#ifdef HAVE_LOCALE_T
-			if (sss->locale && sss->locale->provider == COLLPROVIDER_LIBC)
-				bsize = strxfrm_l(sss->buf2, sss->buf1,
-								  sss->buflen2, sss->locale->info.lt);
-			else
-#endif
-				bsize = strxfrm(sss->buf2, sss->buf1, sss->buflen2);
-
-			sss->last_len2 = bsize;
-			if (bsize < sss->buflen2)
-				break;
 
+			bsize = pg_strxfrm_prefix(sss->buf2, sss->buf1,
+									  max_prefix_bytes, sss->locale);
+		}
+		else
+		{
 			/*
-			 * Grow buffer and retry.
+			 * Loop: Call pg_strxfrm(), possibly enlarge buffer, and try
+			 * again.  The pg_strxfrm() function leaves the result buffer
+			 * content undefined if the result did not fit, so we need to
+			 * retry until everything fits, even though we only need the first
+			 * few bytes in the end.
 			 */
-			sss->buflen2 = Max(bsize + 1,
-							   Min(sss->buflen2 * 2, MaxAllocSize));
-			sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			for (;;)
+			{
+				bsize = pg_strxfrm(sss->buf2, sss->buf1, sss->buflen2,
+								   sss->locale);
+
+				sss->last_len2 = bsize;
+				if (bsize < sss->buflen2)
+					break;
+
+				/*
+				 * Grow buffer and retry.
+				 */
+				sss->buflen2 = Max(bsize + 1,
+								   Min(sss->buflen2 * 2, MaxAllocSize));
+				sss->buf2 = repalloc(sss->buf2, sss->buflen2);
+			}
 		}
 
 		/*
@@ -2637,12 +2368,7 @@ varstr_abbrev_convert(Datum original, SortSupport ssup)
 		 * (Actually, even if there were NUL bytes in the blob it would be
 		 * okay.  See remarks on bytea case above.)
 		 */
-		memcpy(pres, sss->buf2, Min(sizeof(Datum), bsize));
-
-#ifdef USE_ICU
-		if (uchar)
-			pfree(uchar);
-#endif
+		memcpy(pres, sss->buf2, Min(max_prefix_bytes, bsize));
 	}
 
 	/*
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index cede43440b..def2b55f94 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -100,6 +100,19 @@ extern void make_icu_collator(const char *iculocstr,
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
+extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
+extern int pg_strncoll(const char *arg1, size_t len1,
+					   const char *arg2, size_t len2, pg_locale_t locale);
+extern bool pg_strxfrm_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm(char *dest, const char *src, size_t destsize,
+						 pg_locale_t locale);
+extern size_t pg_strnxfrm(char *dest, size_t destsize, const char *src,
+						  size_t srclen, pg_locale_t locale);
+extern bool pg_strxfrm_prefix_enabled(pg_locale_t locale);
+extern size_t pg_strxfrm_prefix(char *dest, const char *src, size_t destsize,
+								pg_locale_t locale);
+extern size_t pg_strnxfrm_prefix(char *dest, size_t destsize, const char *src,
+								 size_t srclen, pg_locale_t locale);
 
 #ifdef USE_ICU
 extern int32_t icu_to_uchar(UChar **buff_uchar, const char *buff, size_t nbytes);
-- 
2.34.1
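
The retry loop that 0001 keeps for the non-prefix path can be illustrated outside the backend. What follows is a minimal sketch, not code from the patch: it assumes plain strxfrm() in place of pg_strxfrm(), malloc()/free() in place of repalloc(), and a hypothetical PREFIX_BYTES constant in place of sizeof(Datum). It retries until the whole transformed string fits, then keeps only the leading bytes as a zero-padded abbreviated key.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sizeof(Datum) on a 64-bit build. */
#define PREFIX_BYTES 8

/*
 * Fill "prefix" with the first PREFIX_BYTES bytes of the strxfrm() blob for
 * "src", zero-padded.  Returns 0 on success, -1 on allocation failure.
 */
static int
abbrev_key(const char *src, unsigned char prefix[PREFIX_BYTES])
{
	size_t		buflen = 4;		/* deliberately small, to exercise the retry */
	char	   *buf = malloc(buflen);
	size_t		bsize;

	assert(src != NULL && prefix != NULL);
	if (buf == NULL)
		return -1;

	for (;;)
	{
		bsize = strxfrm(buf, src, buflen);
		if (bsize < buflen)
			break;				/* blob and terminating NUL both fit */

		/* Did not fit, so the buffer contents are unusable: grow and retry. */
		buflen = bsize + 1;
		free(buf);
		buf = malloc(buflen);
		if (buf == NULL)
			return -1;
	}

	/* Zero first, so any bytes not overwritten below are NUL. */
	memset(prefix, 0, PREFIX_BYTES);
	memcpy(prefix, buf, bsize < PREFIX_BYTES ? bsize : PREFIX_BYTES);
	free(buf);
	return 0;
}
```

The patch grows the buffer to at least double its previous size, capped at MaxAllocSize; the sketch allocates exactly bsize + 1 bytes to stay short.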

v10-0002-Introduce-pg_locale_deterministic.patch (text/x-patch; charset=UTF-8)

From 8a37ce1056299438b83075c73741b41fdced5a67 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 13 Feb 2023 14:09:02 -0800
Subject: [PATCH v10 2/4] Introduce pg_locale_deterministic().

Avoids the need for callers to test for NULL, and also avoids the need
to access the pg_locale_t structure directly.
---
 src/backend/access/hash/hashfunc.c |  4 ++--
 src/backend/regex/regc_pg_locale.c |  2 +-
 src/backend/utils/adt/like.c       |  4 ++--
 src/backend/utils/adt/pg_locale.c  |  9 +++++++++
 src/backend/utils/adt/varchar.c    |  8 ++++----
 src/backend/utils/adt/varlena.c    | 14 ++++++--------
 src/include/utils/pg_locale.h      |  1 +
 7 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index c0ed995919..9f9ab8aa93 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -282,7 +282,7 @@ hashtext(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) VARDATA_ANY(key),
 						  VARSIZE_ANY_EXHDR(key));
@@ -337,7 +337,7 @@ hashtextextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) VARDATA_ANY(key),
 								   VARSIZE_ANY_EXHDR(key),
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 323f00516d..00ce735fdd 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -259,7 +259,7 @@ pg_set_regex_collation(Oid collation)
 		 */
 		pg_regex_locale = pg_newlocale_from_collation(collation);
 
-		if (pg_regex_locale && !pg_regex_locale->deterministic)
+		if (!pg_locale_deterministic(pg_regex_locale))
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("nondeterministic collations are not supported for regular expressions")));
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index fc6cb7f5b7..33a2f46aab 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -155,7 +155,7 @@ GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation
 	{
 		pg_locale_t locale = pg_newlocale_from_collation(collation);
 
-		if (locale && !locale->deterministic)
+		if (!pg_locale_deterministic(locale))
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("nondeterministic collations are not supported for LIKE")));
@@ -196,7 +196,7 @@ Generic_Text_IC_like(text *str, text *pat, Oid collation)
 	else
 		locale = pg_newlocale_from_collation(collation);
 
-	if (locale && !locale->deterministic)
+	if (!pg_locale_deterministic(locale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for ILIKE")));
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 0722b1ca3b..c4b7c65088 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1481,6 +1481,15 @@ report_newlocale_failure(const char *localename)
 }
 #endif							/* HAVE_LOCALE_T */
 
+bool
+pg_locale_deterministic(pg_locale_t locale)
+{
+	/* default locale must always be deterministic */
+	if (locale == NULL)
+		return true;
+	else
+		return locale->deterministic;
+}
 
 /*
  * Create a locale_t from a collation OID.  Results are cached for the
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 5bbba94687..d23c5c99f3 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -762,7 +762,7 @@ bpchareq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -807,7 +807,7 @@ bpcharne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		/*
 		 * Since we only care about equality or not-equality, we can avoid all
@@ -1015,7 +1015,7 @@ hashbpchar(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any((unsigned char *) keydata, keylen);
 	}
@@ -1072,7 +1072,7 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (!mylocale || mylocale->deterministic)
+	if (pg_locale_deterministic(mylocale))
 	{
 		result = hash_any_extended((unsigned char *) keydata, keylen,
 								   PG_GETARG_INT64(1));
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 4ca823ca7b..5778e3f0ef 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1221,7 +1221,7 @@ text_position_setup(text *t1, text *t2, Oid collid, TextPositionState *state)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -1572,8 +1572,7 @@ varstr_cmp(const char *arg1, int len1, const char *arg2, int len2, Oid collid)
 		result = pg_strncoll(arg1, len1, arg2, len2, mylocale);
 
 		/* Break tie if necessary. */
-		if (result == 0 &&
-			(!mylocale || mylocale->deterministic))
+		if (result == 0 && pg_locale_deterministic(mylocale))
 		{
 			result = memcmp(arg1, arg2, Min(len1, len2));
 			if ((result == 0) && (len1 != len2))
@@ -1628,7 +1627,7 @@ texteq(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1687,7 +1686,7 @@ textne(PG_FUNCTION_ARGS)
 	else
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (locale_is_c || !mylocale || mylocale->deterministic)
+	if (locale_is_c || pg_locale_deterministic(mylocale))
 	{
 		Datum		arg1 = PG_GETARG_DATUM(0);
 		Datum		arg2 = PG_GETARG_DATUM(1);
@@ -1801,7 +1800,7 @@ text_starts_with(PG_FUNCTION_ARGS)
 	if (!lc_collate_is_c(collid))
 		mylocale = pg_newlocale_from_collation(collid);
 
-	if (mylocale && !mylocale->deterministic)
+	if (!pg_locale_deterministic(mylocale))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("nondeterministic collations are not supported for substring searches")));
@@ -2217,8 +2216,7 @@ varstrfastcmp_locale(char *a1p, int len1, char *a2p, int len2, SortSupport ssup)
 	result = pg_strcoll(sss->buf1, sss->buf2, sss->locale);
 
 	/* Break tie if necessary. */
-	if (result == 0 &&
-		(!sss->locale || sss->locale->deterministic))
+	if (result == 0 && pg_locale_deterministic(sss->locale))
 		result = strcmp(sss->buf1, sss->buf2);
 
 	/* Cache result, perhaps saving an expensive strcoll() call next time */
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index def2b55f94..b8f22875a8 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -97,6 +97,7 @@ extern PGDLLIMPORT struct pg_locale_struct default_locale;
 extern void make_icu_collator(const char *iculocstr,
 							  struct pg_locale_struct *resultp);
 
+extern bool pg_locale_deterministic(pg_locale_t locale);
 extern pg_locale_t pg_newlocale_from_collation(Oid collid);
 
 extern char *get_collation_actual_version(char collprovider, const char *collcollate);
-- 
2.34.1
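
The NULL-safe accessor from 0002 and the length-aware tie-break from 0001 fit together roughly as follows. This is a minimal sketch under stated assumptions: struct toy_locale, toy_locale_deterministic() and break_tie() are hypothetical names rather than backend code, and the collation result is passed in instead of being computed by pg_strncoll().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for PostgreSQL's pg_locale_struct. */
struct toy_locale
{
	bool		deterministic;
};

/* NULL means the default locale, which is treated as deterministic. */
static bool
toy_locale_deterministic(const struct toy_locale *locale)
{
	return locale == NULL || locale->deterministic;
}

/*
 * Apply the bytewise tie-break to "coll_result" for two length-counted
 * strings, which need not be NUL-terminated.
 */
static int
break_tie(int coll_result,
		  const char *arg1, size_t len1,
		  const char *arg2, size_t len2,
		  const struct toy_locale *locale)
{
	int			result = coll_result;

	assert(arg1 != NULL && arg2 != NULL);

	if (result == 0 && toy_locale_deterministic(locale))
	{
		size_t		minlen = (len1 < len2) ? len1 : len2;

		/* Compare the common prefix; the shorter string sorts first. */
		result = memcmp(arg1, arg2, minlen);
		if (result == 0 && len1 != len2)
			result = (len1 < len2) ? -1 : 1;
	}
	return result;
}
```

Because the comparison uses explicit lengths, no NUL-terminated copies of the inputs are needed, which is what lets the patch drop the a1p/a2p buffers at this call site.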

v10-0003-Remove-unnecessary-ifdef-USE_ICU.patch (text/x-patch; charset=UTF-8)

From 4b2265136583c3e9c451168ef6075315700901e6 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 13 Feb 2023 14:14:04 -0800
Subject: [PATCH v10 3/4] Remove unnecessary #ifdef USE_ICU.

Now that the provider-independent API pg_strnxfrm() is available, we
no longer need the special cases for ICU in hashfunc.c and varchar.c.
---
 src/backend/access/hash/hashfunc.c | 78 ++++++++++++------------------
 src/backend/utils/adt/varchar.c    | 54 ++++++++-------------
 2 files changed, 50 insertions(+), 82 deletions(-)

diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c
index 9f9ab8aa93..7cbd39f466 100644
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -289,29 +289,21 @@ hashtext(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any((uint8_t *) buf, bsize);
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any((uint8_t *) buf, bsize);
+
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -345,30 +337,22 @@ hashtextextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
-			const char *keydata = VARDATA_ANY(key);
-			size_t		keylen = VARSIZE_ANY_EXHDR(key);
-
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
-
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
-
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
-
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		Size		bsize, rsize;
+		char	   *buf;
+		const char *keydata = VARDATA_ANY(key);
+		size_t		keylen = VARSIZE_ANY_EXHDR(key);
+
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
+
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
+
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
+
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index d23c5c99f3..8dc1faaaf2 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -1021,27 +1021,19 @@ hashbpchar(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any((uint8_t *) buf, bsize);
+		result = hash_any((uint8_t *) buf, bsize);
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	/* Avoid leaking memory for toasted inputs */
@@ -1079,28 +1071,20 @@ hashbpcharextended(PG_FUNCTION_ARGS)
 	}
 	else
 	{
-#ifdef USE_ICU
-		if (mylocale->provider == COLLPROVIDER_ICU)
-		{
-			Size		bsize, rsize;
-			char	   *buf;
+		Size		bsize, rsize;
+		char	   *buf;
 
-			bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
-			buf = palloc(bsize);
+		bsize = pg_strnxfrm(NULL, 0, keydata, keylen, mylocale);
+		buf = palloc(bsize);
 
-			rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
-			if (rsize != bsize)
-				elog(ERROR, "pg_strnxfrm() returned unexpected result");
+		rsize = pg_strnxfrm(buf, bsize, keydata, keylen, mylocale);
+		if (rsize != bsize)
+			elog(ERROR, "pg_strnxfrm() returned unexpected result");
 
-			result = hash_any_extended((uint8_t *) buf, bsize,
-									   PG_GETARG_INT64(1));
+		result = hash_any_extended((uint8_t *) buf, bsize,
+								   PG_GETARG_INT64(1));
 
-			pfree(buf);
-		}
-		else
-#endif
-			/* shouldn't happen */
-			elog(ERROR, "unsupported collprovider: %c", mylocale->provider);
+		pfree(buf);
 	}
 
 	PG_FREE_IF_COPY(key, 0);
-- 
2.34.1
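
The pattern that 0003 makes provider-independent (ask for the size, allocate, transform, hash the result) can be sketched with standard C alone. The assumptions here are plain strxfrm() in place of pg_strnxfrm(), and a hypothetical fnv1a() helper standing in for hash_any(). strxfrm() writes a terminating NUL, so this sketch allocates one extra byte; that detail says nothing about how pg_strnxfrm() sizes its output.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a, used here only as a stand-in for hash_any(). */
static uint32_t
fnv1a(const unsigned char *data, size_t len)
{
	uint32_t	h = 2166136261u;

	for (size_t i = 0; i < len; i++)
	{
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Hash the collation sort key of "key" rather than its raw bytes. */
static uint32_t
hash_sort_key(const char *key)
{
	size_t		bsize;
	size_t		rsize;
	char	   *buf;
	uint32_t	result;

	/* Pass 1: a zero-length destination only reports the required size. */
	bsize = strxfrm(NULL, key, 0);

	buf = malloc(bsize + 1);
	assert(buf != NULL);		/* a sketch: no graceful out-of-memory path */

	/* Pass 2: produce the key; its size must match the first pass. */
	rsize = strxfrm(buf, key, bsize + 1);
	assert(rsize == bsize);

	result = fnv1a((const unsigned char *) buf, bsize);
	free(buf);
	return result;
}
```

Hashing the transformed key instead of the raw bytes is what allows strings that a nondeterministic collation treats as equal to land in the same hash bucket.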

v10-0004-Introduce-GUCs-to-control-abbreviated-keys-sort-.patch (text/x-patch; charset=UTF-8)

From 95cb115bbb21fd910c53eee4138d45ec2b8a9794 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Sat, 21 Jan 2023 12:44:07 -0800
Subject: [PATCH v10 4/4] Introduce GUCs to control abbreviated keys sort
 optimization.

The setting sort_abbreviated_keys turns the optimization on or off
overall. The optimization relies on collation providers, which are
complex dependencies, and the performance of the optimization may rely
on many factors. Introducing a GUC allows easier diagnosis when this
optimization results in worse performance.

The setting trust_strxfrm replaces the define TRUST_STRXFRM, allowing
users to experiment with the abbreviated keys optimization when using
the libc provider. Previously, the optimization only applied to
collations using the ICU provider unless specially compiled. By
default, only superusers can change the setting (because an incorrect
value could lead to wrong results), but the privilege can be granted
to others.
---
 doc/src/sgml/config.sgml                   | 40 ++++++++++++++++++++++
 src/backend/utils/adt/pg_locale.c          | 20 +++++------
 src/backend/utils/misc/guc_tables.c        | 24 +++++++++++++
 src/backend/utils/sort/tuplesortvariants.c | 17 ++++++---
 4 files changed, 85 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8c56b134a8..f59a8518a0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -11252,6 +11252,46 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-sort-abbreviated-keys" xreflabel="sort_abbreviated_keys">
+      <term><varname>sort_abbreviated_keys</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>sort_abbreviated_keys</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the use of abbreviated sort keys, a sort optimization,
+        if applicable. The default is <literal>true</literal>. Disabling may
+        be useful to diagnose problems or measure performance.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-trust-strxfrm" xreflabel="trust_strxfrm">
+      <term><varname>trust_strxfrm</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>trust_strxfrm</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Abbreviated keys, a sort optimization, depends on correct behavior of
+        the operating system function <function>strxfrm()</function> when
+        using a collation with the <literal>libc</literal> provider. On some
+        platforms <function>strxfrm()</function> does not return results
+        consistent with <function>strcoll()</function>, which means the
+        optimization could return wrong results. Set to
+        <literal>true</literal> if it is certain that
+        <function>strxfrm()</function> can be trusted.
+       </para>
+       <para>
+        The default value is <literal>false</literal>. This setting has no
+        effect if <xref linkend="guc-sort-abbreviated-keys"/> is set to
+        <literal>false</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-trace-locks" xreflabel="trace_locks">
       <term><varname>trace_locks</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index c4b7c65088..d88f02aeb1 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -94,6 +94,9 @@ char	   *locale_monetary;
 char	   *locale_numeric;
 char	   *locale_time;
 
+/* GUC to enable use of strxfrm() for abbreviated keys */
+bool		trust_strxfrm = false;
+
 /*
  * lc_time localization cache.
  *
@@ -2105,17 +2108,16 @@ pg_strxfrm_libc(char *dest, const char *src, size_t destsize,
 {
 	Assert(!locale || locale->provider == COLLPROVIDER_LIBC);
 
-#ifdef TRUST_STRXFRM
+	/* shouldn't happen */
+	if (!trust_strxfrm)
+		elog(ERROR, "unsupported collprovider: %c", locale->provider);
+
 #ifdef HAVE_LOCALE_T
 	if (locale)
 		return strxfrm_l(dest, src, destsize, locale->info.lt);
 	else
 #endif
 		return strxfrm(dest, src, destsize);
-#else
-	/* shouldn't happen */
-	elog(ERROR, "unsupported collprovider: %c", locale->provider);
-#endif
 }
 
 static size_t
@@ -2302,7 +2304,7 @@ pg_strxfrm_prefix_icu(char *dest, const char *src, size_t destsize,
  * results. While no other libc other than Cygwin has so far been shown to
  * have a problem, we take the conservative course of action for right now and
  * disable this categorically.  (Users who are certain this isn't a problem on
- * their system can define TRUST_STRXFRM.)
+ * their system can set the developer GUC "trust_strxfrm".)
  *
  * No similar problem is known for the ICU provider.
  */
@@ -2310,11 +2312,7 @@ bool
 pg_strxfrm_enabled(pg_locale_t locale)
 {
 	if (!locale || locale->provider == COLLPROVIDER_LIBC)
-#ifdef TRUST_STRXFRM
-		return true;
-#else
-		return false;
-#endif
+		return trust_strxfrm;
 	else if (locale->provider == COLLPROVIDER_ICU)
 		return true;
 	else
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b46e3b8c55..a1e7eb7b99 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -102,6 +102,8 @@ extern bool trace_syncscan;
 #ifdef DEBUG_BOUNDED_SORT
 extern bool optimize_bounded_sort;
 #endif
+extern bool sort_abbreviated_keys;
+extern bool trust_strxfrm;
 
 /*
  * Options for enum values defined in this module.
@@ -1673,6 +1675,28 @@ struct config_bool ConfigureNamesBool[] =
 	},
 #endif
 
+	{
+		{"sort_abbreviated_keys", PGC_USERSET, DEVELOPER_OPTIONS,
+			gettext_noop("Enables the use of abbreviated sort keys."),
+			NULL,
+			GUC_NOT_IN_SAMPLE | GUC_EXPLAIN
+		},
+		&sort_abbreviated_keys,
+		true,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"trust_strxfrm", PGC_SUSET, DEVELOPER_OPTIONS,
+		 gettext_noop("Allow use of strxfrm() for abbreviated keys optimization for libc provider."),
+		 NULL,
+		 GUC_NOT_IN_SAMPLE
+		},
+		&trust_strxfrm,
+		false,
+		NULL, NULL, NULL
+	},
+
 #ifdef WAL_DEBUG
 	{
 		{"wal_debug", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..ba16779f97 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -37,6 +37,8 @@
 #define DATUM_SORT		2
 #define CLUSTER_SORT	3
 
+bool sort_abbreviated_keys = true;
+
 static void removeabbrev_heap(Tuplesortstate *state, SortTuple *stups,
 							  int count);
 static void removeabbrev_cluster(Tuplesortstate *state, SortTuple *stups,
@@ -185,7 +187,8 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 		sortKey->ssup_nulls_first = nullsFirstFlags[i];
 		sortKey->ssup_attno = attNums[i];
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0 && base->haveDatum1);
+		if (sort_abbreviated_keys)
+			sortKey->abbreviate = (i == 0 && base->haveDatum1);
 
 		PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
 	}
@@ -295,7 +298,8 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
 		sortKey->ssup_attno = scanKey->sk_attno;
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0 && base->haveDatum1);
+		if (sort_abbreviated_keys)
+			sortKey->abbreviate = (i == 0 && base->haveDatum1);
 
 		Assert(sortKey->ssup_attno != 0);
 
@@ -379,7 +383,8 @@ tuplesort_begin_index_btree(Relation heapRel,
 			(scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
 		sortKey->ssup_attno = scanKey->sk_attno;
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0 && base->haveDatum1);
+		if (sort_abbreviated_keys)
+			sortKey->abbreviate = (i == 0 && base->haveDatum1);
 
 		Assert(sortKey->ssup_attno != 0);
 
@@ -499,7 +504,8 @@ tuplesort_begin_index_gist(Relation heapRel,
 		sortKey->ssup_nulls_first = false;
 		sortKey->ssup_attno = i + 1;
 		/* Convey if abbreviation optimization is applicable in principle */
-		sortKey->abbreviate = (i == 0 && base->haveDatum1);
+		if (sort_abbreviated_keys)
+			sortKey->abbreviate = (i == 0 && base->haveDatum1);
 
 		Assert(sortKey->ssup_attno != 0);
 
@@ -573,7 +579,8 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	 * can't, because a datum sort only stores a single copy of the datum; the
 	 * "tuple" field of each SortTuple is NULL.
 	 */
-	base->sortKeys->abbreviate = !typbyval;
+	if (sort_abbreviated_keys)
+		base->sortKeys->abbreviate = !typbyval;
 
 	PrepareSortSupportFromOrderingOp(sortOperator, base->sortKeys);
 
-- 
2.34.1

#22

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#21)

Re: Rework of collation code, extensibility

On Mon, 2023-02-13 at 15:45 -0800, Jeff Davis wrote:

New version attached. Changes:

These patches, especially 0001, have been around for a while, and
they've received some review attention with no outstanding TODOs that
I'm aware of.

I plan to commit v10 (or something close to it) soon unless someone has
additional feedback.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#23

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#21)

Re: Rework of collation code, extensibility

On 14.02.23 00:45, Jeff Davis wrote:

Now the patches are:

0001: pg_strcoll/pg_strxfrm
0002: pg_locale_deterministic()
0003: cleanup a USE_ICU special case
0004: GUCs (only for testing, not for commit)

I haven't read the whole thing again, but this arrangement looks good to
me. I don't have an opinion on whether 0004 is actually useful.

#24

Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#23)

Re: Rework of collation code, extensibility

On Wed, 2023-02-22 at 20:49 +0100, Peter Eisentraut wrote:

On 14.02.23 00:45, Jeff Davis wrote:

Now the patches are:

     0001: pg_strcoll/pg_strxfrm
     0002: pg_locale_deterministic()
     0003: cleanup a USE_ICU special case
     0004: GUCs (only for testing, not for commit)

I haven't read the whole thing again, but this arrangement looks good to
me. I don't have an opinion on whether 0004 is actually useful.

Committed with a few revisions after I took a fresh look over the
patch.

The most significant revision came from finding out that we are also
hashing the NUL byte at the end of the string when the collation is
nondeterministic. The refactoring patch doesn't change that, of course,
but the API of pg_strnxfrm() is clearer and I added comments.
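
As a rough illustration of the NUL-byte point (a standalone sketch, not
the committed code): standard strxfrm() returns the transformed length
excluding the terminator, so hashing bsize + 1 bytes includes the NUL and
hashing bsize bytes does not. The names demo_hash and demo_hash_xfrm, the
fixed 256-byte buffer, and the FNV-1a hash are assumptions made for this
example; PostgreSQL uses its own hash functions and pg_strnxfrm().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, used only as a stand-in hash (hypothetical; not hash_any()). */
static uint32_t
demo_hash(const unsigned char *buf, size_t len)
{
	uint32_t	h = 2166136261u;

	assert(buf != NULL);
	for (size_t i = 0; i < len; i++)
	{
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Transform src with strxfrm() and hash the result.  If include_nul is
 * nonzero, the terminating NUL is hashed too (bsize + 1 bytes).
 * Returns 0 on success, -1 if the transformed key does not fit.
 */
static int
demo_hash_xfrm(const char *src, int include_nul, uint32_t *result)
{
	char		buf[256];
	size_t		bsize = strxfrm(buf, src, sizeof(buf));

	if (bsize >= sizeof(buf))
		return -1;		/* a real caller would enlarge the buffer and retry */

	*result = demo_hash((const unsigned char *) buf,
						include_nul ? bsize + 1 : bsize);
	return 0;
}
```

The two variants generally produce different hash values for the same
input, which is why the choice has to stay consistent once hashes have
been stored.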

Also, ICU uses int32_t for string lengths rather than size_t (I'm not
sure that's a great idea, but that's what ICU does). I clarified the
boundary by changing the argument types of the ICU-specific static
functions to int32_t, while leaving the API entry points as size_t.
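
A minimal sketch of that boundary, under stated assumptions: the entry
point takes size_t lengths, rejects anything that would not fit in
int32_t, and only then calls the provider-specific static function with
int32_t arguments. demo_compare and demo_compare_icu are hypothetical
names, and the bytewise memcmp() comparison merely stands in for a real
ICU call; this is not the committed code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for an ICU-specific static function: lengths are int32_t. */
static int
demo_compare_icu(const char *a, int32_t alen, const char *b, int32_t blen)
{
	int32_t		minlen = (alen < blen) ? alen : blen;
	int			cmp;

	assert(alen >= 0 && blen >= 0);

	cmp = memcmp(a, b, (size_t) minlen);
	if (cmp != 0)
		return (cmp < 0) ? -1 : 1;

	/* Equal prefix: the shorter string sorts first. */
	return (alen > blen) - (alen < blen);
}

/*
 * API entry point: lengths are size_t.  Returns 0 and stores the
 * comparison result on success, or -1 if a length exceeds INT32_MAX.
 */
static int
demo_compare(const char *a, size_t alen, const char *b, size_t blen,
			 int *result)
{
	if (alen > (size_t) INT32_MAX || blen > (size_t) INT32_MAX)
		return -1;

	*result = demo_compare_icu(a, (int32_t) alen, b, (int32_t) blen);
	return 0;
}
```

Keeping the range check in one place means the narrowing casts happen
exactly once, at the point where the size_t API hands off to the
int32_t-based code.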

--
Jeff Davis
PostgreSQL Contributor Team - AWS