Move defaults toward ICU in 16?

Started by Jeff Davisalmost 3 years ago53 messages

pgsql@j-davis.com

almost 3 years ago

As a project, do we want to nudge users toward ICU as the collation
provider as the best practice going forward?

If so, is version 16 the right time to adjust defaults to favor ICU?

* At build time, default to --with-icu (-Dicu=enabled); users who
don't want ICU can specify --without-icu (-Dicu=disabled/auto)
* At initdb time, default to --locale-provider=icu if built with
ICU support

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Regards,
Jeff Davis

robertmhaas@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#1)

Re: Move defaults toward ICU in 16?

On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis <pgsql@j-davis.com> wrote:

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Do you think it's better?

--
Robert Haas
EDB: http://www.enterprisedb.com

pgsql@j-davis.com

almost 3 years ago

In reply to: Robert Haas (#2)

Re: Move defaults toward ICU in 16?

On Thu, 2023-02-02 at 08:44 -0500, Robert Haas wrote:

On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis <pgsql@j-davis.com> wrote:

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Do you think it's better?

Yes:

* ICU more featureful: e.g. supports case-insensitive collations (the
citext docs suggest looking at ICU instead).
* It's faster: a simple non-contrived sort is something like 70%
faster[1]/messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com than one using glibc.
* It can provide consistent semantics across platforms.

I believe the above reasons are enough to call ICU "better", but it
also seems like a better path for addressing/mitigating collator
versioning problems:

* Easier for users to control what library version is available on
their system. We can also ask packagers to keep some old versions of
ICU available for an extended period of time.
* If one of the ICU multilib patches makes it in, it will be easier
for users to select which of the library versions Postgres will use.
* Reports versions for indiividual collators, distinct from the
library version.

The biggest disadvantage (rather, the flip side of its advantages) is
that it's a separate dependency. Will ICU still be maintained in 10
years or will we end up stuck maintaining it ourselves? Then again,
we've already been shipping it, so I don't know if we can avoid that
problem entirely now even if we wanted to.

I don't mean that ICU solves all of our problems -- far from it. But
you asked if I think it's better, and my answer is yes.

Regards,
Jeff Davis

[1]: /messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com
/messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com

thomas.munro@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#3)

Re: Move defaults toward ICU in 16?

On Fri, Feb 3, 2023 at 5:31 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2023-02-02 at 08:44 -0500, Robert Haas wrote:

On Thu, Feb 2, 2023 at 8:13 AM Jeff Davis <pgsql@j-davis.com> wrote:

If we don't want to nudge users toward ICU, is it because we are
waiting for something, or is there a lack of consensus that ICU is
actually better?

Do you think it's better?

Yes:

* ICU more featureful: e.g. supports case-insensitive collations (the
citext docs suggest looking at ICU instead).
* It's faster: a simple non-contrived sort is something like 70%
faster[1] than one using glibc.
* It can provide consistent semantics across platforms.

+1

* Easier for users to control what library version is available on
their system. We can also ask packagers to keep some old versions of
ICU available for an extended period of time.
* If one of the ICU multilib patches makes it in, it will be easier
for users to select which of the library versions Postgres will use.
* Reports versions for indiividual collators, distinct from the
library version.

+1

The biggest disadvantage (rather, the flip side of its advantages) is
that it's a separate dependency. Will ICU still be maintained in 10
years or will we end up stuck maintaining it ourselves? Then again,
we've already been shipping it, so I don't know if we can avoid that
problem entirely now even if we wanted to.

It has a pretty special status, with an absolutely enormous amount of
technology depending on it.

http://blog.unicode.org/2016/05/icu-joins-unicode-consortium.html
https://unicode.org/consortium/consort.html
https://home.unicode.org/membership/members/
https://home.unicode.org/about-unicode/

I mean, who knows what the future holds, but ultimately what we're
doing here is taking the de facto reference implementation of the
Unicode collation algorithm. Are Unicode and the consortium still
going to be here in 10 years? We're all in on Unicode, and it's also
tangled up with ISO standards, as are parts of the collation stuff.
Sure, there could be a clean-room implementation that replaces it in
some sense (just as there is a Java implementation) but it would very
likely be "the same" because the real thing we're buying here is the
set of algorithms and data maintenance that the whole industry has
agreed on.

Unless Britain decides to exit the Latin alphabet, terminate
membership of ISO and revert to anglo-saxon runes with a sort order
that is defined in the new constitution as "the opposite of whatever
Unicode says", it's hard to see obstacles to ICU's long term universal
applicability.

It's still important to have libc support as an option, though,
because it's a totally reasonable thing to want sort order to agree
with the "sort" command on the same host, and you are willing to deal
with all the complexities that we're trying to escape.

Peter Geoghegan

pg@bowt.ie

almost 3 years ago

In reply to: Thomas Munro (#4)

Re: Move defaults toward ICU in 16?

On Thu, Feb 2, 2023 at 2:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Sure, there could be a clean-room implementation that replaces it in
some sense (just as there is a Java implementation) but it would very
likely be "the same" because the real thing we're buying here is the
set of algorithms and data maintenance that the whole industry has
agreed on.

I don't think that a clean room implementation is implausible. They
seem to already exist, and be explicitly provided for by CLDR, which
is not joined at the hip to ICU:

https://github.com/elixir-cldr/cldr

Most of the value that we tend to think of as coming from ICU actually
comes from CLDR itself, as well as related Unicode Consortium and IETF
standards/RFCs such as BCP-47.

Unless Britain decides to exit the Latin alphabet, terminate
membership of ISO and revert to anglo-saxon runes with a sort order
that is defined in the new constitution as "the opposite of whatever
Unicode says", it's hard to see obstacles to ICU's long term universal
applicability.

It would have to literally be defined as "not unicode" for it to
present a real problem. A key goal of Unicode is to accommodate
political and cultural shifts, since even countries can come and go.
In principle Unicode should be able to accommodate just about any
change in preferences, except when there is an irreconcilable
difference of opinion among people that are from the same natural
language group. For example it can accommodate relatively minor
differences of opinion about how text should be sorted among groups
that each speak a regional dialect of the same language. Hardly
anybody even notices this.

Accommodating these variations can only come from making a huge
investment. Most of the work is actually done by natural language
scholars, not technologists. That effort is very unlikely to be
duplicated by some other group with its own conflicting goals. AFAICT
there is no great need for any schisms, since differences of opinion
can usually be accommodated under the umbrella of Unicode.

--
Peter Geoghegan

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Thomas Munro (#4)

Re: Move defaults toward ICU in 16?

Thomas Munro <thomas.munro@gmail.com> writes:

It's still important to have libc support as an option, though,
because it's a totally reasonable thing to want sort order to agree
with the "sort" command on the same host, and you are willing to deal
with all the complexities that we're trying to escape.

Yeah. I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

regards, tom lane

pgsql@j-davis.com

almost 3 years ago

In reply to: Tom Lane (#6)

1 attachment(s)

Re: Move defaults toward ICU in 16?

On Thu, 2023-02-02 at 18:10 -0500, Tom Lane wrote:

Yeah. I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

Patch attached.

To get the default locale, the patch initializes a UCollator with NULL
for the locale name, and then queries it for the locale name. Then it's
converted to a language tag, which is consistent with the initial
collation import. I'm not sure that's the best way, but it seems
reasonable.

If it's a user-provided locale (--icu-locale=), then the patch leaves
it as-is, and does not convert it to a language tag (consistent with
CREATE COLLATION and CREATE DATABASE).

I opened another discussion about whether we want to try harder to
validate or canonicalize the locale name:

/messages/by-id/11b1eeb7e7667fdd4178497aeb796c48d26e69b9.camel@j-davis.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v1-0001-Use-ICU-by-default-at-initdb-time.patchtext/x-patch; charset=UTF-8; name=v1-0001-Use-ICU-by-default-at-initdb-time.patchDownload

From 1b7d940c0f12062185b8b42bf8d3c0a6f05a74d4 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 8 Feb 2023 12:06:26 -0800
Subject: [PATCH v1] Use ICU by default at initdb time.

If the ICU locale is not specified, initialize the default collator
and retrieve the locale name from that.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 src/bin/initdb/initdb.c | 74 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..7321652db3 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -53,6 +53,7 @@
 #include <netdb.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
+#include <unicode/ucol.h>
 #include <unistd.h>
 #include <signal.h>
 #include <time.h>
@@ -133,7 +134,11 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
+#ifdef USE_ICU
+static char locale_provider = COLLPROVIDER_ICU;
+#else
 static char locale_provider = COLLPROVIDER_LIBC;
+#endif
 static char *icu_locale = NULL;
 static const char *default_text_search_config = NULL;
 static char *username = NULL;
@@ -2024,6 +2029,72 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+/*
+ * Check that ICU accepts the locale name; or if not specified, retrieve the
+ * default ICU locale.
+ */
+static void
+check_icu_locale()
+{
+#ifdef USE_ICU
+	UCollator	*collator;
+	UErrorCode   status;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+	{
+		if (icu_locale)
+			pg_fatal("ICU locale \"%s\" could not be opened: %s",
+					 icu_locale, u_errorName(status));
+		else
+			pg_fatal("default ICU locale could not be opened: %s",
+					 u_errorName(status));
+	}
+
+	/* if not specified, get locale from default collator */
+	if (icu_locale == NULL)
+	{
+		const char	*default_locale;
+
+		status = U_ZERO_ERROR;
+		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+											  &status);
+		if (U_FAILURE(status))
+		{
+			ucol_close(collator);
+			pg_fatal("could not determine default ICU locale");
+		}
+
+		if (U_ICU_VERSION_MAJOR_NUM >= 54)
+		{
+			const bool	 strict = true;
+			char		*langtag;
+			int			 len;
+
+			len = uloc_toLanguageTag(default_locale, NULL, 0, strict, &status);
+			langtag = pg_malloc(len + 1);
+			status = U_ZERO_ERROR;
+			uloc_toLanguageTag(default_locale, langtag, len + 1, strict,
+							   &status);
+
+			if (U_FAILURE(status))
+			{
+				ucol_close(collator);
+				pg_fatal("could not determine language tag for default locale \"%s\": %s",
+						 default_locale, u_errorName(status));
+			}
+
+			icu_locale = langtag;
+		}
+		else
+			icu_locale = pg_strdup(default_locale);
+	}
+
+	ucol_close(collator);
+#endif
+}
+
 /*
  * set up the locale variables
  *
@@ -2077,8 +2148,7 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
-		if (!icu_locale)
-			pg_fatal("ICU locale must be specified");
+		check_icu_locale();
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
-- 
2.34.1

andres@anarazel.de

almost 3 years ago

In reply to: Jeff Davis (#7)

Re: Move defaults toward ICU in 16?

On 2023-02-08 12:16:46 -0800, Jeff Davis wrote:

On Thu, 2023-02-02 at 18:10 -0500, Tom Lane wrote:

Yeah.ï¿½ I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

Patch attached.

Unfortunately this fails widely on CI, with both compile time and runtime
issues:
https://cirrus-ci.com/build/5116408950947840

pgsql@j-davis.com

almost 3 years ago

In reply to: Andres Freund (#8)

2 attachment(s)

Re: Move defaults toward ICU in 16?

On Wed, 2023-02-08 at 18:22 -0800, Andres Freund wrote:

On 2023-02-08 12:16:46 -0800, Jeff Davis wrote:

On Thu, 2023-02-02 at 18:10 -0500, Tom Lane wrote:

Yeah. I would be resistant to making ICU a required dependency,
but it doesn't seem unreasonable to start moving towards it being
our default collation support.

Patch attached.

Unfortunately this fails widely on CI, with both compile time and
runtime

New patches attached.

0001: build defaults to requiring ICU
0002: initdb defaults to using ICU (if server built with ICU)

One CI test is failing: "Windows - Server 2019, VS 2019 - Meson &
ninja"; if I apply Andres patch (
https://github.com/anarazel/postgres/commit/dde7c68 ), then it works.

I ran into one annoyance with pg_upgrade, which is that a v15 cluster
initialized with the defaults requires that the v16 cluster is
initialized with --locale-provider=libc, because otherwise the old and
new cluster will have mismatching template databases. Simple to fix
once you see the error, but I wonder how many initdb scripts might be
broken? I suppose it's just the cost of changing a default? Would an
environment variable help for cases where it's difficult to pass that
extra option down through a script?

I also considered posting another patch to change the default for
CREATE COLLATION, but there are a few issues I'm not sure about. Should
the default be based on whether ICU support is available? Or the
datlocprovider for the current database? And/or some kind of
compatibility GUC?

Notes on the tests I needed to fix, in case they are interesting or
point to some kind of larger problem:

* ecpg has a test that involves setting the client_encoding to LATIN1
which required a compatible server encoding so it was setting
ENCODING=SQL_ASCII, which ICU doesn't support. The ecpg test did not
look particularly sensitive to the locale, so I changed it to use
client_encoding=SQL_ASCII instead, so that the server encoding doesn't
matter.
* citext has a test involving Turkish characters, which works for all
libc locales, but in ICU the test only works in Turkish locales. I skip
the test if datlocprovider='i', because citext doesn't seem very
important in an ICU world.
* unaccent is broken if the database provider is ICU and LC_CTYPE=C,
because the t_isspace() (etc.) functions do not properly handle ICU.
Probably some other things are broken with that combination, but only
this test seems to exercise it. I just skipped the test for that broken
combination, but perhaps it should be fixed in the future.
* initdb was being built with ICU as a dependency in meson, but not
autoconf. I assume it's fine to link ICU into initdb, so I changed the
Makefile.
* I changed a couple tests to initialize with --locale-provider=libc.
They were testing that creating a database with the ICU provider but no
ICU locale fails, and that's easiest to test if the template is libc.
* The CI test CompilerWarnings:mingw_cross_warning was failing because
ICU is not available. I added --without-icu in the .cirrus.yml file and
it works.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v2-0001-Build-ICU-support-by-default.patchtext/x-patch; charset=UTF-8; name=v2-0001-Build-ICU-support-by-default.patchDownload

From b1772b12be3c47a00a3723d2937421cb5bbfe3a3 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 10 Feb 2023 12:08:11 -0800
Subject: [PATCH v2 1/2] Build ICU support by default.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 .cirrus.yml                    |  1 +
 configure                      | 46 ++++++++------------
 configure.ac                   |  8 +++-
 doc/src/sgml/installation.sgml | 78 +++++++++++++++++++---------------
 meson.build                    | 12 +++++-
 meson_options.txt              |  2 +-
 6 files changed, 78 insertions(+), 69 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 6b98980075..24302efea6 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -775,6 +775,7 @@ task:
       time ./configure \
         --host=x86_64-w64-mingw32 \
         --enable-cassert \
+        --without-icu \
         CC="ccache x86_64-w64-mingw32-gcc" \
         CXX="ccache x86_64-w64-mingw32-g++"
       make -s -j${BUILD_JOBS} clean
diff --git a/configure b/configure
index 5d07fd0bb9..ddb75f7f4b 100755
--- a/configure
+++ b/configure
@@ -1558,7 +1558,7 @@ Optional Packages:
                           set WAL block size in kB [8]
   --with-CC=CMD           set compiler (deprecated)
   --with-llvm             build with LLVM based JIT support
-  --with-icu              build with ICU support
+  --without-icu           build without ICU support
   --with-tcl              build Tcl modules (PL/Tcl)
   --with-tclconfig=DIR    tclConfig.sh is in DIR
   --with-perl             build Perl modules (PL/Perl)
@@ -8401,7 +8401,9 @@ $as_echo "#define USE_ICU 1" >>confdefs.h
   esac
 
 else
-  with_icu=no
+  with_icu=yes
+
+$as_echo "#define USE_ICU 1" >>confdefs.h
 
 fi
 
@@ -8470,31 +8472,17 @@ fi
 	# Put the nasty error message in config.log where it belongs
 	echo "$ICU_PKG_ERRORS" >&5
 
-	as_fn_error $? "Package requirements (icu-uc icu-i18n) were not met:
-
-$ICU_PKG_ERRORS
-
-Consider adjusting the PKG_CONFIG_PATH environment variable if you
-installed software in a non-standard prefix.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details." "$LINENO" 5
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 elif test $pkg_failed = untried; then
         { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
 $as_echo "no" >&6; }
-	{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
-$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
-as_fn_error $? "The pkg-config script could not be found or is too old.  Make sure it
-is in your PATH or set the PKG_CONFIG environment variable to the full
-path to pkg-config.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details.
-
-To get pkg-config, see <http://pkg-config.freedesktop.org/>.
-See \`config.log' for more details" "$LINENO" 5; }
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 else
 	ICU_CFLAGS=$pkg_cv_ICU_CFLAGS
 	ICU_LIBS=$pkg_cv_ICU_LIBS
@@ -15323,7 +15311,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15369,7 +15357,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15393,7 +15381,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15438,7 +15426,7 @@ else
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
@@ -15462,7 +15450,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
     We can't simply define LARGE_OFF_T to be 9223372036854775807,
     since some C++ compilers masquerading as C compilers
     incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
   int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
 		       && LARGE_OFF_T % 2147483647 == 1)
 		      ? 1 : -1];
diff --git a/configure.ac b/configure.ac
index e9b74ced6c..909f5dba3c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -853,13 +853,17 @@ AC_SUBST(enable_thread_safety)
 # ICU
 #
 AC_MSG_CHECKING([whether to build with ICU support])
-PGAC_ARG_BOOL(with, icu, no, [build with ICU support],
+PGAC_ARG_BOOL(with, icu, yes, [build without ICU support],
               [AC_DEFINE([USE_ICU], 1, [Define to build with ICU support. (--with-icu)])])
 AC_MSG_RESULT([$with_icu])
 AC_SUBST(with_icu)
 
 if test "$with_icu" = yes; then
-  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n)
+  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n, [],
+    [AC_MSG_ERROR([ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support.])])
 fi
 
 #
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 6619e69462..b5f88d27df 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -146,6 +146,35 @@ documentation.  See standalone-profile.xsl for details.
       <application>pg_restore</application>.
      </para>
     </listitem>
+
+    <listitem>
+     <para>
+      The ICU locale provider (see <xref linkend="locale-providers"/>) is used by default. If you don't want to use it then you must specify the <option>--without-icu</option> option to <filename>configure</filename>. Using this option disables support for ICU collation features (see <xref linkend="collation"/>).
+     </para>
+     <para>
+      ICU support requires the <productname>ICU4C</productname> package to be
+      installed.  The minimum required version of
+      <productname>ICU4C</productname> is currently 4.2.
+     </para>
+
+     <para>
+      By default,
+      <productname>pkg-config</productname><indexterm><primary>pkg-config</primary></indexterm>
+      will be used to find the required compilation options.  This is
+      supported for <productname>ICU4C</productname> version 4.6 and later.
+      For older versions, or if <productname>pkg-config</productname> is not
+      available, the variables <envar>ICU_CFLAGS</envar> and
+      <envar>ICU_LIBS</envar> can be specified to
+      <filename>configure</filename>, like in this example:
+<programlisting>
+./configure ... ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
+</programlisting>
+      (If <productname>ICU4C</productname> is in the default search path
+      for the compiler, then you still need to specify nonempty strings in
+      order to avoid use of <productname>pkg-config</productname>, for
+      example, <literal>ICU_CFLAGS=' '</literal>.)
+     </para>
+    </listitem>
    </itemizedlist>
   </para>
 
@@ -926,40 +955,6 @@ build-postgresql:
        </listitem>
       </varlistentry>
 
-      <varlistentry id="configure-option-with-icu">
-       <term><option>--with-icu</option></term>
-       <listitem>
-        <para>
-         Build with support for
-         the <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
-         library, enabling use of ICU collation
-         features<phrase condition="standalone-ignore"> (see
-         <xref linkend="collation"/>)</phrase>.
-         This requires the <productname>ICU4C</productname> package
-         to be installed.  The minimum required version
-         of <productname>ICU4C</productname> is currently 4.2.
-        </para>
-
-        <para>
-         By default,
-         <productname>pkg-config</productname><indexterm><primary>pkg-config</primary></indexterm>
-         will be used to find the required compilation options.  This is
-         supported for <productname>ICU4C</productname> version 4.6 and later.
-         For older versions, or if <productname>pkg-config</productname> is
-         not available, the variables <envar>ICU_CFLAGS</envar>
-         and <envar>ICU_LIBS</envar> can be specified
-         to <filename>configure</filename>, like in this example:
-<programlisting>
-./configure ... --with-icu ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
-</programlisting>
-         (If <productname>ICU4C</productname> is in the default search path
-         for the compiler, then you still need to specify nonempty strings in
-         order to avoid use of <productname>pkg-config</productname>, for
-         example, <literal>ICU_CFLAGS=' '</literal>.)
-        </para>
-       </listitem>
-      </varlistentry>
-
       <varlistentry id="configure-with-llvm">
        <term><option>--with-llvm</option></term>
        <listitem>
@@ -1231,6 +1226,19 @@ build-postgresql:
 
      <variablelist>
 
+      <varlistentry id="configure-option-without-icu">
+       <term><option>--without-icu</option></term>
+       <listitem>
+        <para>
+         Build without support for the
+         <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
+         library, disabling the use of ICU collation features<phrase
+         condition="standalone-ignore"> (see <xref
+         linkend="collation"/>)</phrase>.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="configure-option-without-readline">
        <term><option>--without-readline</option></term>
        <listitem>
@@ -2419,7 +2427,7 @@ ninja install
         <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
         library, enabling use of ICU collation features<phrase
         condition="standalone-ignore"> (see <xref
-        linkend="collation"/>)</phrase>.  Defaults to auto and requires the
+        linkend="collation"/>)</phrase>.  Defaults to enabled and requires the
         <productname>ICU4C</productname> package to be installed.  The minimum
         required version of <productname>ICU4C</productname> is currently 4.2.
        </para>
diff --git a/meson.build b/meson.build
index e379a252a5..9cd7491fda 100644
--- a/meson.build
+++ b/meson.build
@@ -721,8 +721,16 @@ endif
 
 icuopt = get_option('icu')
 if not icuopt.disabled()
-  icu = dependency('icu-uc', required: icuopt.enabled())
-  icu_i18n = dependency('icu-i18n', required: icuopt.enabled())
+  icu = dependency('icu-uc', required: false)
+  icu_i18n = dependency('icu-i18n', required: false)
+
+  icu_found = icu.found() and icu_i18n.found()
+  if icuopt.enabled() and not icu_found
+     error('''ICU library not found
+If you have ICU already installed, see meson-log/meson-log.txt for details on the
+failure. It is possible the compiler isn't looking in the proper directory.
+Use -Dicu=disabled to disable ICU support.''')
+  endif
 
   if icu.found()
     cdata.set('USE_ICU', 1)
diff --git a/meson_options.txt b/meson_options.txt
index 9d3ef4aa20..fbd4e69356 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -82,7 +82,7 @@ option('dtrace', type : 'feature', value: 'disabled',
 option('gssapi', type : 'feature', value: 'auto',
   description: 'GSSAPI support')
 
-option('icu', type : 'feature', value: 'auto',
+option('icu', type : 'feature', value: 'enabled',
   description: 'ICU support')
 
 option('ldap', type : 'feature', value: 'auto',
-- 
2.34.1

v2-0002-Use-ICU-by-default-at-initdb-time.patchtext/x-patch; charset=UTF-8; name=v2-0002-Use-ICU-by-default-at-initdb-time.patchDownload

From 64ad3a3b0c8bcdf95df90a9db05edb255a3f18c2 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 8 Feb 2023 12:06:26 -0800
Subject: [PATCH v2 2/2] Use ICU by default at initdb time.

If the ICU locale is not specified, initialize the default collator
and retrieve the locale name from that.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 contrib/citext/expected/citext_utf8.out       |  4 +-
 contrib/citext/expected/citext_utf8_1.out     |  4 +-
 contrib/citext/sql/citext_utf8.sql            |  4 +-
 contrib/unaccent/expected/unaccent.out        |  9 +++
 contrib/unaccent/expected/unaccent_1.out      |  8 ++
 contrib/unaccent/sql/unaccent.sql             | 11 +++
 doc/src/sgml/ref/initdb.sgml                  | 48 +++++++-----
 src/bin/initdb/Makefile                       |  4 +-
 src/bin/initdb/initdb.c                       | 76 ++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl                |  7 +-
 src/bin/pg_dump/t/002_pg_dump.pl              |  2 +-
 src/bin/scripts/t/020_createdb.pl             |  2 +-
 src/interfaces/ecpg/test/Makefile             |  3 -
 src/interfaces/ecpg/test/connect/test5.pgc    |  2 +-
 .../ecpg/test/expected/connect-test5.c        |  2 +-
 .../ecpg/test/expected/connect-test5.stderr   |  2 +-
 src/interfaces/ecpg/test/meson.build          |  1 -
 src/test/icu/t/010_database.pl                |  2 +-
 18 files changed, 149 insertions(+), 42 deletions(-)
 create mode 100644 contrib/unaccent/expected/unaccent_1.out

diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 666b07ccec..85ce9c3b64 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -3,7 +3,9 @@
  * and a Unicode-aware locale.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 433e985349..60ddebd841 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -3,7 +3,9 @@
  * and a Unicode-aware locale.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index d068000b42..e9c504e764 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -4,7 +4,9 @@
  */
 
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/unaccent/expected/unaccent.out b/contrib/unaccent/expected/unaccent.out
index ee0ac71a1c..cef98ee60c 100644
--- a/contrib/unaccent/expected/unaccent.out
+++ b/contrib/unaccent/expected/unaccent.out
@@ -1,3 +1,12 @@
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
+\endif
 CREATE EXTENSION unaccent;
 -- must have a UTF8 database
 SELECT getdatabaseencoding();
diff --git a/contrib/unaccent/expected/unaccent_1.out b/contrib/unaccent/expected/unaccent_1.out
new file mode 100644
index 0000000000..0a4a3838ab
--- /dev/null
+++ b/contrib/unaccent/expected/unaccent_1.out
@@ -0,0 +1,8 @@
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index 3fc0c706be..027dfb964a 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -1,3 +1,14 @@
+
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
 CREATE EXTENSION unaccent;
 
 -- must have a UTF8 database
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 5b2bdac101..4f37386ea3 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -89,10 +89,28 @@ PostgreSQL documentation
    and character set encoding. These can also be set separately for each
    database when it is created. <command>initdb</command> determines those
    settings for the template databases, which will serve as the default for
-   all other databases.  By default, <command>initdb</command> uses the
-   locale provider <literal>libc</literal>, takes the locale settings from
-   the environment, and determines the encoding from the locale settings.
-   This is almost always sufficient, unless there are special requirements.
+   all other databases.
+  </para>
+
+  <para>
+   By default, <command>initdb</command> uses the ICU library to provide
+   locale services if the server was built with ICU support; otherwise it uses
+   the <literal>libc</literal> locale provider (see <xref
+   linkend="locale-providers"/>). To choose the specific ICU locale ID to
+   apply, use the option <option>--icu-locale</option>.  Note that for
+   implementation reasons and to support legacy code,
+   <command>initdb</command> will still select and initialize libc locale
+   settings when the ICU locale provider is used.
+  </para>
+
+  <para>
+   Alternatively, <command>initdb</command> can use the locale provider
+   <literal>libc</literal>. To select this option, specify
+   <literal>--locale-provider=libc</literal>, or build the server without ICU
+   support. The <literal>libc</literal> locale provider takes the locale
+   settings from the environment, and determines the encoding from the locale
+   settings.  This is almost always sufficient, unless there are special
+   requirements.
   </para>
 
   <para>
@@ -103,17 +121,6 @@ PostgreSQL documentation
    categories can give nonsensical results, so this should be used with care.
   </para>
 
-  <para>
-   Alternatively, the ICU library can be used to provide locale services.
-   (Again, this only sets the default for subsequently created databases.)  To
-   select this option, specify <literal>--locale-provider=icu</literal>.
-   To choose the specific ICU locale ID to apply, use the option
-   <option>--icu-locale</option>.  Note that
-   for implementation reasons and to support legacy code,
-   <command>initdb</command> will still select and initialize libc locale
-   settings when the ICU locale provider is used.
-  </para>
-
   <para>
    When <command>initdb</command> runs, it will print out the locale settings
    it has chosen.  If you have complex requirements or specified multiple
@@ -234,7 +241,8 @@ PostgreSQL documentation
       <term><option>--icu-locale=<replaceable>locale</replaceable></option></term>
       <listitem>
        <para>
-        Specifies the ICU locale ID, if the ICU locale provider is used.
+        Specifies the ICU locale ID, if the ICU locale provider is used. By
+        default, ICU obtains the ICU locale from the ICU default collator.
        </para>
       </listitem>
      </varlistentry>
@@ -297,10 +305,12 @@ PostgreSQL documentation
       <term><option>--locale-provider={<literal>libc</literal>|<literal>icu</literal>}</option></term>
       <listitem>
        <para>
-        This option sets the locale provider for databases created in the
-        new cluster.  It can be overridden in the <command>CREATE
+        This option sets the locale provider for databases created in the new
+        cluster.  It can be overridden in the <command>CREATE
         DATABASE</command> command when new databases are subsequently
-        created.  The default is <literal>libc</literal>.
+        created.  The default is <literal>icu</literal> if the server was
+        built with ICU support; otherwise the default is
+        <literal>libc</literal> (see <xref linkend="locale-providers"/>).
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/bin/initdb/Makefile b/src/bin/initdb/Makefile
index eab89c5501..d69bd89572 100644
--- a/src/bin/initdb/Makefile
+++ b/src/bin/initdb/Makefile
@@ -16,7 +16,7 @@ subdir = src/bin/initdb
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
+override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(ICU_CFLAGS) $(CPPFLAGS)
 
 # Note: it's important that we link to encnames.o from libpgcommon, not
 # from libpq, else we have risks of version skew if we run with a libpq
@@ -24,7 +24,7 @@ override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
 # should ensure that that happens.
 #
 # We need libpq only because fe_utils does.
-LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) $(ICU_LIBS)
 
 # use system timezone data?
 ifneq (,$(with_system_tzdata))
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..0776294499 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -53,6 +53,9 @@
 #include <netdb.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
 #include <unistd.h>
 #include <signal.h>
 #include <time.h>
@@ -133,7 +136,11 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
+#ifdef USE_ICU
+static char locale_provider = COLLPROVIDER_ICU;
+#else
 static char locale_provider = COLLPROVIDER_LIBC;
+#endif
 static char *icu_locale = NULL;
 static const char *default_text_search_config = NULL;
 static char *username = NULL;
@@ -2024,6 +2031,72 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+/*
+ * Check that ICU accepts the locale name; or if not specified, retrieve the
+ * default ICU locale.
+ */
+static void
+check_icu_locale()
+{
+#ifdef USE_ICU
+	UCollator	*collator;
+	UErrorCode   status;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+	{
+		if (icu_locale)
+			pg_fatal("could not open collator for locale \"%s\": %s",
+					 icu_locale, u_errorName(status));
+		else
+			pg_fatal("could not open collator for default locale: %s",
+					 u_errorName(status));
+	}
+
+	/* if not specified, get locale from default collator */
+	if (icu_locale == NULL)
+	{
+		const char	*default_locale;
+
+		status = U_ZERO_ERROR;
+		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+											  &status);
+		if (U_FAILURE(status))
+		{
+			ucol_close(collator);
+			pg_fatal("could not determine default ICU locale");
+		}
+
+		if (U_ICU_VERSION_MAJOR_NUM >= 54)
+		{
+			const bool	 strict = true;
+			char		*langtag;
+			int			 len;
+
+			len = uloc_toLanguageTag(default_locale, NULL, 0, strict, &status);
+			langtag = pg_malloc(len + 1);
+			status = U_ZERO_ERROR;
+			uloc_toLanguageTag(default_locale, langtag, len + 1, strict,
+							   &status);
+
+			if (U_FAILURE(status))
+			{
+				ucol_close(collator);
+				pg_fatal("could not determine language tag for default locale \"%s\": %s",
+						 default_locale, u_errorName(status));
+			}
+
+			icu_locale = langtag;
+		}
+		else
+			icu_locale = pg_strdup(default_locale);
+	}
+
+	ucol_close(collator);
+#endif
+}
+
 /*
  * set up the locale variables
  *
@@ -2077,8 +2150,7 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
-		if (!icu_locale)
-			pg_fatal("ICU locale must be specified");
+		check_icu_locale();
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 772769acab..e5d214e09c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -97,11 +97,6 @@ SKIP:
 
 if ($ENV{with_icu} eq 'yes')
 {
-	command_fails_like(
-		[ 'initdb', '--no-sync', '--locale-provider=icu', "$tempdir/data2" ],
-		qr/initdb: error: ICU locale must be specified/,
-		'locale provider ICU requires --icu-locale');
-
 	command_ok(
 		[
 			'initdb',                '--no-sync',
@@ -116,7 +111,7 @@ if ($ENV{with_icu} eq 'yes')
 			'--locale-provider=icu', '--icu-locale=@colNumeric=lower',
 			"$tempdir/dataX"
 		],
-		qr/FATAL:  could not open collator for locale/,
+		qr/error: could not open collator for locale/,
 		'fails for invalid ICU locale');
 
 	command_fails_like(
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d92247c915..30294f381c 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1684,7 +1684,7 @@ my %tests = (
 		create_sql =>
 		  "CREATE DATABASE dump_test2 LOCALE = 'C' TEMPLATE = template0;",
 		regexp => qr/^
-			\QCREATE DATABASE dump_test2 \E.*\QLOCALE = 'C';\E
+			\QCREATE DATABASE dump_test2 \E.*\QLOCALE = 'C'\E
 			/xm,
 		like => { pg_dumpall_dbprivs => 1, },
 	},
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 3ad4fbb00c..8ec58cdd64 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -13,7 +13,7 @@ program_version_ok('createdb');
 program_options_handling_ok('createdb');
 
 my $node = PostgreSQL::Test::Cluster->new('main');
-$node->init;
+$node->init(extra => ['--locale-provider=libc']);
 $node->start;
 
 $node->issues_sql_like(
diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index d7a7d1d1ca..cf841a3a5b 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -14,9 +14,6 @@ override CPPFLAGS := \
 	'-DSHELLPROG="$(SHELL)"' \
 	$(CPPFLAGS)
 
-# default encoding for regression tests
-ENCODING = SQL_ASCII
-
 ifneq ($(build_os),mingw32)
 abs_builddir := $(shell pwd)
 else
diff --git a/src/interfaces/ecpg/test/connect/test5.pgc b/src/interfaces/ecpg/test/connect/test5.pgc
index de29160089..d512553677 100644
--- a/src/interfaces/ecpg/test/connect/test5.pgc
+++ b/src/interfaces/ecpg/test/connect/test5.pgc
@@ -55,7 +55,7 @@ exec sql end declare section;
 	exec sql connect to 'unix:postgresql://localhost/ecpg2_regression' as main user :user USING "connectpw";
 	exec sql disconnect main;
 
-	exec sql connect to unix:postgresql://localhost/ecpg2_regression?connect_timeout=180&client_encoding=latin1 as main user regress_ecpg_user1/connectpw;
+	exec sql connect to unix:postgresql://localhost/ecpg2_regression?connect_timeout=180&client_encoding=sql_ascii as main user regress_ecpg_user1/connectpw;
 	exec sql disconnect main;
 
 	exec sql connect to "unix:postgresql://200.46.204.71/ecpg2_regression" as main user regress_ecpg_user1/connectpw;
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.c b/src/interfaces/ecpg/test/expected/connect-test5.c
index c1124c627f..ec1514ed9a 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.c
+++ b/src/interfaces/ecpg/test/expected/connect-test5.c
@@ -117,7 +117,7 @@ main(void)
 #line 56 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=latin1" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=sql_ascii" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
 #line 58 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.stderr b/src/interfaces/ecpg/test/expected/connect-test5.stderr
index 01a6a0a13b..51cc18916a 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.stderr
+++ b/src/interfaces/ecpg/test/expected/connect-test5.stderr
@@ -50,7 +50,7 @@
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: opening database ecpg2_regression on <DEFAULT> port <DEFAULT> with options connect_timeout=180 & client_encoding=latin1 for user regress_ecpg_user1
+[NO_PID]: ECPGconnect: opening database ecpg2_regression on <DEFAULT> port <DEFAULT> with options connect_timeout=180 & client_encoding=sql_ascii for user regress_ecpg_user1
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 00000
diff --git a/src/interfaces/ecpg/test/meson.build b/src/interfaces/ecpg/test/meson.build
index d0be73ccf9..04c6819a79 100644
--- a/src/interfaces/ecpg/test/meson.build
+++ b/src/interfaces/ecpg/test/meson.build
@@ -69,7 +69,6 @@ ecpg_test_files = files(
 ecpg_regress_args = [
   '--dbname=ecpg1_regression,ecpg2_regression',
   '--create-role=regress_ecpg_user1,regress_ecpg_user2',
-  '--encoding=SQL_ASCII',
 ]
 
 tests += {
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 80ab1c7789..45d77c319a 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -12,7 +12,7 @@ if ($ENV{with_icu} ne 'yes')
 }
 
 my $node1 = PostgreSQL::Test::Cluster->new('node1');
-$node1->init;
+$node1->init(extra => ['--locale-provider=libc']);
 $node1->start;
 
 $node1->safe_psql('postgres',
-- 
2.34.1

andres@anarazel.de

almost 3 years ago

In reply to: Jeff Davis (#9)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-10 16:17:00 -0800, Jeff Davis wrote:

One CI test is failing: "Windows - Server 2019, VS 2019 - Meson &
ninja"; if I apply Andres patch (
https://github.com/anarazel/postgres/commit/dde7c68 ), then it works.

Until something like my patch above is done more generally applicable, I think
your patch should disable ICU on windows. Can't just fail to build.

Perhaps we don't need to force ICU use to on with the meson build, given that
it defaults to auto-detection?

I ran into one annoyance with pg_upgrade, which is that a v15 cluster
initialized with the defaults requires that the v16 cluster is
initialized with --locale-provider=libc, because otherwise the old and
new cluster will have mismatching template databases. Simple to fix
once you see the error, but I wonder how many initdb scripts might be
broken? I suppose it's just the cost of changing a default? Would an
environment variable help for cases where it's difficult to pass that
extra option down through a script?

That seems problematic to me.

But, shouldn't pg_upgrade be able to deal with this? As long as the databases
are created with template0, we can create the collations at that point?

@@ -15323,7 +15311,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807.  */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];

This stuff shouldn't be in here, it's due to a debian patched autoconf.

Greetings,

Andres Freund

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#1)

Re: Move defaults toward ICU in 16?

On Thu, 2023-02-02 at 05:13 -0800, Jeff Davis wrote:

As a project, do we want to nudge users toward ICU as the collation
provider as the best practice going forward?

One consideration here is security. Any vulnerability in ICU collation
routines could easily become a vulnerability in Postgres.

I looked at these lists:

https://www.cvedetails.com/vulnerability-list/vendor_id-17477/Icu-project.html
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=icu
https://unicode-org.atlassian.net/issues/?jql=labels%20%3D%20%22security%22
https://unicode-org.atlassian.net/issues/?jql=labels%20%3D%20%22was_sensitive%22

Here are the recent CVEs:

CVE-2021-30535 https://unicode-org.atlassian.net/browse/ICU-21587
CVE-2020-21913 https://unicode-org.atlassian.net/browse/ICU-20850
CVE-2020-10531 https://unicode-org.atlassian.net/browse/ICU-20958

But there are quite a few JIRAs that look concerning that don't have a
CVE assigned:

2021 https://unicode-org.atlassian.net/browse/ICU-21537
2021 https://unicode-org.atlassian.net/browse/ICU-21597
2021 https://unicode-org.atlassian.net/browse/ICU-21676
2021 https://unicode-org.atlassian.net/browse/ICU-21749

Not sure which of these are exploitable, and if they are, why they
don't have a CVE. If someone else finds more issues, please let me
know.

The good news is that the Chrome/Chromium projects are actively finding
and reporting issues.

I didn't look for comparable information about glibc, but I would guess
that exploitable memory errors in setlocale/strcoll are very rare,
otherwise it would be a security disaster for many projects.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

pgsql@j-davis.com

almost 3 years ago

In reply to: Andres Freund (#10)

2 attachment(s)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-10 at 18:00 -0800, Andres Freund wrote:

Until something like my patch above is done more generally
applicable, I think
your patch should disable ICU on windows. Can't just fail to build.

Perhaps we don't need to force ICU use to on with the meson build,
given that
it defaults to auto-detection?

Done. I changed it back to 'auto', and tests pass.

But, shouldn't pg_upgrade be able to deal with this? As long as the
databases
are created with template0, we can create the collations at that
point?

Are you saying that the upgraded cluster could have a different default
collation for the template databases than the original cluster?

That would be wrong to do, at least by default, but I could see it
being a useful option.

Or maybe I misunderstand what you're saying?

This stuff shouldn't be in here, it's due to a debian patched
autoconf.

Removed, thank you.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v3-0001-Build-ICU-support-by-default.patchtext/x-patch; charset=UTF-8; name=v3-0001-Build-ICU-support-by-default.patchDownload

From 0a691bdc1952871b2ec8d1c5086c90c8943d99cb Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 10 Feb 2023 12:08:11 -0800
Subject: [PATCH v3 1/2] Build ICU support by default.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 .cirrus.yml                    |  1 +
 configure                      | 36 ++++++----------
 configure.ac                   |  8 +++-
 doc/src/sgml/installation.sgml | 76 +++++++++++++++++++---------------
 4 files changed, 61 insertions(+), 60 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index f212978752..34450a9c7b 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -751,6 +751,7 @@ task:
       time ./configure \
         --host=x86_64-w64-mingw32 \
         --enable-cassert \
+        --without-icu \
         CC="ccache x86_64-w64-mingw32-gcc" \
         CXX="ccache x86_64-w64-mingw32-g++"
       make -s -j${BUILD_JOBS} clean
diff --git a/configure b/configure
index 5d07fd0bb9..507ca3c983 100755
--- a/configure
+++ b/configure
@@ -1558,7 +1558,7 @@ Optional Packages:
                           set WAL block size in kB [8]
   --with-CC=CMD           set compiler (deprecated)
   --with-llvm             build with LLVM based JIT support
-  --with-icu              build with ICU support
+  --without-icu           build without ICU support
   --with-tcl              build Tcl modules (PL/Tcl)
   --with-tclconfig=DIR    tclConfig.sh is in DIR
   --with-perl             build Perl modules (PL/Perl)
@@ -8401,7 +8401,9 @@ $as_echo "#define USE_ICU 1" >>confdefs.h
   esac
 
 else
-  with_icu=no
+  with_icu=yes
+
+$as_echo "#define USE_ICU 1" >>confdefs.h
 
 fi
 
@@ -8470,31 +8472,17 @@ fi
 	# Put the nasty error message in config.log where it belongs
 	echo "$ICU_PKG_ERRORS" >&5
 
-	as_fn_error $? "Package requirements (icu-uc icu-i18n) were not met:
-
-$ICU_PKG_ERRORS
-
-Consider adjusting the PKG_CONFIG_PATH environment variable if you
-installed software in a non-standard prefix.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details." "$LINENO" 5
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 elif test $pkg_failed = untried; then
         { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
 $as_echo "no" >&6; }
-	{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
-$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
-as_fn_error $? "The pkg-config script could not be found or is too old.  Make sure it
-is in your PATH or set the PKG_CONFIG environment variable to the full
-path to pkg-config.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details.
-
-To get pkg-config, see <http://pkg-config.freedesktop.org/>.
-See \`config.log' for more details" "$LINENO" 5; }
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 else
 	ICU_CFLAGS=$pkg_cv_ICU_CFLAGS
 	ICU_LIBS=$pkg_cv_ICU_LIBS
diff --git a/configure.ac b/configure.ac
index e9b74ced6c..909f5dba3c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -853,13 +853,17 @@ AC_SUBST(enable_thread_safety)
 # ICU
 #
 AC_MSG_CHECKING([whether to build with ICU support])
-PGAC_ARG_BOOL(with, icu, no, [build with ICU support],
+PGAC_ARG_BOOL(with, icu, yes, [build without ICU support],
               [AC_DEFINE([USE_ICU], 1, [Define to build with ICU support. (--with-icu)])])
 AC_MSG_RESULT([$with_icu])
 AC_SUBST(with_icu)
 
 if test "$with_icu" = yes; then
-  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n)
+  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n, [],
+    [AC_MSG_ERROR([ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support.])])
 fi
 
 #
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 6619e69462..bad6bc6e5e 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -146,6 +146,35 @@ documentation.  See standalone-profile.xsl for details.
       <application>pg_restore</application>.
      </para>
     </listitem>
+
+    <listitem>
+     <para>
+      The ICU locale provider (see <xref linkend="locale-providers"/>) is used by default. If you don't want to use it then you must specify the <option>--without-icu</option> option to <filename>configure</filename>. Using this option disables support for ICU collation features (see <xref linkend="collation"/>).
+     </para>
+     <para>
+      ICU support requires the <productname>ICU4C</productname> package to be
+      installed.  The minimum required version of
+      <productname>ICU4C</productname> is currently 4.2.
+     </para>
+
+     <para>
+      By default,
+      <productname>pkg-config</productname><indexterm><primary>pkg-config</primary></indexterm>
+      will be used to find the required compilation options.  This is
+      supported for <productname>ICU4C</productname> version 4.6 and later.
+      For older versions, or if <productname>pkg-config</productname> is not
+      available, the variables <envar>ICU_CFLAGS</envar> and
+      <envar>ICU_LIBS</envar> can be specified to
+      <filename>configure</filename>, like in this example:
+<programlisting>
+./configure ... ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
+</programlisting>
+      (If <productname>ICU4C</productname> is in the default search path
+      for the compiler, then you still need to specify nonempty strings in
+      order to avoid use of <productname>pkg-config</productname>, for
+      example, <literal>ICU_CFLAGS=' '</literal>.)
+     </para>
+    </listitem>
    </itemizedlist>
   </para>
 
@@ -926,40 +955,6 @@ build-postgresql:
        </listitem>
       </varlistentry>
 
-      <varlistentry id="configure-option-with-icu">
-       <term><option>--with-icu</option></term>
-       <listitem>
-        <para>
-         Build with support for
-         the <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
-         library, enabling use of ICU collation
-         features<phrase condition="standalone-ignore"> (see
-         <xref linkend="collation"/>)</phrase>.
-         This requires the <productname>ICU4C</productname> package
-         to be installed.  The minimum required version
-         of <productname>ICU4C</productname> is currently 4.2.
-        </para>
-
-        <para>
-         By default,
-         <productname>pkg-config</productname><indexterm><primary>pkg-config</primary></indexterm>
-         will be used to find the required compilation options.  This is
-         supported for <productname>ICU4C</productname> version 4.6 and later.
-         For older versions, or if <productname>pkg-config</productname> is
-         not available, the variables <envar>ICU_CFLAGS</envar>
-         and <envar>ICU_LIBS</envar> can be specified
-         to <filename>configure</filename>, like in this example:
-<programlisting>
-./configure ... --with-icu ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
-</programlisting>
-         (If <productname>ICU4C</productname> is in the default search path
-         for the compiler, then you still need to specify nonempty strings in
-         order to avoid use of <productname>pkg-config</productname>, for
-         example, <literal>ICU_CFLAGS=' '</literal>.)
-        </para>
-       </listitem>
-      </varlistentry>
-
       <varlistentry id="configure-with-llvm">
        <term><option>--with-llvm</option></term>
        <listitem>
@@ -1231,6 +1226,19 @@ build-postgresql:
 
      <variablelist>
 
+      <varlistentry id="configure-option-without-icu">
+       <term><option>--without-icu</option></term>
+       <listitem>
+        <para>
+         Build without support for the
+         <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
+         library, disabling the use of ICU collation features<phrase
+         condition="standalone-ignore"> (see <xref
+         linkend="collation"/>)</phrase>.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="configure-option-without-readline">
        <term><option>--without-readline</option></term>
        <listitem>
-- 
2.34.1

v3-0002-Use-ICU-by-default-at-initdb-time.patchtext/x-patch; charset=UTF-8; name=v3-0002-Use-ICU-by-default-at-initdb-time.patchDownload

From 9818b4b8c5ec15aa6cbf0e6db6ae888173885dca Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 8 Feb 2023 12:06:26 -0800
Subject: [PATCH v3 2/2] Use ICU by default at initdb time.

If the ICU locale is not specified, initialize the default collator
and retrieve the locale name from that.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 contrib/citext/expected/citext_utf8.out       |  4 +-
 contrib/citext/expected/citext_utf8_1.out     |  4 +-
 contrib/citext/sql/citext_utf8.sql            |  4 +-
 contrib/unaccent/expected/unaccent.out        |  9 +++
 contrib/unaccent/expected/unaccent_1.out      |  8 ++
 contrib/unaccent/sql/unaccent.sql             | 11 +++
 doc/src/sgml/ref/initdb.sgml                  | 48 +++++++-----
 src/bin/initdb/Makefile                       |  4 +-
 src/bin/initdb/initdb.c                       | 76 ++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl                |  7 +-
 src/bin/pg_dump/t/002_pg_dump.pl              |  2 +-
 src/bin/scripts/t/020_createdb.pl             |  2 +-
 src/interfaces/ecpg/test/Makefile             |  3 -
 src/interfaces/ecpg/test/connect/test5.pgc    |  2 +-
 .../ecpg/test/expected/connect-test5.c        |  2 +-
 .../ecpg/test/expected/connect-test5.stderr   |  2 +-
 src/interfaces/ecpg/test/meson.build          |  1 -
 src/test/icu/t/010_database.pl                |  2 +-
 18 files changed, 149 insertions(+), 42 deletions(-)
 create mode 100644 contrib/unaccent/expected/unaccent_1.out

diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 666b07ccec..85ce9c3b64 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -3,7 +3,9 @@
  * and a Unicode-aware locale.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 433e985349..60ddebd841 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -3,7 +3,9 @@
  * and a Unicode-aware locale.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index d068000b42..e9c504e764 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -4,7 +4,9 @@
  */
 
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/unaccent/expected/unaccent.out b/contrib/unaccent/expected/unaccent.out
index ee0ac71a1c..cef98ee60c 100644
--- a/contrib/unaccent/expected/unaccent.out
+++ b/contrib/unaccent/expected/unaccent.out
@@ -1,3 +1,12 @@
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
+\endif
 CREATE EXTENSION unaccent;
 -- must have a UTF8 database
 SELECT getdatabaseencoding();
diff --git a/contrib/unaccent/expected/unaccent_1.out b/contrib/unaccent/expected/unaccent_1.out
new file mode 100644
index 0000000000..0a4a3838ab
--- /dev/null
+++ b/contrib/unaccent/expected/unaccent_1.out
@@ -0,0 +1,8 @@
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index 3fc0c706be..027dfb964a 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -1,3 +1,14 @@
+
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
 CREATE EXTENSION unaccent;
 
 -- must have a UTF8 database
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 5b2bdac101..4f37386ea3 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -89,10 +89,28 @@ PostgreSQL documentation
    and character set encoding. These can also be set separately for each
    database when it is created. <command>initdb</command> determines those
    settings for the template databases, which will serve as the default for
-   all other databases.  By default, <command>initdb</command> uses the
-   locale provider <literal>libc</literal>, takes the locale settings from
-   the environment, and determines the encoding from the locale settings.
-   This is almost always sufficient, unless there are special requirements.
+   all other databases.
+  </para>
+
+  <para>
+   By default, <command>initdb</command> uses the ICU library to provide
+   locale services if the server was built with ICU support; otherwise it uses
+   the <literal>libc</literal> locale provider (see <xref
+   linkend="locale-providers"/>). To choose the specific ICU locale ID to
+   apply, use the option <option>--icu-locale</option>.  Note that for
+   implementation reasons and to support legacy code,
+   <command>initdb</command> will still select and initialize libc locale
+   settings when the ICU locale provider is used.
+  </para>
+
+  <para>
+   Alternatively, <command>initdb</command> can use the locale provider
+   <literal>libc</literal>. To select this option, specify
+   <literal>--locale-provider=libc</literal>, or build the server without ICU
+   support. The <literal>libc</literal> locale provider takes the locale
+   settings from the environment, and determines the encoding from the locale
+   settings.  This is almost always sufficient, unless there are special
+   requirements.
   </para>
 
   <para>
@@ -103,17 +121,6 @@ PostgreSQL documentation
    categories can give nonsensical results, so this should be used with care.
   </para>
 
-  <para>
-   Alternatively, the ICU library can be used to provide locale services.
-   (Again, this only sets the default for subsequently created databases.)  To
-   select this option, specify <literal>--locale-provider=icu</literal>.
-   To choose the specific ICU locale ID to apply, use the option
-   <option>--icu-locale</option>.  Note that
-   for implementation reasons and to support legacy code,
-   <command>initdb</command> will still select and initialize libc locale
-   settings when the ICU locale provider is used.
-  </para>
-
   <para>
    When <command>initdb</command> runs, it will print out the locale settings
    it has chosen.  If you have complex requirements or specified multiple
@@ -234,7 +241,8 @@ PostgreSQL documentation
       <term><option>--icu-locale=<replaceable>locale</replaceable></option></term>
       <listitem>
        <para>
-        Specifies the ICU locale ID, if the ICU locale provider is used.
+        Specifies the ICU locale ID, if the ICU locale provider is used. By
+        default, ICU obtains the ICU locale from the ICU default collator.
        </para>
       </listitem>
      </varlistentry>
@@ -297,10 +305,12 @@ PostgreSQL documentation
       <term><option>--locale-provider={<literal>libc</literal>|<literal>icu</literal>}</option></term>
       <listitem>
        <para>
-        This option sets the locale provider for databases created in the
-        new cluster.  It can be overridden in the <command>CREATE
+        This option sets the locale provider for databases created in the new
+        cluster.  It can be overridden in the <command>CREATE
         DATABASE</command> command when new databases are subsequently
-        created.  The default is <literal>libc</literal>.
+        created.  The default is <literal>icu</literal> if the server was
+        built with ICU support; otherwise the default is
+        <literal>libc</literal> (see <xref linkend="locale-providers"/>).
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/bin/initdb/Makefile b/src/bin/initdb/Makefile
index eab89c5501..d69bd89572 100644
--- a/src/bin/initdb/Makefile
+++ b/src/bin/initdb/Makefile
@@ -16,7 +16,7 @@ subdir = src/bin/initdb
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
+override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(ICU_CFLAGS) $(CPPFLAGS)
 
 # Note: it's important that we link to encnames.o from libpgcommon, not
 # from libpq, else we have risks of version skew if we run with a libpq
@@ -24,7 +24,7 @@ override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
 # should ensure that that happens.
 #
 # We need libpq only because fe_utils does.
-LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) $(ICU_LIBS)
 
 # use system timezone data?
 ifneq (,$(with_system_tzdata))
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..0776294499 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -53,6 +53,9 @@
 #include <netdb.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
 #include <unistd.h>
 #include <signal.h>
 #include <time.h>
@@ -133,7 +136,11 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
+#ifdef USE_ICU
+static char locale_provider = COLLPROVIDER_ICU;
+#else
 static char locale_provider = COLLPROVIDER_LIBC;
+#endif
 static char *icu_locale = NULL;
 static const char *default_text_search_config = NULL;
 static char *username = NULL;
@@ -2024,6 +2031,72 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+/*
+ * Check that ICU accepts the locale name; or if not specified, retrieve the
+ * default ICU locale.
+ */
+static void
+check_icu_locale()
+{
+#ifdef USE_ICU
+	UCollator	*collator;
+	UErrorCode   status;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+	{
+		if (icu_locale)
+			pg_fatal("could not open collator for locale \"%s\": %s",
+					 icu_locale, u_errorName(status));
+		else
+			pg_fatal("could not open collator for default locale: %s",
+					 u_errorName(status));
+	}
+
+	/* if not specified, get locale from default collator */
+	if (icu_locale == NULL)
+	{
+		const char	*default_locale;
+
+		status = U_ZERO_ERROR;
+		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+											  &status);
+		if (U_FAILURE(status))
+		{
+			ucol_close(collator);
+			pg_fatal("could not determine default ICU locale");
+		}
+
+		if (U_ICU_VERSION_MAJOR_NUM >= 54)
+		{
+			const bool	 strict = true;
+			char		*langtag;
+			int			 len;
+
+			len = uloc_toLanguageTag(default_locale, NULL, 0, strict, &status);
+			langtag = pg_malloc(len + 1);
+			status = U_ZERO_ERROR;
+			uloc_toLanguageTag(default_locale, langtag, len + 1, strict,
+							   &status);
+
+			if (U_FAILURE(status))
+			{
+				ucol_close(collator);
+				pg_fatal("could not determine language tag for default locale \"%s\": %s",
+						 default_locale, u_errorName(status));
+			}
+
+			icu_locale = langtag;
+		}
+		else
+			icu_locale = pg_strdup(default_locale);
+	}
+
+	ucol_close(collator);
+#endif
+}
+
 /*
  * set up the locale variables
  *
@@ -2077,8 +2150,7 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
-		if (!icu_locale)
-			pg_fatal("ICU locale must be specified");
+		check_icu_locale();
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 772769acab..e5d214e09c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -97,11 +97,6 @@ SKIP:
 
 if ($ENV{with_icu} eq 'yes')
 {
-	command_fails_like(
-		[ 'initdb', '--no-sync', '--locale-provider=icu', "$tempdir/data2" ],
-		qr/initdb: error: ICU locale must be specified/,
-		'locale provider ICU requires --icu-locale');
-
 	command_ok(
 		[
 			'initdb',                '--no-sync',
@@ -116,7 +111,7 @@ if ($ENV{with_icu} eq 'yes')
 			'--locale-provider=icu', '--icu-locale=@colNumeric=lower',
 			"$tempdir/dataX"
 		],
-		qr/FATAL:  could not open collator for locale/,
+		qr/error: could not open collator for locale/,
 		'fails for invalid ICU locale');
 
 	command_fails_like(
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d92247c915..30294f381c 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1684,7 +1684,7 @@ my %tests = (
 		create_sql =>
 		  "CREATE DATABASE dump_test2 LOCALE = 'C' TEMPLATE = template0;",
 		regexp => qr/^
-			\QCREATE DATABASE dump_test2 \E.*\QLOCALE = 'C';\E
+			\QCREATE DATABASE dump_test2 \E.*\QLOCALE = 'C'\E
 			/xm,
 		like => { pg_dumpall_dbprivs => 1, },
 	},
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 3ad4fbb00c..8ec58cdd64 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -13,7 +13,7 @@ program_version_ok('createdb');
 program_options_handling_ok('createdb');
 
 my $node = PostgreSQL::Test::Cluster->new('main');
-$node->init;
+$node->init(extra => ['--locale-provider=libc']);
 $node->start;
 
 $node->issues_sql_like(
diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index d7a7d1d1ca..cf841a3a5b 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -14,9 +14,6 @@ override CPPFLAGS := \
 	'-DSHELLPROG="$(SHELL)"' \
 	$(CPPFLAGS)
 
-# default encoding for regression tests
-ENCODING = SQL_ASCII
-
 ifneq ($(build_os),mingw32)
 abs_builddir := $(shell pwd)
 else
diff --git a/src/interfaces/ecpg/test/connect/test5.pgc b/src/interfaces/ecpg/test/connect/test5.pgc
index de29160089..d512553677 100644
--- a/src/interfaces/ecpg/test/connect/test5.pgc
+++ b/src/interfaces/ecpg/test/connect/test5.pgc
@@ -55,7 +55,7 @@ exec sql end declare section;
 	exec sql connect to 'unix:postgresql://localhost/ecpg2_regression' as main user :user USING "connectpw";
 	exec sql disconnect main;
 
-	exec sql connect to unix:postgresql://localhost/ecpg2_regression?connect_timeout=180&client_encoding=latin1 as main user regress_ecpg_user1/connectpw;
+	exec sql connect to unix:postgresql://localhost/ecpg2_regression?connect_timeout=180&client_encoding=sql_ascii as main user regress_ecpg_user1/connectpw;
 	exec sql disconnect main;
 
 	exec sql connect to "unix:postgresql://200.46.204.71/ecpg2_regression" as main user regress_ecpg_user1/connectpw;
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.c b/src/interfaces/ecpg/test/expected/connect-test5.c
index c1124c627f..ec1514ed9a 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.c
+++ b/src/interfaces/ecpg/test/expected/connect-test5.c
@@ -117,7 +117,7 @@ main(void)
 #line 56 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=latin1" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=sql_ascii" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
 #line 58 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.stderr b/src/interfaces/ecpg/test/expected/connect-test5.stderr
index 01a6a0a13b..51cc18916a 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.stderr
+++ b/src/interfaces/ecpg/test/expected/connect-test5.stderr
@@ -50,7 +50,7 @@
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: opening database ecpg2_regression on <DEFAULT> port <DEFAULT> with options connect_timeout=180 & client_encoding=latin1 for user regress_ecpg_user1
+[NO_PID]: ECPGconnect: opening database ecpg2_regression on <DEFAULT> port <DEFAULT> with options connect_timeout=180 & client_encoding=sql_ascii for user regress_ecpg_user1
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 00000
diff --git a/src/interfaces/ecpg/test/meson.build b/src/interfaces/ecpg/test/meson.build
index d0be73ccf9..04c6819a79 100644
--- a/src/interfaces/ecpg/test/meson.build
+++ b/src/interfaces/ecpg/test/meson.build
@@ -69,7 +69,6 @@ ecpg_test_files = files(
 ecpg_regress_args = [
   '--dbname=ecpg1_regression,ecpg2_regression',
   '--create-role=regress_ecpg_user1,regress_ecpg_user2',
-  '--encoding=SQL_ASCII',
 ]
 
 tests += {
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 80ab1c7789..45d77c319a 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -12,7 +12,7 @@ if ($ENV{with_icu} ne 'yes')
 }
 
 my $node1 = PostgreSQL::Test::Cluster->new('node1');
-$node1->init;
+$node1->init(extra => ['--locale-provider=libc']);
 $node1->start;
 
 $node1->safe_psql('postgres',
-- 
2.34.1

andres@anarazel.de

almost 3 years ago

In reply to: Jeff Davis (#12)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-14 09:48:08 -0800, Jeff Davis wrote:

On Fri, 2023-02-10 at 18:00 -0800, Andres Freund wrote:

But, shouldn't pg_upgrade be able to deal with this? As long as the
databases
are created with template0, we can create the collations at that
point?

Are you saying that the upgraded cluster could have a different default
collation for the template databases than the original cluster?

That would be wrong to do, at least by default, but I could see it
being a useful option.

Or maybe I misunderstand what you're saying?

I am saying that pg_upgrade should be able to deal with the difference. The
details of how to implement that, don't matter that much.

FWIW, I don't think it matters much what collation template0 has, since we
allow to change the locale provider when using template0 as the template.

We could easily update template0, if we think that's necessary. But I don't
think it really is. As long as the newly created databases have the right
provider, I'd lean towards not touching template0. But whatever...

Greetings,

Andres Freund

Jonathan S. Katz

jkatz@postgresql.org

almost 3 years ago

In reply to: Jeff Davis (#11)

Re: Move defaults toward ICU in 16?

On 2/13/23 8:11 PM, Jeff Davis wrote:

On Thu, 2023-02-02 at 05:13 -0800, Jeff Davis wrote:

As a project, do we want to nudge users toward ICU as the collation
provider as the best practice going forward?

One consideration here is security. Any vulnerability in ICU collation
routines could easily become a vulnerability in Postgres.

Would it be any different than a vulnerability in OpenSSL et al? I know
that's a general, nuanced question but it would be good to understand if
we are exposing ourselves to any more vulnerabilities. And would it be
any different than today, given people can build PG with libicu as is?

Continuing on $SUBJECT, I wanted to understand performance comparisons.
I saw your comments[1]/messages/by-id/b676252eeb57ab8da9dbb411d0ccace95caeda0a.camel@j-davis.com in response to Robert's question, looked at your
benchmarks[2]/messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com and one that ICU ran on older versions[3]https://icu.unicode.org/charts/collation-icu4c48-glibc. It seems that
in general, users would see performance gains switching to ICU. The only
one in [3]https://icu.unicode.org/charts/collation-icu4c48-glibc that stood out to me was the tests on the "ko_KR" collation
underperformed on a list of Korean names, but maybe that is better in
newer versions.

I agree with most of your points in [1]/messages/by-id/b676252eeb57ab8da9dbb411d0ccace95caeda0a.camel@j-davis.com. The platform-consistent
behavior is a good point, especially with more PG deployments running on
different systems. While taking on a new dependency is a concern, ICU
was released in 1999[4]https://en.wikipedia.org/wiki/International_Components_for_Unicode, has an active community, and seems to follow
standards (i.e. the Unicode Consortium).

I do wonder about upgrades, beyond the ongoing work with pg_upgrade. I
think the logical methods (pg_dumpall, logical replication) should
generally be OK, but we should ensure we think of things that could go
wrong and how we'd answer them.

Based on the available data, I think it's OK to move towards ICU as the
default, or preferred, collation provider. I agree (for now) in not
taking a hard dependency on ICU.

Thanks,

Jonathan

[1]: /messages/by-id/b676252eeb57ab8da9dbb411d0ccace95caeda0a.camel@j-davis.com
/messages/by-id/b676252eeb57ab8da9dbb411d0ccace95caeda0a.camel@j-davis.com
[2]: /messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com
/messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com
[3]: https://icu.unicode.org/charts/collation-icu4c48-glibc
[4]: https://en.wikipedia.org/wiki/International_Components_for_Unicode

pgsql@j-davis.com

almost 3 years ago

In reply to: Jonathan S. Katz (#14)

Re: Move defaults toward ICU in 16?

On Tue, 2023-02-14 at 16:27 -0500, Jonathan S. Katz wrote:

Would it be any different than a vulnerability in OpenSSL et al?

In principle, no, but sometimes the details matter. I'm just trying to
add data to the discussion.

It seems that
in general, users would see performance gains switching to ICU.

That's great news, and consistent with my experience. I don't think it
should be a driving factor though. If there's a choice is between
platform-independent semantics (ICU) and performance, platform-
independence should be the default.

I agree with most of your points in [1]. The platform-consistent
behavior is a good point, especially with more PG deployments running
on
different systems.

Now I think semantics are the most important driver, being consistent
across platforms and based on some kind of trusted independent
organization that we can point to.

It feels very wrong to me to explain that sort order is defined by the
operating system on which Postgres happens to run. Saying that it's
defined by ICU, which is part of the Unicode consotium, is much better.
It doesn't eliminate versioning issues, of course, but I think it's a
better explanation for users.

Many users have other systems in their data infrastructure, running on
a variety of platforms, and could (in theory) try to synchronize around
a common ICU version to avoid subtle bugs in their data pipeline.

Based on the available data, I think it's OK to move towards ICU as
the
default, or preferred, collation provider. I agree (for now) in not
taking a hard dependency on ICU.

I count several favorable responses, so I'll take it that we (as a
community) are intending to change the default for build and initdb in
v16.

Robert expressed some skepticism[1]/messages/by-id/CA+TgmoYmeGJaW=Py9tAZtrnCP+_Q+zRQthv=zn_HyA_nqEDM-A@mail.gmail.com, though I don't see an objection.
If I read his concerns correctly, he's mainly concerned with quality
issues like documentaiton, bugs, etc. I understand those concerns (I'm
the one that raised them), but they seem like the kind of issues that
one finds any time they dig into a dependency enough. "Setting our
sights very high"[1]/messages/by-id/CA+TgmoYmeGJaW=Py9tAZtrnCP+_Q+zRQthv=zn_HyA_nqEDM-A@mail.gmail.com, to me, would just be ICU with a bit more rigorous
attention to quality issues.

[1]: /messages/by-id/CA+TgmoYmeGJaW=Py9tAZtrnCP+_Q+zRQthv=zn_HyA_nqEDM-A@mail.gmail.com
/messages/by-id/CA+TgmoYmeGJaW=Py9tAZtrnCP+_Q+zRQthv=zn_HyA_nqEDM-A@mail.gmail.com

--
Jeff Davis
PostgreSQL Contributor Team - AWS

robertmhaas@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#15)

Re: Move defaults toward ICU in 16?

On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis <pgsql@j-davis.com> wrote:

It feels very wrong to me to explain that sort order is defined by the
operating system on which Postgres happens to run. Saying that it's
defined by ICU, which is part of the Unicode consotium, is much better.
It doesn't eliminate versioning issues, of course, but I think it's a
better explanation for users.

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

I feel like this is a very difficult kind of change to judge. If
everyone else feels this is a win, we should go with it, and hopefully
we'll end up better off. I do feel like there are things that could go
wrong, though, between the imperfect documentation, the fact that a
substantial chunk of our users won't be able to use it because they
run Windows, and everybody having to adjust to the behavior change.

--
Robert Haas
EDB: http://www.enterprisedb.com

laurenz.albe@cybertec.at

almost 3 years ago

In reply to: Robert Haas (#16)

Re: Move defaults toward ICU in 16?

On Thu, 2023-02-16 at 15:05 +0530, Robert Haas wrote:

On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis <pgsql@j-davis.com> wrote:

It feels very wrong to me to explain that sort order is defined by the
operating system on which Postgres happens to run. Saying that it's
defined by ICU, which is part of the Unicode consotium, is much better.
It doesn't eliminate versioning issues, of course, but I think it's a
better explanation for users.

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

I feel like this is a very difficult kind of change to judge. If
everyone else feels this is a win, we should go with it, and hopefully
we'll end up better off. I do feel like there are things that could go
wrong, though, between the imperfect documentation, the fact that a
substantial chunk of our users won't be able to use it because they
run Windows, and everybody having to adjust to the behavior change.

Unless I misunderstand, the lack of Windows support is not a matter
of principle and can be added later on, right?

I am in favor of changing the default. It might be good to add a section
to the documentation in "Server setup and operation" recommending that
if you go with the default choice of ICU, you should configure your
package manager not to upgrade the ICU library.

Yours,
Laurenz Albe

Jonathan S. Katz

jkatz@postgresql.org

almost 3 years ago

In reply to: Robert Haas (#16)

Re: Move defaults toward ICU in 16?

On 2/16/23 4:35 AM, Robert Haas wrote:

On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis <pgsql@j-davis.com> wrote:

It feels very wrong to me to explain that sort order is defined by the
operating system on which Postgres happens to run. Saying that it's
defined by ICU, which is part of the Unicode consotium, is much better.
It doesn't eliminate versioning issues, of course, but I think it's a
better explanation for users.

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

This is one reason why we're discussing ICU as the "preferred default"
vs. "the default." While it may not completely eliminate platform
dependent behavior for collations, it takes a step forward.

And AIUI, it does sound like ICU is available on newer versions of
Windows[1]https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu.

I feel like this is a very difficult kind of change to judge. If
everyone else feels this is a win, we should go with it, and hopefully
we'll end up better off. I do feel like there are things that could go
wrong, though, between the imperfect documentation, the fact that a
substantial chunk of our users won't be able to use it because they
run Windows, and everybody having to adjust to the behavior change.

We should continue to improve our documentation. Personally, I found the
biggest challenge was understanding how to set ICU locales / rules,
particularly for nondeterministic collations as it was challenging to
find where these were listed. I was able to overcome this with the
examples in our docs + blogs, but I agree it's an area we can continue
to improve upon.

Thanks,

Jonathan

[1]: https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu
https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu

andres@anarazel.de

almost 3 years ago

In reply to: Robert Haas (#16)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-16 15:05:10 +0530, Robert Haas wrote:

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

Why can't you use ICU on windows? It works today, afaict?

Greetings,

Andres Freund

robertmhaas@gmail.com

almost 3 years ago

In reply to: Andres Freund (#19)

Re: Move defaults toward ICU in 16?

On Thu, Feb 16, 2023 at 9:45 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-16 15:05:10 +0530, Robert Haas wrote:

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

Why can't you use ICU on windows? It works today, afaict?

Uh, I had the contrary impression from the discussion upthread, but it
sounds like I might be misunderstanding the situation?

--
Robert Haas
EDB: http://www.enterprisedb.com

andres@anarazel.de

almost 3 years ago

In reply to: Robert Haas (#20)

Re: Move defaults toward ICU in 16?

Hi,

On February 16, 2023 9:14:05 PM PST, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Feb 16, 2023 at 9:45 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-16 15:05:10 +0530, Robert Haas wrote:

The fact that we can't use ICU on Windows, though, weakens this
argument a lot. In my experience, we have a lot of Windows users, and
they're not any happier with the operating system collations than
Linux users. Possibly less so.

Why can't you use ICU on windows? It works today, afaict?

Uh, I had the contrary impression from the discussion upthread, but it
sounds like I might be misunderstanding the situation?

That was about the build environment in CI / cfbot, I think. Jeff was making icu a hard requirement by default, but ICU wasn't installed in a usable way, so the build failed. The patch he referred to was just building ICU during the CI run.

I do remember encountering issues with the mkvcbuild.pl build not building against a downloaded modern icu build, but that was just about library naming or directory structure, or such.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Michael Paquier

michael@paquier.xyz

almost 3 years ago

In reply to: Robert Haas (#20)

Re: Move defaults toward ICU in 16?

On Fri, Feb 17, 2023 at 10:44:05AM +0530, Robert Haas wrote:

Uh, I had the contrary impression from the discussion upthread, but it
sounds like I might be misunderstanding the situation?

IMO, it would be nice to be able to have the automatic detection of
meson work in the CFbot to see how this patch goes. Perhaps that's
not a reason enough to hold on this patch, though..

Separate question: what's the state of the Windows installers provided
by the community regarding libicu? Is that embedded in the MSI?
--
Michael

andres@anarazel.de

almost 3 years ago

In reply to: Michael Paquier (#22)

Re: Move defaults toward ICU in 16?

Hi,

On February 16, 2023 9:40:17 PM PST, Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Feb 17, 2023 at 10:44:05AM +0530, Robert Haas wrote:

Uh, I had the contrary impression from the discussion upthread, but it
sounds like I might be misunderstanding the situation?

IMO, it would be nice to be able to have the automatic detection of
meson work in the CFbot to see how this patch goes. Perhaps that's
not a reason enough to hold on this patch, though..

Fwiw, the manually triggered mingw task today builds with ICU support. One thing the patch could do is to just comment out the "manual" piece in .cirrus.yml, then cfbot would run it for just this cf entry.

I am planning to build the optional libraries that are easily built, as part of the image build for use by CI. Just haven't gotten around to it. The patch Jeff linked to is part of the experimentation on the way to that. If somebody else wants to finish that, even better. IIRC that prototype builds all optional dependencies except for kerberos and ossp-uuid.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

pgsql@j-davis.com

almost 3 years ago

In reply to: Andres Freund (#13)

Re: Move defaults toward ICU in 16?

On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote:

I am saying that pg_upgrade should be able to deal with the
difference. The
details of how to implement that, don't matter that much.

To clarify, you're saying that pg_upgrade should simply update
pg_database to set the new databases' collation fields equal to that of
the old cluster?

I'll submit it as a separate patch because it would be independently
useful.

Regards,
Jeff Davis

laurenz.albe@cybertec.at

almost 3 years ago

In reply to: Michael Paquier (#22)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-17 at 14:40 +0900, Michael Paquier wrote:

Separate question: what's the state of the Windows installers provided
by the community regarding libicu? Is that embedded in the MSI?

The EDB installer installs a quite old version of the ICU library
for compatibility reasons, as far as I know.

Yours,
Laurenz Albe

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#24)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote:

On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote:

I am saying that pg_upgrade should be able to deal with the
difference. The
details of how to implement that, don't matter that much.

To clarify, you're saying that pg_upgrade should simply update
pg_database to set the new databases' collation fields equal to that
of
the old cluster?

Thinking about this more, it's not clear to me if this would be in
scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster
rather than checking for compatibility, why doesn't it just take over
and do the initdb for the new cluster itself? That would be less
confusing for users, and avoid some weirdness (like, if you drop the
database "postgres" on the original, why does it reappear after an
upgrade?).

Someone might want to do something interesting to the new cluster
before the upgrade, but it's not clear from the docs what would be both
useful and safe.

Regards,
Jeff Davis

andres@anarazel.de

almost 3 years ago

In reply to: Jeff Davis (#26)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-17 09:01:54 -0800, Jeff Davis wrote:

On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote:

On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote:

I am saying that pg_upgrade should be able to deal with the
difference. The
details of how to implement that, don't matter that much.

To clarify, you're saying that pg_upgrade should simply update
pg_database to set the new databases' collation fields equal to that
of
the old cluster?

Yes.

Thinking about this more, it's not clear to me if this would be in
scope for pg_upgrade or not.

I don't think we should consider changing the default collation provider
without making this more seamless, one way or another.

If pg_upgrade is fixing up the new cluster rather than checking for
compatibility, why doesn't it just take over and do the initdb for the new
cluster itself? That would be less confusing for users, and avoid some
weirdness (like, if you drop the database "postgres" on the original, why
does it reappear after an upgrade?).

I've wondered about that as well. There are some initdb-time options you can
set, but pg_upgrade could forward those.

Greetings,

Andres Freund

pavel.stehule@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#26)

Re: Move defaults toward ICU in 16?

pá 17. 2. 2023 v 18:02 odesílatel Jeff Davis <pgsql@j-davis.com> napsal:

On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote:

On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote:

I am saying that pg_upgrade should be able to deal with the
difference. The
details of how to implement that, don't matter that much.

To clarify, you're saying that pg_upgrade should simply update
pg_database to set the new databases' collation fields equal to that
of
the old cluster?

Thinking about this more, it's not clear to me if this would be in
scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster
rather than checking for compatibility, why doesn't it just take over
and do the initdb for the new cluster itself? That would be less
confusing for users, and avoid some weirdness (like, if you drop the
database "postgres" on the original, why does it reappear after an
upgrade?).

Someone might want to do something interesting to the new cluster
before the upgrade, but it's not clear from the docs what would be both
useful and safe.

Today I tested icu for Czech sorting. It is a little bit slower, but not
too much, but it produces partially different results.

select row_number() over (order by nazev collate "cs-x-icu"), nazev from
obce
except select row_number() over (order by nazev collate "cs_CZ"), nazev
from obce;

returns a not empty set. So minimally for Czech collate, an index rebuild
is necessary

Regards

Pavel

Show quoted text

Regards,
Jeff Davis

pgsql@j-davis.com

almost 3 years ago

In reply to: Andres Freund (#27)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-17 at 09:05 -0800, Andres Freund wrote:

Thinking about this more, it's not clear to me if this would be in
scope for pg_upgrade or not.

I don't think we should consider changing the default collation
provider
without making this more seamless, one way or another.

I guess I'm fine hacking pg_upgrade, but I think I'd like to make it
conditional on this specific case: only perform the fixup if the old
cluster is 15 or earlier and using libc and the newer cluster is 16 or
later and using icu.

There's already a check that the new cluster is empty, so I think it's
safe to hack the pg_database locale fields.

Regards,
Jeff Davis

Show quoted text

andres@anarazel.de

almost 3 years ago

In reply to: Jeff Davis (#29)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-17 10:00:41 -0800, Jeff Davis wrote:

I guess I'm fine hacking pg_upgrade, but I think I'd like to make it
conditional on this specific case: only perform the fixup if the old
cluster is 15 or earlier and using libc and the newer cluster is 16 or
later and using icu.

-1. That's just going to cause pain one major version upgrade further down the
line. Why would we want to incur that pain?

There's already a check that the new cluster is empty, so I think it's
safe to hack the pg_database locale fields.

I don't think we need to, we do issue the CREATE DATABASEs. So we just need to
make sure that includes the collation provider info, and the proper template
database, in pg_upgrade mode.

Greetings,

Andres Freund

pryzby@telsasoft.com

almost 3 years ago

In reply to: Jeff Davis (#26)

Re: Move defaults toward ICU in 16?

On Fri, Feb 17, 2023 at 09:01:54AM -0800, Jeff Davis wrote:

On Fri, 2023-02-17 at 00:06 -0800, Jeff Davis wrote:

On Tue, 2023-02-14 at 09:59 -0800, Andres Freund wrote:

I am saying that pg_upgrade should be able to deal with the
difference. The
details of how to implement that, don't matter that much.

To clarify, you're saying that pg_upgrade should simply update
pg_database to set the new databases' collation fields equal to that
of
the old cluster?

Thinking about this more, it's not clear to me if this would be in
scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster
rather than checking for compatibility, why doesn't it just take over
and do the initdb for the new cluster itself? That would be less
confusing for users, and avoid some weirdness (like, if you drop the
database "postgres" on the original, why does it reappear after an
upgrade?).

Someone might want to do something interesting to the new cluster
before the upgrade, but it's not clear from the docs what would be both
useful and safe.

This came up before - I'm of the opinion that it's unsupported and/or
useless to try to do anything on the new cluster between initdb and
pg_upgrade.

/messages/by-id/20220707184410.GB13040@telsasoft.com
/messages/by-id/20220905170322.GM31833@telsasoft.com

--
Justin

pgsql@j-davis.com

almost 3 years ago

In reply to: Andres Freund (#30)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-17 at 10:09 -0800, Andres Freund wrote:

-1. That's just going to cause pain one major version upgrade further
down the
line. Why would we want to incur that pain?

OK, we can just always do the fixup as long as the old one is libc and
the new one is ICU. I'm just trying to avoid this becoming a general
mechanism to fix up an incompatible new cluster.

There's already a check that the new cluster is empty, so I think
it's
safe to hack the pg_database locale fields.

I don't think we need to, we do issue the CREATE DATABASEs. So we
just need to
make sure that includes the collation provider info, and the proper
template
database, in pg_upgrade mode.

We must fixup template1/postgres in the new cluster (change it to libc
to match the old cluster), because any objects existing in those
databases in the old cluster may depend on the default collation. I
don't see how we can do that without updating pg_database, so I'm not
following your point.

(I think you're right that template0 is optional; but since we're
fixing up the other databases it would be less surprising if we also
fixed up template0.)

And if we do fixup template0/template1/postgres to match the old
cluster, then CREATE DATABASE will have no issue.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

pgsql@j-davis.com

almost 3 years ago

In reply to: Pavel Stehule (#28)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-17 at 18:27 +0100, Pavel Stehule wrote:

Today I tested icu for Czech sorting. It is a little bit slower, but
not too much, but it produces partially different results.

Thank you for trying it.

If it's a significant slowdown, can you please send more information?
ICU version, libc version, and testcase?

select row_number() over (order by nazev collate "cs-x-icu"), nazev
from obce
except select row_number() over (order by nazev collate "cs_CZ"),
nazev from obce;

returns a not empty set. So minimally for Czech collate, an index
rebuild is necessary

Yes, that's true of any locale change, provider change, or even
provider version change.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

andres@anarazel.de

almost 3 years ago

In reply to: Jeff Davis (#32)

Re: Move defaults toward ICU in 16?

Hi,

On 2023-02-17 12:36:05 -0800, Jeff Davis wrote:

There's already a check that the new cluster is empty, so I think
it's
safe to hack the pg_database locale fields.

I don't think we need to, we do issue the CREATE DATABASEs. So we
just need to
make sure that includes the collation provider info, and the proper
template
database, in pg_upgrade mode.

We must fixup template1/postgres in the new cluster (change it to libc
to match the old cluster), because any objects existing in those
databases in the old cluster may depend on the default collation. I
don't see how we can do that without updating pg_database, so I'm not
following your point.

I think we just drop/recreate template1 and postgres during pg_upgrade. Yep,
looks like it. See create_new_objects():

/*
* template1 database will already exist in the target installation,
* so tell pg_restore to drop and recreate it; otherwise we would fail
* to propagate its database-level properties.
*/
create_opts = "--clean --create";

and then:

/*
* postgres database will already exist in the target installation, so
* tell pg_restore to drop and recreate it; otherwise we would fail to
* propagate its database-level properties.
*/
if (strcmp(old_db->db_name, "postgres") == 0)
create_opts = "--clean --create";
else
create_opts = "--create";

Greetings,

Andres Freund

pgsql@j-davis.com

almost 3 years ago

In reply to: Andres Freund (#34)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-17 at 12:50 -0800, Andres Freund wrote:

I think we just drop/recreate template1 and postgres during
pg_upgrade.

Thank you, that makes much more sense now.

I was confused because pg_upgrade loops through to check compatibility
with all the databases, which makes zero sense if it's going to drop
all of them except template0 anyway. The checks on template1/postgres
should be bypassed.

So the two approaches are:

1. Don't bother with locale provider compatibility checks at all (even
on template0). The emitted CREATE DATABASE statements already specify
the locale provider, so that will take care of everything except
template0. Maybe the locale provider of template0 shouldn't matter, but
some users might be surprised if it changes during upgrade. It also
raises some questions about the other properties of template0 like
encoding, lc_collate, and lc_ctype, which also don't matter a whole lot
(because they can all be changed when using template0 as a template).

2. Update the pg_database entry for template0. This has less potential
for surprise in case people are actually using template0 for a
template.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

pavel.stehule@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#33)

Re: Move defaults toward ICU in 16?

pá 17. 2. 2023 v 21:43 odesílatel Jeff Davis <pgsql@j-davis.com> napsal:

On Fri, 2023-02-17 at 18:27 +0100, Pavel Stehule wrote:

Today I tested icu for Czech sorting. It is a little bit slower, but
not too much, but it produces partially different results.

Thank you for trying it.

If it's a significant slowdown, can you please send more information?
ICU version, libc version, and testcase?

no - this slowdown is not significant - although 1% can looks too much -
but it is just two ms

It looks so libicu has little bit more expensive initialization, but the
execution is little bit faster

But when I try to repeat the measurements, the results are very unstable on
my desktop :-/

SELECT * FROM obce ORDER BY nazev LIMIT 10 // is faster with glibc little
bit
SELECT * FROM obce ORDER BY nazev // is faster with libicu

You can download dataset https://pgsql.cz/files/obce.sql

It is table of municipalities in czech republic (real names) - about 6000
rows

I use fedora 37 - so libicu 71.1, glibc 2.36

Regards

Pavel

Show quoted text

select row_number() over (order by nazev collate "cs-x-icu"), nazev
from obce
except select row_number() over (order by nazev collate "cs_CZ"),
nazev from obce;

returns a not empty set. So minimally for Czech collate, an index
rebuild is necessary

Yes, that's true of any locale change, provider change, or even
provider version change.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

robertmhaas@gmail.com

almost 3 years ago

In reply to: Jeff Davis (#26)

Re: Move defaults toward ICU in 16?

On Fri, Feb 17, 2023 at 10:32 PM Jeff Davis <pgsql@j-davis.com> wrote:

Thinking about this more, it's not clear to me if this would be in
scope for pg_upgrade or not. If pg_upgrade is fixing up the new cluster
rather than checking for compatibility, why doesn't it just take over
and do the initdb for the new cluster itself? That would be less
confusing for users, and avoid some weirdness (like, if you drop the
database "postgres" on the original, why does it reappear after an
upgrade?).

Someone might want to do something interesting to the new cluster
before the upgrade, but it's not clear from the docs what would be both
useful and safe.

I agree with all of this. I think it would be fantastic if pg_upgrade
did the initdb itself. It would be simple to make this optional
behavior, too: if the destination directory does not exist or is
empty, initdb into it, otherwise skip that. That might be too
automagical, so we could add add a --no-initdb option. If not
specified, the destination directory must either not exist or be
empty; else it must exist and look like a data directory.

I completely concur with the idea that doing something with the new
cluster before the upgrade is weird, and I don't think we should
encourage people to do it. Nevertheless, as the threads to which
Justin linked probably say, I'm not sure that it's a good idea to
completely slam the door shut on that option. If we did want to move
in that direction, though, having pg_upgrade do the initdb would be an
excellent first step. We could at some later time decide to remove the
--no-initdb option; or maybe we'll decide that it's good to keep it
for emergencies, which is my present bias. In any event, the resulting
system would be more usable and less error-prone than what we have
today.

--
Robert Haas
EDB: http://www.enterprisedb.com

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#33)

Re: Move defaults toward ICU in 16?

On 17.02.23 21:43, Jeff Davis wrote:

select row_number() over (order by nazev collate "cs-x-icu"), nazev
from obce
except select row_number() over (order by nazev collate "cs_CZ"),
nazev from obce;

returns a not empty set. So minimally for Czech collate, an index
rebuild is necessary

Yes, that's true of any locale change, provider change, or even
provider version change.

I'm confused. We are not going to try to change existing databases to a
different collation provider during pg_upgrade, are we?

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#38)

Re: Move defaults toward ICU in 16?

On Mon, 2023-02-20 at 15:55 +0100, Peter Eisentraut wrote:

I'm confused. We are not going to try to change existing databases
to a
different collation provider during pg_upgrade, are we?

No, certainly not.

I interpreted Pavel's comments as a comparison of ICU and libc in
general and not specific to this patch. Changing providers obviously
requires an index rebuild to be safe.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#35)

3 attachment(s)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-17 at 15:07 -0800, Jeff Davis wrote:

2. Update the pg_database entry for template0. This has less
potential
for surprise in case people are actually using template0 for a
template.

New patches attached.

0001: default autoconf to build with ICU (meson already uses 'auto')
0002: update template0 in new cluster (as described above)
0003: default initdb to use ICU

Updating template0, as in 0002, seems straightforward and unsurprising,
since only template0 is preserved and it was only initialized for the
purposes of upgrading. Also, template0 is not sensitive to locale
settings, and doesn't even have the datcollversion set. The patch
updates encoding, datlocprovider, datcollate, datctype, and
daticulocale on the new cluster. No doc update, because there are some
initdb settings (like checksums) which still need to be compatible
between the old and the new cluster.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v4-0001-Build-ICU-support-by-default.patchtext/x-patch; charset=UTF-8; name=v4-0001-Build-ICU-support-by-default.patchDownload

From 34fcb5ebfd0463a659170ea0d7599bd37d6c3e76 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 10 Feb 2023 12:08:11 -0800
Subject: [PATCH v4 1/3] Build ICU support by default.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 .cirrus.yml                    |  1 +
 configure                      | 36 ++++++----------
 configure.ac                   |  8 +++-
 doc/src/sgml/installation.sgml | 76 +++++++++++++++++++---------------
 4 files changed, 61 insertions(+), 60 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index f212978752..34450a9c7b 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -751,6 +751,7 @@ task:
       time ./configure \
         --host=x86_64-w64-mingw32 \
         --enable-cassert \
+        --without-icu \
         CC="ccache x86_64-w64-mingw32-gcc" \
         CXX="ccache x86_64-w64-mingw32-g++"
       make -s -j${BUILD_JOBS} clean
diff --git a/configure b/configure
index e35769ea73..436c2446e8 100755
--- a/configure
+++ b/configure
@@ -1558,7 +1558,7 @@ Optional Packages:
                           set WAL block size in kB [8]
   --with-CC=CMD           set compiler (deprecated)
   --with-llvm             build with LLVM based JIT support
-  --with-icu              build with ICU support
+  --without-icu           build without ICU support
   --with-tcl              build Tcl modules (PL/Tcl)
   --with-tclconfig=DIR    tclConfig.sh is in DIR
   --with-perl             build Perl modules (PL/Perl)
@@ -8401,7 +8401,9 @@ $as_echo "#define USE_ICU 1" >>confdefs.h
   esac
 
 else
-  with_icu=no
+  with_icu=yes
+
+$as_echo "#define USE_ICU 1" >>confdefs.h
 
 fi
 
@@ -8470,31 +8472,17 @@ fi
 	# Put the nasty error message in config.log where it belongs
 	echo "$ICU_PKG_ERRORS" >&5
 
-	as_fn_error $? "Package requirements (icu-uc icu-i18n) were not met:
-
-$ICU_PKG_ERRORS
-
-Consider adjusting the PKG_CONFIG_PATH environment variable if you
-installed software in a non-standard prefix.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details." "$LINENO" 5
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 elif test $pkg_failed = untried; then
         { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
 $as_echo "no" >&6; }
-	{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
-$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
-as_fn_error $? "The pkg-config script could not be found or is too old.  Make sure it
-is in your PATH or set the PKG_CONFIG environment variable to the full
-path to pkg-config.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details.
-
-To get pkg-config, see <http://pkg-config.freedesktop.org/>.
-See \`config.log' for more details" "$LINENO" 5; }
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 else
 	ICU_CFLAGS=$pkg_cv_ICU_CFLAGS
 	ICU_LIBS=$pkg_cv_ICU_LIBS
diff --git a/configure.ac b/configure.ac
index af23c15cb2..215e32120f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -853,13 +853,17 @@ AC_SUBST(enable_thread_safety)
 # ICU
 #
 AC_MSG_CHECKING([whether to build with ICU support])
-PGAC_ARG_BOOL(with, icu, no, [build with ICU support],
+PGAC_ARG_BOOL(with, icu, yes, [build without ICU support],
               [AC_DEFINE([USE_ICU], 1, [Define to build with ICU support. (--with-icu)])])
 AC_MSG_RESULT([$with_icu])
 AC_SUBST(with_icu)
 
 if test "$with_icu" = yes; then
-  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n)
+  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n, [],
+    [AC_MSG_ERROR([ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support.])])
 fi
 
 #
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 0ed35d99e9..4081820eca 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -146,6 +146,35 @@ documentation.  See standalone-profile.xsl for details.
       <application>pg_restore</application>.
      </para>
     </listitem>
+
+    <listitem>
+     <para>
+      The ICU locale provider (see <xref linkend="locale-providers"/>) is used by default. If you don't want to use it then you must specify the <option>--without-icu</option> option to <filename>configure</filename>. Using this option disables support for ICU collation features (see <xref linkend="collation"/>).
+     </para>
+     <para>
+      ICU support requires the <productname>ICU4C</productname> package to be
+      installed.  The minimum required version of
+      <productname>ICU4C</productname> is currently 4.2.
+     </para>
+
+     <para>
+      By default,
+      <productname>pkg-config</productname><indexterm><primary>pkg-config</primary></indexterm>
+      will be used to find the required compilation options.  This is
+      supported for <productname>ICU4C</productname> version 4.6 and later.
+      For older versions, or if <productname>pkg-config</productname> is not
+      available, the variables <envar>ICU_CFLAGS</envar> and
+      <envar>ICU_LIBS</envar> can be specified to
+      <filename>configure</filename>, like in this example:
+<programlisting>
+./configure ... ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
+</programlisting>
+      (If <productname>ICU4C</productname> is in the default search path
+      for the compiler, then you still need to specify nonempty strings in
+      order to avoid use of <productname>pkg-config</productname>, for
+      example, <literal>ICU_CFLAGS=' '</literal>.)
+     </para>
+    </listitem>
    </itemizedlist>
   </para>
 
@@ -926,40 +955,6 @@ build-postgresql:
        </listitem>
       </varlistentry>
 
-      <varlistentry id="configure-option-with-icu">
-       <term><option>--with-icu</option></term>
-       <listitem>
-        <para>
-         Build with support for
-         the <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
-         library, enabling use of ICU collation
-         features<phrase condition="standalone-ignore"> (see
-         <xref linkend="collation"/>)</phrase>.
-         This requires the <productname>ICU4C</productname> package
-         to be installed.  The minimum required version
-         of <productname>ICU4C</productname> is currently 4.2.
-        </para>
-
-        <para>
-         By default,
-         <productname>pkg-config</productname><indexterm><primary>pkg-config</primary></indexterm>
-         will be used to find the required compilation options.  This is
-         supported for <productname>ICU4C</productname> version 4.6 and later.
-         For older versions, or if <productname>pkg-config</productname> is
-         not available, the variables <envar>ICU_CFLAGS</envar>
-         and <envar>ICU_LIBS</envar> can be specified
-         to <filename>configure</filename>, like in this example:
-<programlisting>
-./configure ... --with-icu ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
-</programlisting>
-         (If <productname>ICU4C</productname> is in the default search path
-         for the compiler, then you still need to specify nonempty strings in
-         order to avoid use of <productname>pkg-config</productname>, for
-         example, <literal>ICU_CFLAGS=' '</literal>.)
-        </para>
-       </listitem>
-      </varlistentry>
-
       <varlistentry id="configure-with-llvm">
        <term><option>--with-llvm</option></term>
        <listitem>
@@ -1231,6 +1226,19 @@ build-postgresql:
 
      <variablelist>
 
+      <varlistentry id="configure-option-without-icu">
+       <term><option>--without-icu</option></term>
+       <listitem>
+        <para>
+         Build without support for the
+         <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
+         library, disabling the use of ICU collation features<phrase
+         condition="standalone-ignore"> (see <xref
+         linkend="collation"/>)</phrase>.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="configure-option-without-readline">
        <term><option>--without-readline</option></term>
        <listitem>
-- 
2.34.1

v4-0002-pg_upgrade-copy-locale-and-encoding-information-t.patchtext/x-patch; charset=UTF-8; name=v4-0002-pg_upgrade-copy-locale-and-encoding-information-t.patchDownload

From d14def6ba7ad2bffa12e2e6aa7bac7946bdfb4d5 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 24 Feb 2023 08:55:11 -0800
Subject: [PATCH v4 2/3] pg_upgrade: copy locale and encoding information to
 new cluster.

Previously, pg_upgrade checked that the old and new clusters were
compatible, including the locale and encoding. But the new cluster was
just created, and only template0 from the new cluster will be
preserved (template1 and postgres are both recreated during the
upgrade process).

Because template0 is not sensitive to locale or encoding, just update
the pg_database entry to be the same as template0 from the old
cluster.

This commit makes it easier to change the default initdb locale or
encoding settings without causing needless incompatibilities.
---
 src/bin/pg_upgrade/check.c      | 160 --------------------------------
 src/bin/pg_upgrade/info.c       |  50 ++++++++++
 src/bin/pg_upgrade/pg_upgrade.c |  62 +++++++++++++
 src/bin/pg_upgrade/pg_upgrade.h |   1 +
 4 files changed, 113 insertions(+), 160 deletions(-)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 7cf68dc9af..b71b00be37 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -16,9 +16,6 @@
 #include "pg_upgrade.h"
 
 static void check_new_cluster_is_empty(void);
-static void check_databases_are_compatible(void);
-static void check_locale_and_encoding(DbInfo *olddb, DbInfo *newdb);
-static bool equivalent_locale(int category, const char *loca, const char *locb);
 static void check_is_install_user(ClusterInfo *cluster);
 static void check_proper_datallowconn(ClusterInfo *cluster);
 static void check_for_prepared_transactions(ClusterInfo *cluster);
@@ -33,7 +30,6 @@ static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
 static void check_for_user_defined_encoding_conversions(ClusterInfo *cluster);
-static char *get_canonical_locale_name(int category, const char *locale);
 
 
 /*
@@ -194,7 +190,6 @@ check_new_cluster(void)
 	get_db_and_rel_infos(&new_cluster);
 
 	check_new_cluster_is_empty();
-	check_databases_are_compatible();
 
 	check_loadable_libraries();
 
@@ -349,94 +344,6 @@ check_cluster_compatibility(bool live_check)
 }
 
 
-/*
- * check_locale_and_encoding()
- *
- * Check that locale and encoding of a database in the old and new clusters
- * are compatible.
- */
-static void
-check_locale_and_encoding(DbInfo *olddb, DbInfo *newdb)
-{
-	if (olddb->db_encoding != newdb->db_encoding)
-		pg_fatal("encodings for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name,
-				 pg_encoding_to_char(olddb->db_encoding),
-				 pg_encoding_to_char(newdb->db_encoding));
-	if (!equivalent_locale(LC_COLLATE, olddb->db_collate, newdb->db_collate))
-		pg_fatal("lc_collate values for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name, olddb->db_collate, newdb->db_collate);
-	if (!equivalent_locale(LC_CTYPE, olddb->db_ctype, newdb->db_ctype))
-		pg_fatal("lc_ctype values for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name, olddb->db_ctype, newdb->db_ctype);
-	if (olddb->db_collprovider != newdb->db_collprovider)
-		pg_fatal("locale providers for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name,
-				 collprovider_name(olddb->db_collprovider),
-				 collprovider_name(newdb->db_collprovider));
-	if ((olddb->db_iculocale == NULL && newdb->db_iculocale != NULL) ||
-		(olddb->db_iculocale != NULL && newdb->db_iculocale == NULL) ||
-		(olddb->db_iculocale != NULL && newdb->db_iculocale != NULL && strcmp(olddb->db_iculocale, newdb->db_iculocale) != 0))
-		pg_fatal("ICU locale values for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name,
-				 olddb->db_iculocale ? olddb->db_iculocale : "(null)",
-				 newdb->db_iculocale ? newdb->db_iculocale : "(null)");
-}
-
-/*
- * equivalent_locale()
- *
- * Best effort locale-name comparison.  Return false if we are not 100% sure
- * the locales are equivalent.
- *
- * Note: The encoding parts of the names are ignored. This function is
- * currently used to compare locale names stored in pg_database, and
- * pg_database contains a separate encoding field. That's compared directly
- * in check_locale_and_encoding().
- */
-static bool
-equivalent_locale(int category, const char *loca, const char *locb)
-{
-	const char *chara;
-	const char *charb;
-	char	   *canona;
-	char	   *canonb;
-	int			lena;
-	int			lenb;
-
-	/*
-	 * If the names are equal, the locales are equivalent. Checking this first
-	 * avoids calling setlocale() in the common case that the names are equal.
-	 * That's a good thing, if setlocale() is buggy, for example.
-	 */
-	if (pg_strcasecmp(loca, locb) == 0)
-		return true;
-
-	/*
-	 * Not identical. Canonicalize both names, remove the encoding parts, and
-	 * try again.
-	 */
-	canona = get_canonical_locale_name(category, loca);
-	chara = strrchr(canona, '.');
-	lena = chara ? (chara - canona) : strlen(canona);
-
-	canonb = get_canonical_locale_name(category, locb);
-	charb = strrchr(canonb, '.');
-	lenb = charb ? (charb - canonb) : strlen(canonb);
-
-	if (lena == lenb && pg_strncasecmp(canona, canonb, lena) == 0)
-	{
-		pg_free(canona);
-		pg_free(canonb);
-		return true;
-	}
-
-	pg_free(canona);
-	pg_free(canonb);
-	return false;
-}
-
-
 static void
 check_new_cluster_is_empty(void)
 {
@@ -460,35 +367,6 @@ check_new_cluster_is_empty(void)
 	}
 }
 
-/*
- * Check that every database that already exists in the new cluster is
- * compatible with the corresponding database in the old one.
- */
-static void
-check_databases_are_compatible(void)
-{
-	int			newdbnum;
-	int			olddbnum;
-	DbInfo	   *newdbinfo;
-	DbInfo	   *olddbinfo;
-
-	for (newdbnum = 0; newdbnum < new_cluster.dbarr.ndbs; newdbnum++)
-	{
-		newdbinfo = &new_cluster.dbarr.dbs[newdbnum];
-
-		/* Find the corresponding database in the old cluster */
-		for (olddbnum = 0; olddbnum < old_cluster.dbarr.ndbs; olddbnum++)
-		{
-			olddbinfo = &old_cluster.dbarr.dbs[olddbnum];
-			if (strcmp(newdbinfo->db_name, olddbinfo->db_name) == 0)
-			{
-				check_locale_and_encoding(olddbinfo, newdbinfo);
-				break;
-			}
-		}
-	}
-}
-
 /*
  * A previous run of pg_upgrade might have failed and the new cluster
  * directory recreated, but they might have forgotten to remove
@@ -1524,41 +1402,3 @@ check_for_user_defined_encoding_conversions(ClusterInfo *cluster)
 	else
 		check_ok();
 }
-
-
-/*
- * get_canonical_locale_name
- *
- * Send the locale name to the system, and hope we get back a canonical
- * version.  This should match the backend's check_locale() function.
- */
-static char *
-get_canonical_locale_name(int category, const char *locale)
-{
-	char	   *save;
-	char	   *res;
-
-	/* get the current setting, so we can restore it. */
-	save = setlocale(category, NULL);
-	if (!save)
-		pg_fatal("failed to get the current locale");
-
-	/* 'save' may be pointing at a modifiable scratch variable, so copy it. */
-	save = pg_strdup(save);
-
-	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
-
-	if (!res)
-		pg_fatal("failed to get system locale name for \"%s\"", locale);
-
-	res = pg_strdup(res);
-
-	/* restore old value. */
-	if (!setlocale(category, save))
-		pg_fatal("failed to restore old locale \"%s\"", save);
-
-	pg_free(save);
-
-	return res;
-}
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index c1399c09b9..0490edfc45 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -20,6 +20,7 @@ static void create_rel_filename_map(const char *old_data, const char *new_data,
 static void report_unmatched_relation(const RelInfo *rel, const DbInfo *db,
 									  bool is_new_db);
 static void free_db_and_rel_infos(DbInfoArr *db_arr);
+static void get_template0_info(ClusterInfo *cluster);
 static void get_db_infos(ClusterInfo *cluster);
 static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
 static void free_rel_infos(RelInfoArr *rel_arr);
@@ -278,6 +279,7 @@ get_db_and_rel_infos(ClusterInfo *cluster)
 	if (cluster->dbarr.dbs != NULL)
 		free_db_and_rel_infos(&cluster->dbarr);
 
+	get_template0_info(cluster);
 	get_db_infos(cluster);
 
 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
@@ -293,6 +295,54 @@ get_db_and_rel_infos(ClusterInfo *cluster)
 }
 
 
+/*
+ * Get information about template0, which will be copied from the old cluster
+ * to the new cluster.
+ */
+static void
+get_template0_info(ClusterInfo *cluster)
+{
+	PGconn		*conn = connectToServer(cluster, "template1");
+	DbInfo		*dbinfo;
+	PGresult	*dbres;
+	int			 i_datencoding;
+	int			 i_datlocprovider;
+	int			 i_datcollate;
+	int			 i_datctype;
+	int			 i_daticulocale;
+
+	dbres = executeQueryOrDie(conn,
+							  "SELECT encoding, datlocprovider, "
+							  "       datcollate, datctype, daticulocale "
+							  "FROM	pg_catalog.pg_database "
+							  "WHERE datname='template0'");
+
+	if (PQntuples(dbres) != 1)
+		pg_fatal("template0 not found");
+
+	cluster->template0 = pg_malloc(sizeof(DbInfo));
+	dbinfo = cluster->template0;
+
+	i_datencoding = PQfnumber(dbres, "encoding");
+	i_datlocprovider = PQfnumber(dbres, "datlocprovider");
+	i_datcollate = PQfnumber(dbres, "datcollate");
+	i_datctype = PQfnumber(dbres, "datctype");
+	i_daticulocale = PQfnumber(dbres, "daticulocale");
+
+	dbinfo->db_encoding = atoi(PQgetvalue(dbres, 0, i_datencoding));
+	dbinfo->db_collprovider = PQgetvalue(dbres, 0, i_datlocprovider)[0];
+	dbinfo->db_collate = pg_strdup(PQgetvalue(dbres, 0, i_datcollate));
+	dbinfo->db_ctype = pg_strdup(PQgetvalue(dbres, 0, i_datctype));
+	if (PQgetisnull(dbres, 0, i_daticulocale))
+		dbinfo->db_iculocale = NULL;
+	else
+		dbinfo->db_iculocale = pg_strdup(PQgetvalue(dbres, 0, i_daticulocale));
+
+	PQclear(dbres);
+	PQfinish(conn);
+}
+
+
 /*
  * get_db_infos()
  *
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index e5597d3105..bd005056d1 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -51,6 +51,7 @@
 #include "fe_utils/string_utils.h"
 #include "pg_upgrade.h"
 
+static void set_locale_and_encoding(void);
 static void prepare_new_cluster(void);
 static void prepare_new_globals(void);
 static void create_new_objects(void);
@@ -139,6 +140,8 @@ main(int argc, char **argv)
 		   "Performing Upgrade\n"
 		   "------------------");
 
+	set_locale_and_encoding();
+
 	prepare_new_cluster();
 
 	stop_postmaster(false);
@@ -366,6 +369,65 @@ setup(char *argv0, bool *live_check)
 }
 
 
+/*
+ * Copy locale and encoding information into the new cluster's template0.
+ *
+ * We need to copy the encoding, datlocprovider, datcollate, datctype, and
+ * daticulocale. We don't need datcollversion because that's never set for
+ * template0.
+ */
+static void
+set_locale_and_encoding(void)
+{
+	PGconn		*conn_new_template1;
+	char		*datcollate_literal;
+	char		*datctype_literal;
+	char		*daticulocale_literal	= NULL;
+	DbInfo		*dbinfo = old_cluster.template0;
+
+	prep_status("Setting locale and encoding for new cluster");
+
+	/* escape literals with respect to new cluster */
+	conn_new_template1 = connectToServer(&new_cluster, "template1");
+
+	datcollate_literal = PQescapeLiteral(conn_new_template1,
+										 dbinfo->db_collate,
+										 strlen(dbinfo->db_collate));
+	datctype_literal = PQescapeLiteral(conn_new_template1,
+									   dbinfo->db_ctype,
+									   strlen(dbinfo->db_ctype));
+	if (dbinfo->db_iculocale)
+		daticulocale_literal = PQescapeLiteral(conn_new_template1,
+											   dbinfo->db_iculocale,
+											   strlen(dbinfo->db_iculocale));
+	else
+		daticulocale_literal = pg_strdup("NULL");
+
+	/* update template0 in new cluster */
+	PQclear(executeQueryOrDie(conn_new_template1,
+							  "UPDATE pg_catalog.pg_database "
+							  "  SET encoding = %u, "
+							  "      datlocprovider = '%c', "
+							  "      datcollate = %s, "
+							  "      datctype = %s, "
+							  "      daticulocale = %s "
+							  "  WHERE datname = 'template0' ",
+							  dbinfo->db_encoding,
+							  dbinfo->db_collprovider,
+							  datcollate_literal,
+							  datctype_literal,
+							  daticulocale_literal));
+
+	PQfreemem(datcollate_literal);
+	PQfreemem(datctype_literal);
+	PQfreemem(daticulocale_literal);
+
+	PQfinish(conn_new_template1);
+
+	check_ok();
+}
+
+
 static void
 prepare_new_cluster(void)
 {
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 5f2a116f23..5561fc0473 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -252,6 +252,7 @@ typedef enum
 typedef struct
 {
 	ControlData controldata;	/* pg_control information */
+	DbInfo	   *template0;		/* template0 info */
 	DbInfoArr	dbarr;			/* dbinfos array */
 	char	   *pgdata;			/* pathname for cluster's $PGDATA directory */
 	char	   *pgconfig;		/* pathname for cluster's config file
-- 
2.34.1

v4-0003-Use-ICU-by-default-at-initdb-time.patchtext/x-patch; charset=UTF-8; name=v4-0003-Use-ICU-by-default-at-initdb-time.patchDownload

From 82073ce781b7f12e210d4534d83840dab691196f Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 8 Feb 2023 12:06:26 -0800
Subject: [PATCH v4 3/3] Use ICU by default at initdb time.

If the ICU locale is not specified, initialize the default collator
and retrieve the locale name from that.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 contrib/citext/expected/citext_utf8.out       |  4 +-
 contrib/citext/expected/citext_utf8_1.out     |  4 +-
 contrib/citext/sql/citext_utf8.sql            |  4 +-
 contrib/unaccent/expected/unaccent.out        |  9 +++
 contrib/unaccent/expected/unaccent_1.out      |  8 ++
 contrib/unaccent/sql/unaccent.sql             | 11 +++
 doc/src/sgml/ref/initdb.sgml                  | 48 +++++++-----
 src/bin/initdb/Makefile                       |  4 +-
 src/bin/initdb/initdb.c                       | 76 ++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl                |  7 +-
 src/bin/pg_dump/t/002_pg_dump.pl              |  2 +-
 src/bin/scripts/t/020_createdb.pl             |  2 +-
 src/interfaces/ecpg/test/Makefile             |  3 -
 src/interfaces/ecpg/test/connect/test5.pgc    |  2 +-
 .../ecpg/test/expected/connect-test5.c        |  2 +-
 .../ecpg/test/expected/connect-test5.stderr   |  2 +-
 src/interfaces/ecpg/test/meson.build          |  1 -
 src/test/icu/t/010_database.pl                |  2 +-
 18 files changed, 149 insertions(+), 42 deletions(-)
 create mode 100644 contrib/unaccent/expected/unaccent_1.out

diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 666b07ccec..85ce9c3b64 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -3,7 +3,9 @@
  * and a Unicode-aware locale.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 433e985349..60ddebd841 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -3,7 +3,9 @@
  * and a Unicode-aware locale.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index d068000b42..e9c504e764 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -4,7 +4,9 @@
  */
 
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/unaccent/expected/unaccent.out b/contrib/unaccent/expected/unaccent.out
index ee0ac71a1c..cef98ee60c 100644
--- a/contrib/unaccent/expected/unaccent.out
+++ b/contrib/unaccent/expected/unaccent.out
@@ -1,3 +1,12 @@
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
+\endif
 CREATE EXTENSION unaccent;
 -- must have a UTF8 database
 SELECT getdatabaseencoding();
diff --git a/contrib/unaccent/expected/unaccent_1.out b/contrib/unaccent/expected/unaccent_1.out
new file mode 100644
index 0000000000..0a4a3838ab
--- /dev/null
+++ b/contrib/unaccent/expected/unaccent_1.out
@@ -0,0 +1,8 @@
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index 3fc0c706be..027dfb964a 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -1,3 +1,14 @@
+
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
 CREATE EXTENSION unaccent;
 
 -- must have a UTF8 database
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 5b2bdac101..4f37386ea3 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -89,10 +89,28 @@ PostgreSQL documentation
    and character set encoding. These can also be set separately for each
    database when it is created. <command>initdb</command> determines those
    settings for the template databases, which will serve as the default for
-   all other databases.  By default, <command>initdb</command> uses the
-   locale provider <literal>libc</literal>, takes the locale settings from
-   the environment, and determines the encoding from the locale settings.
-   This is almost always sufficient, unless there are special requirements.
+   all other databases.
+  </para>
+
+  <para>
+   By default, <command>initdb</command> uses the ICU library to provide
+   locale services if the server was built with ICU support; otherwise it uses
+   the <literal>libc</literal> locale provider (see <xref
+   linkend="locale-providers"/>). To choose the specific ICU locale ID to
+   apply, use the option <option>--icu-locale</option>.  Note that for
+   implementation reasons and to support legacy code,
+   <command>initdb</command> will still select and initialize libc locale
+   settings when the ICU locale provider is used.
+  </para>
+
+  <para>
+   Alternatively, <command>initdb</command> can use the locale provider
+   <literal>libc</literal>. To select this option, specify
+   <literal>--locale-provider=libc</literal>, or build the server without ICU
+   support. The <literal>libc</literal> locale provider takes the locale
+   settings from the environment, and determines the encoding from the locale
+   settings.  This is almost always sufficient, unless there are special
+   requirements.
   </para>
 
   <para>
@@ -103,17 +121,6 @@ PostgreSQL documentation
    categories can give nonsensical results, so this should be used with care.
   </para>
 
-  <para>
-   Alternatively, the ICU library can be used to provide locale services.
-   (Again, this only sets the default for subsequently created databases.)  To
-   select this option, specify <literal>--locale-provider=icu</literal>.
-   To choose the specific ICU locale ID to apply, use the option
-   <option>--icu-locale</option>.  Note that
-   for implementation reasons and to support legacy code,
-   <command>initdb</command> will still select and initialize libc locale
-   settings when the ICU locale provider is used.
-  </para>
-
   <para>
    When <command>initdb</command> runs, it will print out the locale settings
    it has chosen.  If you have complex requirements or specified multiple
@@ -234,7 +241,8 @@ PostgreSQL documentation
       <term><option>--icu-locale=<replaceable>locale</replaceable></option></term>
       <listitem>
        <para>
-        Specifies the ICU locale ID, if the ICU locale provider is used.
+        Specifies the ICU locale ID, if the ICU locale provider is used. By
+        default, ICU obtains the ICU locale from the ICU default collator.
        </para>
       </listitem>
      </varlistentry>
@@ -297,10 +305,12 @@ PostgreSQL documentation
       <term><option>--locale-provider={<literal>libc</literal>|<literal>icu</literal>}</option></term>
       <listitem>
        <para>
-        This option sets the locale provider for databases created in the
-        new cluster.  It can be overridden in the <command>CREATE
+        This option sets the locale provider for databases created in the new
+        cluster.  It can be overridden in the <command>CREATE
         DATABASE</command> command when new databases are subsequently
-        created.  The default is <literal>libc</literal>.
+        created.  The default is <literal>icu</literal> if the server was
+        built with ICU support; otherwise the default is
+        <literal>libc</literal> (see <xref linkend="locale-providers"/>).
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/bin/initdb/Makefile b/src/bin/initdb/Makefile
index eab89c5501..d69bd89572 100644
--- a/src/bin/initdb/Makefile
+++ b/src/bin/initdb/Makefile
@@ -16,7 +16,7 @@ subdir = src/bin/initdb
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
+override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(ICU_CFLAGS) $(CPPFLAGS)
 
 # Note: it's important that we link to encnames.o from libpgcommon, not
 # from libpq, else we have risks of version skew if we run with a libpq
@@ -24,7 +24,7 @@ override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
 # should ensure that that happens.
 #
 # We need libpq only because fe_utils does.
-LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) $(ICU_LIBS)
 
 # use system timezone data?
 ifneq (,$(with_system_tzdata))
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..0776294499 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -53,6 +53,9 @@
 #include <netdb.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
 #include <unistd.h>
 #include <signal.h>
 #include <time.h>
@@ -133,7 +136,11 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
+#ifdef USE_ICU
+static char locale_provider = COLLPROVIDER_ICU;
+#else
 static char locale_provider = COLLPROVIDER_LIBC;
+#endif
 static char *icu_locale = NULL;
 static const char *default_text_search_config = NULL;
 static char *username = NULL;
@@ -2024,6 +2031,72 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+/*
+ * Check that ICU accepts the locale name; or if not specified, retrieve the
+ * default ICU locale.
+ */
+static void
+check_icu_locale()
+{
+#ifdef USE_ICU
+	UCollator	*collator;
+	UErrorCode   status;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+	{
+		if (icu_locale)
+			pg_fatal("could not open collator for locale \"%s\": %s",
+					 icu_locale, u_errorName(status));
+		else
+			pg_fatal("could not open collator for default locale: %s",
+					 u_errorName(status));
+	}
+
+	/* if not specified, get locale from default collator */
+	if (icu_locale == NULL)
+	{
+		const char	*default_locale;
+
+		status = U_ZERO_ERROR;
+		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+											  &status);
+		if (U_FAILURE(status))
+		{
+			ucol_close(collator);
+			pg_fatal("could not determine default ICU locale");
+		}
+
+		if (U_ICU_VERSION_MAJOR_NUM >= 54)
+		{
+			const bool	 strict = true;
+			char		*langtag;
+			int			 len;
+
+			len = uloc_toLanguageTag(default_locale, NULL, 0, strict, &status);
+			langtag = pg_malloc(len + 1);
+			status = U_ZERO_ERROR;
+			uloc_toLanguageTag(default_locale, langtag, len + 1, strict,
+							   &status);
+
+			if (U_FAILURE(status))
+			{
+				ucol_close(collator);
+				pg_fatal("could not determine language tag for default locale \"%s\": %s",
+						 default_locale, u_errorName(status));
+			}
+
+			icu_locale = langtag;
+		}
+		else
+			icu_locale = pg_strdup(default_locale);
+	}
+
+	ucol_close(collator);
+#endif
+}
+
 /*
  * set up the locale variables
  *
@@ -2077,8 +2150,7 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
-		if (!icu_locale)
-			pg_fatal("ICU locale must be specified");
+		check_icu_locale();
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 772769acab..e5d214e09c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -97,11 +97,6 @@ SKIP:
 
 if ($ENV{with_icu} eq 'yes')
 {
-	command_fails_like(
-		[ 'initdb', '--no-sync', '--locale-provider=icu', "$tempdir/data2" ],
-		qr/initdb: error: ICU locale must be specified/,
-		'locale provider ICU requires --icu-locale');
-
 	command_ok(
 		[
 			'initdb',                '--no-sync',
@@ -116,7 +111,7 @@ if ($ENV{with_icu} eq 'yes')
 			'--locale-provider=icu', '--icu-locale=@colNumeric=lower',
 			"$tempdir/dataX"
 		],
-		qr/FATAL:  could not open collator for locale/,
+		qr/error: could not open collator for locale/,
 		'fails for invalid ICU locale');
 
 	command_fails_like(
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 72b19ee6cd..0808585866 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1758,7 +1758,7 @@ my %tests = (
 		create_sql =>
 		  "CREATE DATABASE dump_test2 LOCALE = 'C' TEMPLATE = template0;",
 		regexp => qr/^
-			\QCREATE DATABASE dump_test2 \E.*\QLOCALE = 'C';\E
+			\QCREATE DATABASE dump_test2 \E.*\QLOCALE = 'C'\E
 			/xm,
 		like => { pg_dumpall_dbprivs => 1, },
 	},
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 3ad4fbb00c..8ec58cdd64 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -13,7 +13,7 @@ program_version_ok('createdb');
 program_options_handling_ok('createdb');
 
 my $node = PostgreSQL::Test::Cluster->new('main');
-$node->init;
+$node->init(extra => ['--locale-provider=libc']);
 $node->start;
 
 $node->issues_sql_like(
diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index d7a7d1d1ca..cf841a3a5b 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -14,9 +14,6 @@ override CPPFLAGS := \
 	'-DSHELLPROG="$(SHELL)"' \
 	$(CPPFLAGS)
 
-# default encoding for regression tests
-ENCODING = SQL_ASCII
-
 ifneq ($(build_os),mingw32)
 abs_builddir := $(shell pwd)
 else
diff --git a/src/interfaces/ecpg/test/connect/test5.pgc b/src/interfaces/ecpg/test/connect/test5.pgc
index de29160089..d512553677 100644
--- a/src/interfaces/ecpg/test/connect/test5.pgc
+++ b/src/interfaces/ecpg/test/connect/test5.pgc
@@ -55,7 +55,7 @@ exec sql end declare section;
 	exec sql connect to 'unix:postgresql://localhost/ecpg2_regression' as main user :user USING "connectpw";
 	exec sql disconnect main;
 
-	exec sql connect to unix:postgresql://localhost/ecpg2_regression?connect_timeout=180&client_encoding=latin1 as main user regress_ecpg_user1/connectpw;
+	exec sql connect to unix:postgresql://localhost/ecpg2_regression?connect_timeout=180&client_encoding=sql_ascii as main user regress_ecpg_user1/connectpw;
 	exec sql disconnect main;
 
 	exec sql connect to "unix:postgresql://200.46.204.71/ecpg2_regression" as main user regress_ecpg_user1/connectpw;
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.c b/src/interfaces/ecpg/test/expected/connect-test5.c
index c1124c627f..ec1514ed9a 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.c
+++ b/src/interfaces/ecpg/test/expected/connect-test5.c
@@ -117,7 +117,7 @@ main(void)
 #line 56 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=latin1" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=sql_ascii" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
 #line 58 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.stderr b/src/interfaces/ecpg/test/expected/connect-test5.stderr
index 01a6a0a13b..51cc18916a 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.stderr
+++ b/src/interfaces/ecpg/test/expected/connect-test5.stderr
@@ -50,7 +50,7 @@
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: opening database ecpg2_regression on <DEFAULT> port <DEFAULT> with options connect_timeout=180 & client_encoding=latin1 for user regress_ecpg_user1
+[NO_PID]: ECPGconnect: opening database ecpg2_regression on <DEFAULT> port <DEFAULT> with options connect_timeout=180 & client_encoding=sql_ascii for user regress_ecpg_user1
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 00000
diff --git a/src/interfaces/ecpg/test/meson.build b/src/interfaces/ecpg/test/meson.build
index d0be73ccf9..04c6819a79 100644
--- a/src/interfaces/ecpg/test/meson.build
+++ b/src/interfaces/ecpg/test/meson.build
@@ -69,7 +69,6 @@ ecpg_test_files = files(
 ecpg_regress_args = [
   '--dbname=ecpg1_regression,ecpg2_regression',
   '--create-role=regress_ecpg_user1,regress_ecpg_user2',
-  '--encoding=SQL_ASCII',
 ]
 
 tests += {
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 80ab1c7789..45d77c319a 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -12,7 +12,7 @@ if ($ENV{with_icu} ne 'yes')
 }
 
 my $node1 = PostgreSQL::Test::Cluster->new('node1');
-$node1->init;
+$node1->init(extra => ['--locale-provider=libc']);
 $node1->start;
 
 $node1->safe_psql('postgres',
-- 
2.34.1

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#40)

Re: Move defaults toward ICU in 16?

On Fri, 2023-02-24 at 15:54 -0800, Jeff Davis wrote:

0001: default autoconf to build with ICU (meson already uses
'auto')

What's the best way to prepare for the impact of this on the buildfarm?
How should we migrate to using --without-icu for those animals not
currently specifying --with-icu?

Regards,
Jeff Davis

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Jeff Davis (#41)

Re: Move defaults toward ICU in 16?

Jeff Davis <pgsql@j-davis.com> writes:

On Fri, 2023-02-24 at 15:54 -0800, Jeff Davis wrote:

0001: default autoconf to build with ICU (meson already uses
'auto')

What's the best way to prepare for the impact of this on the buildfarm?
How should we migrate to using --without-icu for those animals not
currently specifying --with-icu?

Tell the buildfarm owners to add --without-icu to their config if
they don't have and don't want to install ICU. Wait a couple weeks.
Commit, then nag the owners whose machines turn red.

regards, tom lane

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#40)

Re: Move defaults toward ICU in 16?

On 25.02.23 00:54, Jeff Davis wrote:

On Fri, 2023-02-17 at 15:07 -0800, Jeff Davis wrote:

2. Update the pg_database entry for template0. This has less
potential
for surprise in case people are actually using template0 for a
template.

New patches attached.

0001: default autoconf to build with ICU (meson already uses 'auto')

I would skip this. There was a brief discussion about this at [0]/messages/by-id/534fed4a262fee534662bd07a691c5ef@postgrespro.ru,
where I pointed out that if we are going to do something like that,
there would be other candidates among the optional dependencies to
promote, such as certainly openssl and arguably lz4. If we don't do
this consistently across dependencies, then there will be confusion.

In practice, I don't think it matters. Almost all installations are
made by packagers, who will make their own choices. Flipping the
default in configure is only going to cause some amount of confusion and
annoyance in some places, but won't actually have the ostensibly desired
impact in practice.

[0]: /messages/by-id/534fed4a262fee534662bd07a691c5ef@postgrespro.ru
/messages/by-id/534fed4a262fee534662bd07a691c5ef@postgrespro.ru

0002: update template0 in new cluster (as described above)

This makes sense. I'm confused what the code originally wanted to
achieve, e.g.,

-/*
- * Check that every database that already exists in the new cluster is
- * compatible with the corresponding database in the old one.
- */
-static void
-check_databases_are_compatible(void)

Was there once support for the new cluster having additional databases
in place? Weird.

In any case, I think you can remove additional code from get_db_infos()
related to fields that are no longer used, such as db_encoding, and the
corresponding struct fields in DbInfo.

0003: default initdb to use ICU

What are the changes in the citext tests about? Is it the same issue as
in unaccent? In that case, the OR should be an AND? Maybe add a comment?

Why is unaccent is "broken" if the default collation is provided by ICU
and LC_CTYPE=C? Is that a general problem? Should we prevent this
combination?

What are the changes in the ecpg tests about? Looks harmless, but if
there is a need, maybe it should be commented somewhere, otherwise what
prevents someone from changing it back?

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#43)

3 attachment(s)

Re: Move defaults toward ICU in 16?

On Thu, 2023-03-02 at 10:37 +0100, Peter Eisentraut wrote:

I would skip this. There was a brief discussion about this at [0],
where I pointed out that if we are going to do something like that,
there would be other candidates among the optional dependencies to
promote, such as certainly openssl and arguably lz4. If we don't do
this consistently across dependencies, then there will be confusion.

The difference is that ICU affects semantics of collations, and
collations are not really an optional feature. If postgres is built
without ICU, that will affect the default at initdb time (after patch
003, anyway), which will then affect the default collations in all
databases.

In practice, I don't think it matters. Almost all installations are
made by packagers, who will make their own choices.

Mostly true, but the discussion at [0] reveals that some people do
build postgresql themselves for whatever reason.

When I first started out with postgres I always built from source. That
was quite a while ago, so maybe that means nothing; but it would be sad
to think that the build-it-yourself experience doesn't matter.

0002: update template0 in new cluster (as described above)

This makes sense. I'm confused what the code originally wanted to
achieve, e.g.,

-/*
- * Check that every database that already exists in the new cluster
is
- * compatible with the corresponding database in the old one.
- */
-static void
-check_databases_are_compatible(void)

Was there once support for the new cluster having additional
databases
in place? Weird.

It looks like 33755e8edf was the last significant change here. CC
Heikki for comment.

In any case, I think you can remove additional code from
get_db_infos()
related to fields that are no longer used, such as db_encoding, and
the
corresponding struct fields in DbInfo.

You're right: there's not much of an intersection between the code that
needs a DbInfo and the code that needs the locale fields. I created a
separate DbLocaleInfo struct for the template0 locale information, and
removed the locale fields from DbInfo.

I also added a TAP test.

0003: default initdb to use ICU

What are the changes in the citext tests about?

There's a test in citext_utf8.sql for:

SELECT 'i'::citext = 'İ'::citext AS t;

citext_eq uses DEFAULT_COLLATION_OID, comparing the results after
applying lower(). Apparently:

lower('İ' collate "en_US") = 'i' -- true
lower('İ' collate "tr-TR-x-icu") = 'i' -- true
lower('İ' collate "en-US-x-icu") = 'i' -- false

the test was passing before because it seems to be true for all libc
locales. But for ICU, it seems to only be true in the "tr-TR" locale at
colstrength=secondary (and true for other ICU locales at
colstrength=primary).

We can't fix the test by being explicit about the collation, because
citext doesn't pass it down; it always uses the default collation. We
could fix citext to pass it down properly, but that seems like a
different patch.

In any case, citext doesn't seem very important to those using ICU (we
have a doc note suggesting ICU instead), so I don't see a strong reason
to test the combination. So, I just exit the test early if it's ICU. I
added a better comment.

Is it the same issue as
in unaccent? In that case, the OR should be an AND? Maybe add a
comment?

Why is unaccent is "broken" if the default collation is provided by
ICU
and LC_CTYPE=C? Is that a general problem? Should we prevent this
combination?

A different issue: unaccent is calling t_isspace(), which is just not
properly handling locales. t_isspace() always passes NULL as the last
argument to char2wchar. There are TODO comments throughout that file.

Specifically what happens:
lc_ctype_is_c(DEFAULT_COLLATION_OID) returns false
so it calls char2wchar(), which calls mbstowcs()
which returns an error because the LC_CTYPE=C

Right now, that's a longstanding issue for all users of t_isspace() and
related functions: tsearch, ltree, pg_trgm, dict_xsyn, and unaccent. I
assume it was known and considered unimportant, otherwise we wouldn't
have left the TODO comments in there.

I believe it's only a problem when the provider is ICU and the LC_CTYPE
is C. I think a quick fix would be to just test LC_CTYPE directly (from
the environment or setlocale(LC_CTYPE, NULL)) rather than try to
extract it from the default collation. It sounds like a separate patch,
and should be handled as a bugfix and backported.

A better fix would be to support character classification in ICU. I
don't think that's hard, but ICU has quite a few options, and we'd need
to discuss which are the right ones to support. We may also want to
pass collation information down rather than just using the database
default, but that may depend on the caller and we should discuss that,
as well.

What are the changes in the ecpg tests about? Looks harmless, but if
there is a need, maybe it should be commented somewhere, otherwise
what
prevents someone from changing it back?

ICU is not compatible with SQL_ASCII, so I had to remove the
ENCODING=SQL_ASCII line from the ecpg test build. CC Michael Meskes who
added the line in 1fa6be6f69 in case he has a comment.

But when I did that, I got CI failures on windows because it couldn't
transcode between LATIN1 and WIN1252. So I changed the ecpg test to
just use SQL_ASCII for the client_encoding (not the server encoding).
Michael Meskes added the client_encoding parameter test in 5e7710e725,
so he might have a comment about that as well.

Since I removed the code, I didn't see a clear place to add a comment,
but if you have a suggestion I'll take it.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v5-0003-Use-ICU-by-default-at-initdb-time.patchtext/x-patch; charset=UTF-8; name=v5-0003-Use-ICU-by-default-at-initdb-time.patchDownload

From 43bf9a98620c9c91a65b78f9f929a50c3d11efc5 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 8 Feb 2023 12:06:26 -0800
Subject: [PATCH v5 3/3] Use ICU by default at initdb time.

If the ICU locale is not specified, initialize the default collator
and retrieve the locale name from that.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 contrib/citext/expected/citext_utf8.out       |  9 ++-
 contrib/citext/expected/citext_utf8_1.out     |  9 ++-
 contrib/citext/sql/citext_utf8.sql            |  9 ++-
 contrib/unaccent/expected/unaccent.out        |  9 +++
 contrib/unaccent/expected/unaccent_1.out      |  8 ++
 contrib/unaccent/sql/unaccent.sql             | 11 +++
 doc/src/sgml/ref/initdb.sgml                  | 48 +++++++-----
 src/bin/initdb/Makefile                       |  4 +-
 src/bin/initdb/initdb.c                       | 76 ++++++++++++++++++-
 src/bin/initdb/t/001_initdb.pl                |  7 +-
 src/bin/pg_dump/t/002_pg_dump.pl              |  2 +-
 src/bin/scripts/t/020_createdb.pl             |  2 +-
 src/interfaces/ecpg/test/Makefile             |  3 -
 src/interfaces/ecpg/test/connect/test5.pgc    |  2 +-
 .../ecpg/test/expected/connect-test5.c        |  2 +-
 .../ecpg/test/expected/connect-test5.stderr   |  2 +-
 src/interfaces/ecpg/test/meson.build          |  1 -
 src/test/icu/t/010_database.pl                |  2 +-
 18 files changed, 164 insertions(+), 42 deletions(-)
 create mode 100644 contrib/unaccent/expected/unaccent_1.out

diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 666b07ccec..77b4586d8f 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -1,9 +1,16 @@
 /*
  * This test must be run in a database with UTF-8 encoding
  * and a Unicode-aware locale.
+ *
+ * Also disable this file for ICU, because the test for the the
+ * Turkish dotted I is not correct for many ICU locales. citext always
+ * uses the default collation, so it's not easy to restrict the test
+ * to the "tr-TR-x-icu" collation where it will succeed.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index 433e985349..d1e1fe1a9d 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -1,9 +1,16 @@
 /*
  * This test must be run in a database with UTF-8 encoding
  * and a Unicode-aware locale.
+ *
+ * Also disable this file for ICU, because the test for the the
+ * Turkish dotted I is not correct for many ICU locales. citext always
+ * uses the default collation, so it's not easy to restrict the test
+ * to the "tr-TR-x-icu" collation where it will succeed.
  */
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index d068000b42..8530c68dd7 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -1,10 +1,17 @@
 /*
  * This test must be run in a database with UTF-8 encoding
  * and a Unicode-aware locale.
+ *
+ * Also disable this file for ICU, because the test for the the
+ * Turkish dotted I is not correct for many ICU locales. citext always
+ * uses the default collation, so it's not easy to restrict the test
+ * to the "tr-TR-x-icu" collation where it will succeed.
  */
 
 SELECT getdatabaseencoding() <> 'UTF8' OR
-       current_setting('lc_ctype') = 'C'
+       current_setting('lc_ctype') = 'C' OR
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
        AS skip_test \gset
 \if :skip_test
 \quit
diff --git a/contrib/unaccent/expected/unaccent.out b/contrib/unaccent/expected/unaccent.out
index ee0ac71a1c..cef98ee60c 100644
--- a/contrib/unaccent/expected/unaccent.out
+++ b/contrib/unaccent/expected/unaccent.out
@@ -1,3 +1,12 @@
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
+\endif
 CREATE EXTENSION unaccent;
 -- must have a UTF8 database
 SELECT getdatabaseencoding();
diff --git a/contrib/unaccent/expected/unaccent_1.out b/contrib/unaccent/expected/unaccent_1.out
new file mode 100644
index 0000000000..0a4a3838ab
--- /dev/null
+++ b/contrib/unaccent/expected/unaccent_1.out
@@ -0,0 +1,8 @@
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/unaccent/sql/unaccent.sql b/contrib/unaccent/sql/unaccent.sql
index 3fc0c706be..027dfb964a 100644
--- a/contrib/unaccent/sql/unaccent.sql
+++ b/contrib/unaccent/sql/unaccent.sql
@@ -1,3 +1,14 @@
+
+-- unaccent is broken if the default collation is provided by ICU and
+-- LC_CTYPE=C
+SELECT current_setting('lc_ctype') = 'C' AND
+       (SELECT datlocprovider='i' FROM pg_database
+        WHERE datname=current_database())
+	AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
 CREATE EXTENSION unaccent;
 
 -- must have a UTF8 database
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 5b2bdac101..4f37386ea3 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -89,10 +89,28 @@ PostgreSQL documentation
    and character set encoding. These can also be set separately for each
    database when it is created. <command>initdb</command> determines those
    settings for the template databases, which will serve as the default for
-   all other databases.  By default, <command>initdb</command> uses the
-   locale provider <literal>libc</literal>, takes the locale settings from
-   the environment, and determines the encoding from the locale settings.
-   This is almost always sufficient, unless there are special requirements.
+   all other databases.
+  </para>
+
+  <para>
+   By default, <command>initdb</command> uses the ICU library to provide
+   locale services if the server was built with ICU support; otherwise it uses
+   the <literal>libc</literal> locale provider (see <xref
+   linkend="locale-providers"/>). To choose the specific ICU locale ID to
+   apply, use the option <option>--icu-locale</option>.  Note that for
+   implementation reasons and to support legacy code,
+   <command>initdb</command> will still select and initialize libc locale
+   settings when the ICU locale provider is used.
+  </para>
+
+  <para>
+   Alternatively, <command>initdb</command> can use the locale provider
+   <literal>libc</literal>. To select this option, specify
+   <literal>--locale-provider=libc</literal>, or build the server without ICU
+   support. The <literal>libc</literal> locale provider takes the locale
+   settings from the environment, and determines the encoding from the locale
+   settings.  This is almost always sufficient, unless there are special
+   requirements.
   </para>
 
   <para>
@@ -103,17 +121,6 @@ PostgreSQL documentation
    categories can give nonsensical results, so this should be used with care.
   </para>
 
-  <para>
-   Alternatively, the ICU library can be used to provide locale services.
-   (Again, this only sets the default for subsequently created databases.)  To
-   select this option, specify <literal>--locale-provider=icu</literal>.
-   To choose the specific ICU locale ID to apply, use the option
-   <option>--icu-locale</option>.  Note that
-   for implementation reasons and to support legacy code,
-   <command>initdb</command> will still select and initialize libc locale
-   settings when the ICU locale provider is used.
-  </para>
-
   <para>
    When <command>initdb</command> runs, it will print out the locale settings
    it has chosen.  If you have complex requirements or specified multiple
@@ -234,7 +241,8 @@ PostgreSQL documentation
       <term><option>--icu-locale=<replaceable>locale</replaceable></option></term>
       <listitem>
        <para>
-        Specifies the ICU locale ID, if the ICU locale provider is used.
+        Specifies the ICU locale ID, if the ICU locale provider is used. By
+        default, ICU obtains the ICU locale from the ICU default collator.
        </para>
       </listitem>
      </varlistentry>
@@ -297,10 +305,12 @@ PostgreSQL documentation
       <term><option>--locale-provider={<literal>libc</literal>|<literal>icu</literal>}</option></term>
       <listitem>
        <para>
-        This option sets the locale provider for databases created in the
-        new cluster.  It can be overridden in the <command>CREATE
+        This option sets the locale provider for databases created in the new
+        cluster.  It can be overridden in the <command>CREATE
         DATABASE</command> command when new databases are subsequently
-        created.  The default is <literal>libc</literal>.
+        created.  The default is <literal>icu</literal> if the server was
+        built with ICU support; otherwise the default is
+        <literal>libc</literal> (see <xref linkend="locale-providers"/>).
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/bin/initdb/Makefile b/src/bin/initdb/Makefile
index eab89c5501..d69bd89572 100644
--- a/src/bin/initdb/Makefile
+++ b/src/bin/initdb/Makefile
@@ -16,7 +16,7 @@ subdir = src/bin/initdb
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
+override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(ICU_CFLAGS) $(CPPFLAGS)
 
 # Note: it's important that we link to encnames.o from libpgcommon, not
 # from libpq, else we have risks of version skew if we run with a libpq
@@ -24,7 +24,7 @@ override CPPFLAGS := -I$(libpq_srcdir) -I$(top_srcdir)/src/timezone $(CPPFLAGS)
 # should ensure that that happens.
 #
 # We need libpq only because fe_utils does.
-LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport) $(ICU_LIBS)
 
 # use system timezone data?
 ifneq (,$(with_system_tzdata))
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 7a58c33ace..0776294499 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -53,6 +53,9 @@
 #include <netdb.h>
 #include <sys/socket.h>
 #include <sys/stat.h>
+#ifdef USE_ICU
+#include <unicode/ucol.h>
+#endif
 #include <unistd.h>
 #include <signal.h>
 #include <time.h>
@@ -133,7 +136,11 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
+#ifdef USE_ICU
+static char locale_provider = COLLPROVIDER_ICU;
+#else
 static char locale_provider = COLLPROVIDER_LIBC;
+#endif
 static char *icu_locale = NULL;
 static const char *default_text_search_config = NULL;
 static char *username = NULL;
@@ -2024,6 +2031,72 @@ check_icu_locale_encoding(int user_enc)
 	return true;
 }
 
+/*
+ * Check that ICU accepts the locale name; or if not specified, retrieve the
+ * default ICU locale.
+ */
+static void
+check_icu_locale()
+{
+#ifdef USE_ICU
+	UCollator	*collator;
+	UErrorCode   status;
+
+	status = U_ZERO_ERROR;
+	collator = ucol_open(icu_locale, &status);
+	if (U_FAILURE(status))
+	{
+		if (icu_locale)
+			pg_fatal("could not open collator for locale \"%s\": %s",
+					 icu_locale, u_errorName(status));
+		else
+			pg_fatal("could not open collator for default locale: %s",
+					 u_errorName(status));
+	}
+
+	/* if not specified, get locale from default collator */
+	if (icu_locale == NULL)
+	{
+		const char	*default_locale;
+
+		status = U_ZERO_ERROR;
+		default_locale = ucol_getLocaleByType(collator, ULOC_VALID_LOCALE,
+											  &status);
+		if (U_FAILURE(status))
+		{
+			ucol_close(collator);
+			pg_fatal("could not determine default ICU locale");
+		}
+
+		if (U_ICU_VERSION_MAJOR_NUM >= 54)
+		{
+			const bool	 strict = true;
+			char		*langtag;
+			int			 len;
+
+			len = uloc_toLanguageTag(default_locale, NULL, 0, strict, &status);
+			langtag = pg_malloc(len + 1);
+			status = U_ZERO_ERROR;
+			uloc_toLanguageTag(default_locale, langtag, len + 1, strict,
+							   &status);
+
+			if (U_FAILURE(status))
+			{
+				ucol_close(collator);
+				pg_fatal("could not determine language tag for default locale \"%s\": %s",
+						 default_locale, u_errorName(status));
+			}
+
+			icu_locale = langtag;
+		}
+		else
+			icu_locale = pg_strdup(default_locale);
+	}
+
+	ucol_close(collator);
+#endif
+}
+
 /*
  * set up the locale variables
  *
@@ -2077,8 +2150,7 @@ setlocales(void)
 
 	if (locale_provider == COLLPROVIDER_ICU)
 	{
-		if (!icu_locale)
-			pg_fatal("ICU locale must be specified");
+		check_icu_locale();
 
 		/*
 		 * In supported builds, the ICU locale ID will be checked by the
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 772769acab..e5d214e09c 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -97,11 +97,6 @@ SKIP:
 
 if ($ENV{with_icu} eq 'yes')
 {
-	command_fails_like(
-		[ 'initdb', '--no-sync', '--locale-provider=icu', "$tempdir/data2" ],
-		qr/initdb: error: ICU locale must be specified/,
-		'locale provider ICU requires --icu-locale');
-
 	command_ok(
 		[
 			'initdb',                '--no-sync',
@@ -116,7 +111,7 @@ if ($ENV{with_icu} eq 'yes')
 			'--locale-provider=icu', '--icu-locale=@colNumeric=lower',
 			"$tempdir/dataX"
 		],
-		qr/FATAL:  could not open collator for locale/,
+		qr/error: could not open collator for locale/,
 		'fails for invalid ICU locale');
 
 	command_fails_like(
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 187e4b8d07..9c354213ce 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1758,7 +1758,7 @@ my %tests = (
 		create_sql =>
 		  "CREATE DATABASE dump_test2 LOCALE = 'C' TEMPLATE = template0;",
 		regexp => qr/^
-			\QCREATE DATABASE dump_test2 \E.*\QLOCALE = 'C';\E
+			\QCREATE DATABASE dump_test2 \E.*\QLOCALE = 'C'\E
 			/xm,
 		like => { pg_dumpall_dbprivs => 1, },
 	},
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 3ad4fbb00c..8ec58cdd64 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -13,7 +13,7 @@ program_version_ok('createdb');
 program_options_handling_ok('createdb');
 
 my $node = PostgreSQL::Test::Cluster->new('main');
-$node->init;
+$node->init(extra => ['--locale-provider=libc']);
 $node->start;
 
 $node->issues_sql_like(
diff --git a/src/interfaces/ecpg/test/Makefile b/src/interfaces/ecpg/test/Makefile
index d7a7d1d1ca..cf841a3a5b 100644
--- a/src/interfaces/ecpg/test/Makefile
+++ b/src/interfaces/ecpg/test/Makefile
@@ -14,9 +14,6 @@ override CPPFLAGS := \
 	'-DSHELLPROG="$(SHELL)"' \
 	$(CPPFLAGS)
 
-# default encoding for regression tests
-ENCODING = SQL_ASCII
-
 ifneq ($(build_os),mingw32)
 abs_builddir := $(shell pwd)
 else
diff --git a/src/interfaces/ecpg/test/connect/test5.pgc b/src/interfaces/ecpg/test/connect/test5.pgc
index de29160089..d512553677 100644
--- a/src/interfaces/ecpg/test/connect/test5.pgc
+++ b/src/interfaces/ecpg/test/connect/test5.pgc
@@ -55,7 +55,7 @@ exec sql end declare section;
 	exec sql connect to 'unix:postgresql://localhost/ecpg2_regression' as main user :user USING "connectpw";
 	exec sql disconnect main;
 
-	exec sql connect to unix:postgresql://localhost/ecpg2_regression?connect_timeout=180&client_encoding=latin1 as main user regress_ecpg_user1/connectpw;
+	exec sql connect to unix:postgresql://localhost/ecpg2_regression?connect_timeout=180&client_encoding=sql_ascii as main user regress_ecpg_user1/connectpw;
 	exec sql disconnect main;
 
 	exec sql connect to "unix:postgresql://200.46.204.71/ecpg2_regression" as main user regress_ecpg_user1/connectpw;
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.c b/src/interfaces/ecpg/test/expected/connect-test5.c
index c1124c627f..ec1514ed9a 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.c
+++ b/src/interfaces/ecpg/test/expected/connect-test5.c
@@ -117,7 +117,7 @@ main(void)
 #line 56 "test5.pgc"
 
 
-	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=latin1" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
+	{ ECPGconnect(__LINE__, 0, "unix:postgresql://localhost/ecpg2_regression?connect_timeout=180 & client_encoding=sql_ascii" , "regress_ecpg_user1" , "connectpw" , "main", 0); }
 #line 58 "test5.pgc"
 
 	{ ECPGdisconnect(__LINE__, "main");}
diff --git a/src/interfaces/ecpg/test/expected/connect-test5.stderr b/src/interfaces/ecpg/test/expected/connect-test5.stderr
index 01a6a0a13b..51cc18916a 100644
--- a/src/interfaces/ecpg/test/expected/connect-test5.stderr
+++ b/src/interfaces/ecpg/test/expected/connect-test5.stderr
@@ -50,7 +50,7 @@
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 00000
-[NO_PID]: ECPGconnect: opening database ecpg2_regression on <DEFAULT> port <DEFAULT> with options connect_timeout=180 & client_encoding=latin1 for user regress_ecpg_user1
+[NO_PID]: ECPGconnect: opening database ecpg2_regression on <DEFAULT> port <DEFAULT> with options connect_timeout=180 & client_encoding=sql_ascii for user regress_ecpg_user1
 [NO_PID]: sqlca: code: 0, state: 00000
 [NO_PID]: ecpg_finish: connection main closed
 [NO_PID]: sqlca: code: 0, state: 00000
diff --git a/src/interfaces/ecpg/test/meson.build b/src/interfaces/ecpg/test/meson.build
index d0be73ccf9..04c6819a79 100644
--- a/src/interfaces/ecpg/test/meson.build
+++ b/src/interfaces/ecpg/test/meson.build
@@ -69,7 +69,6 @@ ecpg_test_files = files(
 ecpg_regress_args = [
   '--dbname=ecpg1_regression,ecpg2_regression',
   '--create-role=regress_ecpg_user1,regress_ecpg_user2',
-  '--encoding=SQL_ASCII',
 ]
 
 tests += {
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 80ab1c7789..45d77c319a 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -12,7 +12,7 @@ if ($ENV{with_icu} ne 'yes')
 }
 
 my $node1 = PostgreSQL::Test::Cluster->new('node1');
-$node1->init;
+$node1->init(extra => ['--locale-provider=libc']);
 $node1->start;
 
 $node1->safe_psql('postgres',
-- 
2.34.1

v5-0002-pg_upgrade-copy-locale-and-encoding-information-t.patchtext/x-patch; charset=UTF-8; name=v5-0002-pg_upgrade-copy-locale-and-encoding-information-t.patchDownload

From 93dc706c706bc9799ec623c7b6bc32809e34bead Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 24 Feb 2023 08:55:11 -0800
Subject: [PATCH v5 2/3] pg_upgrade: copy locale and encoding information to
 new cluster.

Previously, pg_upgrade checked that the old and new clusters were
compatible, including the locale and encoding. But the new cluster was
just created, and only template0 from the new cluster will be
preserved (template1 and postgres are both recreated during the
upgrade process).

Because template0 is not sensitive to locale or encoding, just update
the pg_database entry to be the same as template0 from the old
cluster.

This commit makes it easier to change the default initdb locale or
encoding settings without causing needless incompatibilities.
---
 src/bin/pg_upgrade/Makefile            |   2 +
 src/bin/pg_upgrade/check.c             | 160 -------------------------
 src/bin/pg_upgrade/info.c              |  69 ++++++++---
 src/bin/pg_upgrade/meson.build         |   1 +
 src/bin/pg_upgrade/pg_upgrade.c        |  62 ++++++++++
 src/bin/pg_upgrade/pg_upgrade.h        |  12 +-
 src/bin/pg_upgrade/t/002_pg_upgrade.pl |  56 ++++++++-
 7 files changed, 180 insertions(+), 182 deletions(-)

diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index 7f8042f34a..5834513add 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -51,6 +51,8 @@ clean distclean maintainer-clean:
 	rm -rf delete_old_cluster.sh log/ tmp_check/ \
 	       reindex_hash.sql
 
+export with_icu
+
 check:
 	$(prove_check)
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 7cf68dc9af..b71b00be37 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -16,9 +16,6 @@
 #include "pg_upgrade.h"
 
 static void check_new_cluster_is_empty(void);
-static void check_databases_are_compatible(void);
-static void check_locale_and_encoding(DbInfo *olddb, DbInfo *newdb);
-static bool equivalent_locale(int category, const char *loca, const char *locb);
 static void check_is_install_user(ClusterInfo *cluster);
 static void check_proper_datallowconn(ClusterInfo *cluster);
 static void check_for_prepared_transactions(ClusterInfo *cluster);
@@ -33,7 +30,6 @@ static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
 static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
 static void check_for_user_defined_encoding_conversions(ClusterInfo *cluster);
-static char *get_canonical_locale_name(int category, const char *locale);
 
 
 /*
@@ -194,7 +190,6 @@ check_new_cluster(void)
 	get_db_and_rel_infos(&new_cluster);
 
 	check_new_cluster_is_empty();
-	check_databases_are_compatible();
 
 	check_loadable_libraries();
 
@@ -349,94 +344,6 @@ check_cluster_compatibility(bool live_check)
 }
 
 
-/*
- * check_locale_and_encoding()
- *
- * Check that locale and encoding of a database in the old and new clusters
- * are compatible.
- */
-static void
-check_locale_and_encoding(DbInfo *olddb, DbInfo *newdb)
-{
-	if (olddb->db_encoding != newdb->db_encoding)
-		pg_fatal("encodings for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name,
-				 pg_encoding_to_char(olddb->db_encoding),
-				 pg_encoding_to_char(newdb->db_encoding));
-	if (!equivalent_locale(LC_COLLATE, olddb->db_collate, newdb->db_collate))
-		pg_fatal("lc_collate values for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name, olddb->db_collate, newdb->db_collate);
-	if (!equivalent_locale(LC_CTYPE, olddb->db_ctype, newdb->db_ctype))
-		pg_fatal("lc_ctype values for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name, olddb->db_ctype, newdb->db_ctype);
-	if (olddb->db_collprovider != newdb->db_collprovider)
-		pg_fatal("locale providers for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name,
-				 collprovider_name(olddb->db_collprovider),
-				 collprovider_name(newdb->db_collprovider));
-	if ((olddb->db_iculocale == NULL && newdb->db_iculocale != NULL) ||
-		(olddb->db_iculocale != NULL && newdb->db_iculocale == NULL) ||
-		(olddb->db_iculocale != NULL && newdb->db_iculocale != NULL && strcmp(olddb->db_iculocale, newdb->db_iculocale) != 0))
-		pg_fatal("ICU locale values for database \"%s\" do not match:  old \"%s\", new \"%s\"",
-				 olddb->db_name,
-				 olddb->db_iculocale ? olddb->db_iculocale : "(null)",
-				 newdb->db_iculocale ? newdb->db_iculocale : "(null)");
-}
-
-/*
- * equivalent_locale()
- *
- * Best effort locale-name comparison.  Return false if we are not 100% sure
- * the locales are equivalent.
- *
- * Note: The encoding parts of the names are ignored. This function is
- * currently used to compare locale names stored in pg_database, and
- * pg_database contains a separate encoding field. That's compared directly
- * in check_locale_and_encoding().
- */
-static bool
-equivalent_locale(int category, const char *loca, const char *locb)
-{
-	const char *chara;
-	const char *charb;
-	char	   *canona;
-	char	   *canonb;
-	int			lena;
-	int			lenb;
-
-	/*
-	 * If the names are equal, the locales are equivalent. Checking this first
-	 * avoids calling setlocale() in the common case that the names are equal.
-	 * That's a good thing, if setlocale() is buggy, for example.
-	 */
-	if (pg_strcasecmp(loca, locb) == 0)
-		return true;
-
-	/*
-	 * Not identical. Canonicalize both names, remove the encoding parts, and
-	 * try again.
-	 */
-	canona = get_canonical_locale_name(category, loca);
-	chara = strrchr(canona, '.');
-	lena = chara ? (chara - canona) : strlen(canona);
-
-	canonb = get_canonical_locale_name(category, locb);
-	charb = strrchr(canonb, '.');
-	lenb = charb ? (charb - canonb) : strlen(canonb);
-
-	if (lena == lenb && pg_strncasecmp(canona, canonb, lena) == 0)
-	{
-		pg_free(canona);
-		pg_free(canonb);
-		return true;
-	}
-
-	pg_free(canona);
-	pg_free(canonb);
-	return false;
-}
-
-
 static void
 check_new_cluster_is_empty(void)
 {
@@ -460,35 +367,6 @@ check_new_cluster_is_empty(void)
 	}
 }
 
-/*
- * Check that every database that already exists in the new cluster is
- * compatible with the corresponding database in the old one.
- */
-static void
-check_databases_are_compatible(void)
-{
-	int			newdbnum;
-	int			olddbnum;
-	DbInfo	   *newdbinfo;
-	DbInfo	   *olddbinfo;
-
-	for (newdbnum = 0; newdbnum < new_cluster.dbarr.ndbs; newdbnum++)
-	{
-		newdbinfo = &new_cluster.dbarr.dbs[newdbnum];
-
-		/* Find the corresponding database in the old cluster */
-		for (olddbnum = 0; olddbnum < old_cluster.dbarr.ndbs; olddbnum++)
-		{
-			olddbinfo = &old_cluster.dbarr.dbs[olddbnum];
-			if (strcmp(newdbinfo->db_name, olddbinfo->db_name) == 0)
-			{
-				check_locale_and_encoding(olddbinfo, newdbinfo);
-				break;
-			}
-		}
-	}
-}
-
 /*
  * A previous run of pg_upgrade might have failed and the new cluster
  * directory recreated, but they might have forgotten to remove
@@ -1524,41 +1402,3 @@ check_for_user_defined_encoding_conversions(ClusterInfo *cluster)
 	else
 		check_ok();
 }
-
-
-/*
- * get_canonical_locale_name
- *
- * Send the locale name to the system, and hope we get back a canonical
- * version.  This should match the backend's check_locale() function.
- */
-static char *
-get_canonical_locale_name(int category, const char *locale)
-{
-	char	   *save;
-	char	   *res;
-
-	/* get the current setting, so we can restore it. */
-	save = setlocale(category, NULL);
-	if (!save)
-		pg_fatal("failed to get the current locale");
-
-	/* 'save' may be pointing at a modifiable scratch variable, so copy it. */
-	save = pg_strdup(save);
-
-	/* set the locale with setlocale, to see if it accepts it. */
-	res = setlocale(category, locale);
-
-	if (!res)
-		pg_fatal("failed to get system locale name for \"%s\"", locale);
-
-	res = pg_strdup(res);
-
-	/* restore old value. */
-	if (!setlocale(category, save))
-		pg_fatal("failed to restore old locale \"%s\"", save);
-
-	pg_free(save);
-
-	return res;
-}
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index c1399c09b9..33b10aac3c 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -20,6 +20,7 @@ static void create_rel_filename_map(const char *old_data, const char *new_data,
 static void report_unmatched_relation(const RelInfo *rel, const DbInfo *db,
 									  bool is_new_db);
 static void free_db_and_rel_infos(DbInfoArr *db_arr);
+static void get_template0_info(ClusterInfo *cluster);
 static void get_db_infos(ClusterInfo *cluster);
 static void get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo);
 static void free_rel_infos(RelInfoArr *rel_arr);
@@ -278,6 +279,7 @@ get_db_and_rel_infos(ClusterInfo *cluster)
 	if (cluster->dbarr.dbs != NULL)
 		free_db_and_rel_infos(&cluster->dbarr);
 
+	get_template0_info(cluster);
 	get_db_infos(cluster);
 
 	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
@@ -293,6 +295,55 @@ get_db_and_rel_infos(ClusterInfo *cluster)
 }
 
 
+/*
+ * Get information about template0, which will be copied from the old cluster
+ * to the new cluster.
+ */
+static void
+get_template0_info(ClusterInfo *cluster)
+{
+	PGconn			*conn = connectToServer(cluster, "template1");
+	DbLocaleInfo	*locale;
+	PGresult		*dbres;
+	int				 i_datencoding;
+	int				 i_datlocprovider;
+	int				 i_datcollate;
+	int				 i_datctype;
+	int				 i_daticulocale;
+
+	dbres = executeQueryOrDie(conn,
+							  "SELECT encoding, datlocprovider, "
+							  "       datcollate, datctype, daticulocale "
+							  "FROM	pg_catalog.pg_database "
+							  "WHERE datname='template0'");
+
+	if (PQntuples(dbres) != 1)
+		pg_fatal("template0 not found");
+
+	locale = pg_malloc(sizeof(DbLocaleInfo));
+
+	i_datencoding = PQfnumber(dbres, "encoding");
+	i_datlocprovider = PQfnumber(dbres, "datlocprovider");
+	i_datcollate = PQfnumber(dbres, "datcollate");
+	i_datctype = PQfnumber(dbres, "datctype");
+	i_daticulocale = PQfnumber(dbres, "daticulocale");
+
+	locale->db_encoding = atoi(PQgetvalue(dbres, 0, i_datencoding));
+	locale->db_collprovider = PQgetvalue(dbres, 0, i_datlocprovider)[0];
+	locale->db_collate = pg_strdup(PQgetvalue(dbres, 0, i_datcollate));
+	locale->db_ctype = pg_strdup(PQgetvalue(dbres, 0, i_datctype));
+	if (PQgetisnull(dbres, 0, i_daticulocale))
+		locale->db_iculocale = NULL;
+	else
+		locale->db_iculocale = pg_strdup(PQgetvalue(dbres, 0, i_daticulocale));
+
+	cluster->template0 = locale;
+
+	PQclear(dbres);
+	PQfinish(conn);
+}
+
+
 /*
  * get_db_infos()
  *
@@ -309,11 +360,6 @@ get_db_infos(ClusterInfo *cluster)
 	DbInfo	   *dbinfos;
 	int			i_datname,
 				i_oid,
-				i_encoding,
-				i_datcollate,
-				i_datctype,
-				i_datlocprovider,
-				i_daticulocale,
 				i_spclocation;
 	char		query[QUERY_ALLOC];
 
@@ -337,11 +383,6 @@ get_db_infos(ClusterInfo *cluster)
 
 	i_oid = PQfnumber(res, "oid");
 	i_datname = PQfnumber(res, "datname");
-	i_encoding = PQfnumber(res, "encoding");
-	i_datcollate = PQfnumber(res, "datcollate");
-	i_datctype = PQfnumber(res, "datctype");
-	i_datlocprovider = PQfnumber(res, "datlocprovider");
-	i_daticulocale = PQfnumber(res, "daticulocale");
 	i_spclocation = PQfnumber(res, "spclocation");
 
 	ntups = PQntuples(res);
@@ -351,14 +392,6 @@ get_db_infos(ClusterInfo *cluster)
 	{
 		dbinfos[tupnum].db_oid = atooid(PQgetvalue(res, tupnum, i_oid));
 		dbinfos[tupnum].db_name = pg_strdup(PQgetvalue(res, tupnum, i_datname));
-		dbinfos[tupnum].db_encoding = atoi(PQgetvalue(res, tupnum, i_encoding));
-		dbinfos[tupnum].db_collate = pg_strdup(PQgetvalue(res, tupnum, i_datcollate));
-		dbinfos[tupnum].db_ctype = pg_strdup(PQgetvalue(res, tupnum, i_datctype));
-		dbinfos[tupnum].db_collprovider = PQgetvalue(res, tupnum, i_datlocprovider)[0];
-		if (PQgetisnull(res, tupnum, i_daticulocale))
-			dbinfos[tupnum].db_iculocale = NULL;
-		else
-			dbinfos[tupnum].db_iculocale = pg_strdup(PQgetvalue(res, tupnum, i_daticulocale));
 		snprintf(dbinfos[tupnum].db_tablespace, sizeof(dbinfos[tupnum].db_tablespace), "%s",
 				 PQgetvalue(res, tupnum, i_spclocation));
 	}
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index 18c27b4e72..12a97f84e2 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -38,6 +38,7 @@ tests += {
   'sd': meson.current_source_dir(),
   'bd': meson.current_build_dir(),
   'tap': {
+    'env': {'with_icu': icu.found() ? 'yes' : 'no'},
     'tests': [
       't/001_basic.pl',
       't/002_pg_upgrade.pl',
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index e5597d3105..8c6009151f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -51,6 +51,7 @@
 #include "fe_utils/string_utils.h"
 #include "pg_upgrade.h"
 
+static void set_locale_and_encoding(void);
 static void prepare_new_cluster(void);
 static void prepare_new_globals(void);
 static void create_new_objects(void);
@@ -139,6 +140,8 @@ main(int argc, char **argv)
 		   "Performing Upgrade\n"
 		   "------------------");
 
+	set_locale_and_encoding();
+
 	prepare_new_cluster();
 
 	stop_postmaster(false);
@@ -366,6 +369,65 @@ setup(char *argv0, bool *live_check)
 }
 
 
+/*
+ * Copy locale and encoding information into the new cluster's template0.
+ *
+ * We need to copy the encoding, datlocprovider, datcollate, datctype, and
+ * daticulocale. We don't need datcollversion because that's never set for
+ * template0.
+ */
+static void
+set_locale_and_encoding(void)
+{
+	PGconn		*conn_new_template1;
+	char		*datcollate_literal;
+	char		*datctype_literal;
+	char		*daticulocale_literal	= NULL;
+	DbLocaleInfo *locale = old_cluster.template0;
+
+	prep_status("Setting locale and encoding for new cluster");
+
+	/* escape literals with respect to new cluster */
+	conn_new_template1 = connectToServer(&new_cluster, "template1");
+
+	datcollate_literal = PQescapeLiteral(conn_new_template1,
+										 locale->db_collate,
+										 strlen(locale->db_collate));
+	datctype_literal = PQescapeLiteral(conn_new_template1,
+									   locale->db_ctype,
+									   strlen(locale->db_ctype));
+	if (locale->db_iculocale)
+		daticulocale_literal = PQescapeLiteral(conn_new_template1,
+											   locale->db_iculocale,
+											   strlen(locale->db_iculocale));
+	else
+		daticulocale_literal = pg_strdup("NULL");
+
+	/* update template0 in new cluster */
+	PQclear(executeQueryOrDie(conn_new_template1,
+							  "UPDATE pg_catalog.pg_database "
+							  "  SET encoding = %u, "
+							  "      datlocprovider = '%c', "
+							  "      datcollate = %s, "
+							  "      datctype = %s, "
+							  "      daticulocale = %s "
+							  "  WHERE datname = 'template0' ",
+							  locale->db_encoding,
+							  locale->db_collprovider,
+							  datcollate_literal,
+							  datctype_literal,
+							  daticulocale_literal));
+
+	PQfreemem(datcollate_literal);
+	PQfreemem(datctype_literal);
+	PQfreemem(daticulocale_literal);
+
+	PQfinish(conn_new_template1);
+
+	check_ok();
+}
+
+
 static void
 prepare_new_cluster(void)
 {
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 5f2a116f23..3eea0139c7 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -175,13 +175,20 @@ typedef struct
 	char	   *db_name;		/* database name */
 	char		db_tablespace[MAXPGPATH];	/* database default tablespace
 											 * path */
+	RelInfoArr	rel_arr;		/* array of all user relinfos */
+} DbInfo;
+
+/*
+ * Locale information about a database.
+ */
+typedef struct
+{
 	char	   *db_collate;
 	char	   *db_ctype;
 	char		db_collprovider;
 	char	   *db_iculocale;
 	int			db_encoding;
-	RelInfoArr	rel_arr;		/* array of all user relinfos */
-} DbInfo;
+} DbLocaleInfo;
 
 typedef struct
 {
@@ -252,6 +259,7 @@ typedef enum
 typedef struct
 {
 	ControlData controldata;	/* pg_control information */
+	DbLocaleInfo *template0;	/* template0 locale info */
 	DbInfoArr	dbarr;			/* dbinfos array */
 	char	   *pgdata;			/* pathname for cluster's $PGDATA directory */
 	char	   *pgconfig;		/* pathname for cluster's config file
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 62a8fa9d8b..e6990aafc4 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -90,15 +90,44 @@ my $oldnode =
   PostgreSQL::Test::Cluster->new('old_node',
 	install_path => $ENV{oldinstall});
 
+my %node_params = ();
+
 # To increase coverage of non-standard segment size and group access without
 # increasing test runtime, run these tests with a custom setting.
 # --allow-group-access and --wal-segsize have been added in v11.
-my %node_params = ();
-$node_params{extra} = [ '--wal-segsize', '1', '--allow-group-access' ]
+my $nonstandard = [];
+$nonstandard = ['--wal-segsize', '1', '--allow-group-access']
   if $oldnode->pg_version >= 11;
+$node_params{extra} = $nonstandard;
+
+# Test that pg_upgrade copies the locale settings of template0 from
+# the old to the new cluster.
+push(@{$node_params{extra}}, ('--encoding', 'UTF-8'));
+push(@{$node_params{extra}}, ('--locale-provider', 'icu'))
+  if $oldnode->pg_version >= 15 && $ENV{with_icu} eq 'yes';
+
 $oldnode->init(%node_params);
 $oldnode->start;
 
+my $result;
+
+my $original_encoding = "6"; # UTF-8
+$result = $oldnode->safe_psql(
+	'postgres', q{SELECT encoding FROM pg_database WHERE datname='template0'});
+is($result, $original_encoding,
+		"expected template0 in old cluster to have encoding '$original_encoding'"
+	);
+
+my $original_provider = "c";
+$original_provider = "i"
+  if $oldnode->pg_version >= 15 && $ENV{with_icu} eq 'yes';
+
+$result = $oldnode->safe_psql(
+	'postgres', q{SELECT datlocprovider FROM pg_database WHERE datname='template0'});
+is($result, $original_provider,
+		"expected template0 in old cluster to have locale provider '$original_provider'"
+	);
+
 # The default location of the source code is the root of this directory.
 my $srcdir = abs_path("../../..");
 
@@ -168,6 +197,16 @@ else
 
 # Initialize a new node for the upgrade.
 my $newnode = PostgreSQL::Test::Cluster->new('new_node');
+
+# Reset to original parameters.
+$node_params{extra} = $nonstandard;
+
+# The new cluster will be initialized with locale provider libc and
+# encoding SQL_ASCII, but these settings will be overwritten with
+# those of the old cluster.
+push(@{$node_params{extra}}, ('--encoding', 'SQL_ASCII'));
+push(@{$node_params{extra}}, ('--locale-provider', 'libc'));
+
 $newnode->init(%node_params);
 
 my $newbindir = $newnode->config_data('--bindir');
@@ -338,6 +377,19 @@ if (-d $log_path)
 	}
 }
 
+# Test that upgraded cluster has original locale settings.
+$result = $newnode->safe_psql(
+	'postgres', q{SELECT encoding FROM pg_database WHERE datname='template0'});
+is($result, $original_encoding,
+		"expected template0 in upgraded cluster to have encoding '$original_encoding'"
+	);
+
+$result = $newnode->safe_psql(
+	'postgres', q{SELECT datlocprovider FROM pg_database WHERE datname='template0'});
+is($result, $original_provider,
+		"expected template0 in upgraded cluster to have locale provider '$original_provider'"
+	);
+
 # Second dump from the upgraded instance.
 @dump_command = (
 	'pg_dumpall', '--no-sync', '-d', $newnode->connstr('postgres'),
-- 
2.34.1

v5-0001-Build-ICU-support-by-default.patchtext/x-patch; charset=UTF-8; name=v5-0001-Build-ICU-support-by-default.patchDownload

From 949de097cdd5188a49f10b7f87fa1df4cecf2f63 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 10 Feb 2023 12:08:11 -0800
Subject: [PATCH v5 1/3] Build ICU support by default.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.camel@j-davis.com
---
 .cirrus.yml                    |  1 +
 configure                      | 36 ++++++----------
 configure.ac                   |  8 +++-
 doc/src/sgml/installation.sgml | 76 +++++++++++++++++++---------------
 4 files changed, 61 insertions(+), 60 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index f212978752..34450a9c7b 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -751,6 +751,7 @@ task:
       time ./configure \
         --host=x86_64-w64-mingw32 \
         --enable-cassert \
+        --without-icu \
         CC="ccache x86_64-w64-mingw32-gcc" \
         CXX="ccache x86_64-w64-mingw32-g++"
       make -s -j${BUILD_JOBS} clean
diff --git a/configure b/configure
index e35769ea73..436c2446e8 100755
--- a/configure
+++ b/configure
@@ -1558,7 +1558,7 @@ Optional Packages:
                           set WAL block size in kB [8]
   --with-CC=CMD           set compiler (deprecated)
   --with-llvm             build with LLVM based JIT support
-  --with-icu              build with ICU support
+  --without-icu           build without ICU support
   --with-tcl              build Tcl modules (PL/Tcl)
   --with-tclconfig=DIR    tclConfig.sh is in DIR
   --with-perl             build Perl modules (PL/Perl)
@@ -8401,7 +8401,9 @@ $as_echo "#define USE_ICU 1" >>confdefs.h
   esac
 
 else
-  with_icu=no
+  with_icu=yes
+
+$as_echo "#define USE_ICU 1" >>confdefs.h
 
 fi
 
@@ -8470,31 +8472,17 @@ fi
 	# Put the nasty error message in config.log where it belongs
 	echo "$ICU_PKG_ERRORS" >&5
 
-	as_fn_error $? "Package requirements (icu-uc icu-i18n) were not met:
-
-$ICU_PKG_ERRORS
-
-Consider adjusting the PKG_CONFIG_PATH environment variable if you
-installed software in a non-standard prefix.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details." "$LINENO" 5
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 elif test $pkg_failed = untried; then
         { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
 $as_echo "no" >&6; }
-	{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
-$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
-as_fn_error $? "The pkg-config script could not be found or is too old.  Make sure it
-is in your PATH or set the PKG_CONFIG environment variable to the full
-path to pkg-config.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details.
-
-To get pkg-config, see <http://pkg-config.freedesktop.org/>.
-See \`config.log' for more details" "$LINENO" 5; }
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 else
 	ICU_CFLAGS=$pkg_cv_ICU_CFLAGS
 	ICU_LIBS=$pkg_cv_ICU_LIBS
diff --git a/configure.ac b/configure.ac
index af23c15cb2..215e32120f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -853,13 +853,17 @@ AC_SUBST(enable_thread_safety)
 # ICU
 #
 AC_MSG_CHECKING([whether to build with ICU support])
-PGAC_ARG_BOOL(with, icu, no, [build with ICU support],
+PGAC_ARG_BOOL(with, icu, yes, [build without ICU support],
               [AC_DEFINE([USE_ICU], 1, [Define to build with ICU support. (--with-icu)])])
 AC_MSG_RESULT([$with_icu])
 AC_SUBST(with_icu)
 
 if test "$with_icu" = yes; then
-  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n)
+  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n, [],
+    [AC_MSG_ERROR([ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support.])])
 fi
 
 #
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 5affb64d95..d4d0cbae43 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -146,6 +146,35 @@ documentation.  See standalone-profile.xsl for details.
       <application>pg_restore</application>.
      </para>
     </listitem>
+
+    <listitem>
+     <para>
+      The ICU locale provider (see <xref linkend="locale-providers"/>) is used by default. If you don't want to use it then you must specify the <option>--without-icu</option> option to <filename>configure</filename>. Using this option disables support for ICU collation features (see <xref linkend="collation"/>).
+     </para>
+     <para>
+      ICU support requires the <productname>ICU4C</productname> package to be
+      installed.  The minimum required version of
+      <productname>ICU4C</productname> is currently 4.2.
+     </para>
+
+     <para>
+      By default,
+      <productname>pkg-config</productname><indexterm><primary>pkg-config</primary></indexterm>
+      will be used to find the required compilation options.  This is
+      supported for <productname>ICU4C</productname> version 4.6 and later.
+      For older versions, or if <productname>pkg-config</productname> is not
+      available, the variables <envar>ICU_CFLAGS</envar> and
+      <envar>ICU_LIBS</envar> can be specified to
+      <filename>configure</filename>, like in this example:
+<programlisting>
+./configure ... ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
+</programlisting>
+      (If <productname>ICU4C</productname> is in the default search path
+      for the compiler, then you still need to specify nonempty strings in
+      order to avoid use of <productname>pkg-config</productname>, for
+      example, <literal>ICU_CFLAGS=' '</literal>.)
+     </para>
+    </listitem>
    </itemizedlist>
   </para>
 
@@ -926,40 +955,6 @@ build-postgresql:
        </listitem>
       </varlistentry>
 
-      <varlistentry id="configure-option-with-icu">
-       <term><option>--with-icu</option></term>
-       <listitem>
-        <para>
-         Build with support for
-         the <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
-         library, enabling use of ICU collation
-         features<phrase condition="standalone-ignore"> (see
-         <xref linkend="collation"/>)</phrase>.
-         This requires the <productname>ICU4C</productname> package
-         to be installed.  The minimum required version
-         of <productname>ICU4C</productname> is currently 4.2.
-        </para>
-
-        <para>
-         By default,
-         <productname>pkg-config</productname><indexterm><primary>pkg-config</primary></indexterm>
-         will be used to find the required compilation options.  This is
-         supported for <productname>ICU4C</productname> version 4.6 and later.
-         For older versions, or if <productname>pkg-config</productname> is
-         not available, the variables <envar>ICU_CFLAGS</envar>
-         and <envar>ICU_LIBS</envar> can be specified
-         to <filename>configure</filename>, like in this example:
-<programlisting>
-./configure ... --with-icu ICU_CFLAGS='-I/some/where/include' ICU_LIBS='-L/some/where/lib -licui18n -licuuc -licudata'
-</programlisting>
-         (If <productname>ICU4C</productname> is in the default search path
-         for the compiler, then you still need to specify nonempty strings in
-         order to avoid use of <productname>pkg-config</productname>, for
-         example, <literal>ICU_CFLAGS=' '</literal>.)
-        </para>
-       </listitem>
-      </varlistentry>
-
       <varlistentry id="configure-with-llvm">
        <term><option>--with-llvm</option></term>
        <listitem>
@@ -1231,6 +1226,19 @@ build-postgresql:
 
      <variablelist>
 
+      <varlistentry id="configure-option-without-icu">
+       <term><option>--without-icu</option></term>
+       <listitem>
+        <para>
+         Build without support for the
+         <productname>ICU</productname><indexterm><primary>ICU</primary></indexterm>
+         library, disabling the use of ICU collation features<phrase
+         condition="standalone-ignore"> (see <xref
+         linkend="collation"/>)</phrase>.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="configure-option-without-readline">
        <term><option>--without-readline</option></term>
        <listitem>
-- 
2.34.1

pgsql@j-davis.com

almost 3 years ago

In reply to: Jeff Davis (#44)

Re: Move defaults toward ICU in 16?

On Fri, 2023-03-03 at 21:45 -0800, Jeff Davis wrote:

0002: update template0 in new cluster (as described above)

I think 0002 is about ready and I plan to commit it soon unless someone
has more comments.

I'm holding off on 0001 for now, because you objected. But I still
think 0001 is a good idea so I'd like to hear more before I withdraw
it.

0003: default initdb to use ICU

This is also about ready, and I plan to commit this soon after 0002.

A different issue: unaccent is calling t_isspace(), which is just not
properly handling locales. t_isspace() always passes NULL as the last
argument to char2wchar. There are TODO comments throughout that file.

I posted a bug report and patch for this issue:

/messages/by-id/79e4354d9eccfdb00483146a6b9f6295202e7890.camel@j-davis.com

Regards,
Jeff Davis

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#45)

Re: Move defaults toward ICU in 16?

On 08.03.23 06:55, Jeff Davis wrote:

On Fri, 2023-03-03 at 21:45 -0800, Jeff Davis wrote:

0002: update template0 in new cluster (as described above)

I think 0002 is about ready and I plan to commit it soon unless someone
has more comments.

0002 seems fine to me.

I'm holding off on 0001 for now, because you objected. But I still
think 0001 is a good idea so I'd like to hear more before I withdraw
it.

Let's come back to that after dealing with the other two.

0003: default initdb to use ICU

This is also about ready, and I plan to commit this soon after 0002.

This seems mostly ok to me. I have a few small comments.

+ default, ICU obtains the ICU locale from the ICU default collator.

This seems to be a fancy way of saying, the default ICU locale will be
set to something that approximates what you have set your operating
system to. Which is what we want, I think. Can we say this in a more
user-friendly way?

+static void
+check_icu_locale()

should be check_icu_locale(void)

+       if (U_ICU_VERSION_MAJOR_NUM >= 54)
+       {

If we're going to add more of these mysterious version checks, could we
add a comment about what they are for?

However, I suspect what this chunk is doing is some sort of
canonicalization/language-tag conversion, which per the other thread, I
have some questions about.

How about for this patch, we skip this part and just do the else branch

+ icu_locale = pg_strdup(default_locale);

and then put the canonicalization business into the canonicalization
patch set?

pgsql@j-davis.com

almost 3 years ago

In reply to: Peter Eisentraut (#46)

Re: Move defaults toward ICU in 16?

On Thu, 2023-03-09 at 10:36 +0100, Peter Eisentraut wrote:

0002 seems fine to me.

Committed 0002 with some test improvements.

Let's come back to that after dealing with the other two.

Leaving 0001 open for now.

0003 committed after addressing your comments.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Jeff Davis (#47)

Re: Move defaults toward ICU in 16?

On 09.03.23 20:14, Jeff Davis wrote:

Let's come back to that after dealing with the other two.

Leaving 0001 open for now.

I suspect making a change like this now would result in a bloodbath on
the build farm that we could do without. I suggest revisiting this
after the commit fest ends.

Peter Eisentraut

peter.eisentraut@enterprisedb.com

almost 3 years ago

In reply to: Peter Eisentraut (#48)

Re: Move defaults toward ICU in 16?

On 16.03.23 14:52, Peter Eisentraut wrote:

On 09.03.23 20:14, Jeff Davis wrote:

Let's come back to that after dealing with the other two.

Leaving 0001 open for now.

I suspect making a change like this now would result in a bloodbath on
the build farm that we could do without. I suggest revisiting this
after the commit fest ends.

I don't object to this patch. I suggest waiting until next week to
commit it and then see what happens. It's easy to revert if it goes
terribly.

pryzby@telsasoft.com

over 2 years ago

In reply to: Peter Eisentraut (#49)

Re: Move defaults toward ICU in 16?

On Wed, Apr 05, 2023 at 09:33:25AM +0200, Peter Eisentraut wrote:

On 16.03.23 14:52, Peter Eisentraut wrote:

On 09.03.23 20:14, Jeff Davis wrote:

Let's come back to that after dealing with the other two.

Leaving 0001 open for now.

I suspect making a change like this now would result in a bloodbath on
the build farm that we could do without.ï¿½ I suggest revisiting this
after the commit fest ends.

I don't object to this patch. I suggest waiting until next week to commit
it and then see what happens. It's easy to revert if it goes terribly.

Is this still being considered for v16 ?

--
Justin

pgsql@j-davis.com

over 2 years ago

In reply to: Justin Pryzby (#50)

Re: Move defaults toward ICU in 16?

On Mon, 2023-04-17 at 08:23 -0500, Justin Pryzby wrote:

I don't object to this patch. I suggest waiting until next week to
commit
it and then see what happens. It's easy to revert if it goes
terribly.

Is this still being considered for v16 ?

Yes, unless someone raises a procedural objection.

Is now a reasonable time to check it in and see what breaks? It looks
like there are quite a few buildfarm members that specify neither --
with-icu nor --without-icu.

Regards,
Jeff Davis

tgl@sss.pgh.pa.us

over 2 years ago

In reply to: Jeff Davis (#51)

Re: Move defaults toward ICU in 16?

Jeff Davis <pgsql@j-davis.com> writes:

Is now a reasonable time to check it in and see what breaks? It looks
like there are quite a few buildfarm members that specify neither --
with-icu nor --without-icu.

I see you just pinged buildfarm-members again, so I'd think it's
polite to give people 24 hours or so to deal with that before
you break things.

(My animals are all set, I believe.)

regards, tom lane

Jonathan S. Katz

jkatz@postgresql.org

over 2 years ago

In reply to: Tom Lane (#52)

Re: Move defaults toward ICU in 16?

On 4/17/23 2:33 PM, Tom Lane wrote:

Jeff Davis <pgsql@j-davis.com> writes:

Is now a reasonable time to check it in and see what breaks? It looks
like there are quite a few buildfarm members that specify neither --
with-icu nor --without-icu.

I see you just pinged buildfarm-members again, so I'd think it's
polite to give people 24 hours or so to deal with that before
you break things.

[RMT hat]

This thread has fallen silent and the RMT wanted to check in.

The RMT did have a brief discussion on $SUBJECT. We agree with several
points that regardless of if/when ICU becomes the default collation
provider for PostgreSQL, we'll likely have to flush out several issues.
The question is how long we want that period to be before releasing the
default.

Right now, and in absence of critical issues or objections, the RMT is
OK with leaving in ICU as the default collation provider for Beta 1. If
we're to revert back to glibc, we recommend doing this before Beta 2.

However, if there are strong objections to this proposal, please do
state them.

Thanks,

Jonathan