new environment variable INITDB_LOCALE_PROVIDER

Started by Jeff Davis, 6 months ago, 13 messages
#1 Jeff Davis
pgsql@j-davis.com
1 attachment(s)

$SUBJECT makes it easier to test other providers, especially the
regression tests.

For this to be useful, it should avoid throwing an error for plain
"initdb" (without locale flags specified), which means we need defaults
for the builtin locale or the ICU locale. I chose "C.UTF-8" and "und"
(we could add environment variables for those too, but that would
raise some questions when --locale is also specified).
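
As a rough illustration of the defaulting described above (a sketch, not the patch itself): the helper name default_locale_for is hypothetical, and the one-character provider codes are assumed to match the values stored in pg_database.datlocprovider.

```c
#include <assert.h>
#include <stddef.h>

/* Provider codes, assumed to match pg_database.datlocprovider. */
#define PROVIDER_BUILTIN	'b'
#define PROVIDER_ICU		'i'
#define PROVIDER_LIBC		'c'

/*
 * Hypothetical helper: pick the locale to use for a provider when the
 * user did not request one. A requested locale always wins.
 */
const char *
default_locale_for(char provider, const char *requested)
{
	assert(provider == PROVIDER_BUILTIN ||
		   provider == PROVIDER_ICU ||
		   provider == PROVIDER_LIBC);

	if (requested != NULL)
		return requested;

	switch (provider)
	{
		case PROVIDER_BUILTIN:
			return "C.UTF-8";	/* default chosen in this proposal */
		case PROVIDER_ICU:
			return "und";		/* default chosen in this proposal */
		default:
			return NULL;		/* libc: left to the environment */
	}
}
```

With this shape, a plain "initdb" never reaches a "locale must be specified" error for the builtin or ICU providers, because a missing locale is filled in before validation.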

Another benefit is that this would make it easier to change the initdb
default, which is being discussed here:

/messages/by-id/9b259f4c532943e428e9665122f37c099bab250e.camel@j-davis.com

One annoyance is that the tests don't pass when
INITDB_LOCALE_PROVIDER=icu. That's because a lot of tests use either
--locale=C or --no-locale, and ICU doesn't have a way to interpret that.
We could force the provider to be builtin in that case, I suppose.
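
The fallback floated in the last sentence could look roughly like the following. This is only a sketch of an idea that the patch does not implement; the function name provider_for_c_locale and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical fallback: if ICU was selected (for example via the
 * environment) but the caller asked for the "C" locale, either with
 * --locale=C or with --no-locale, switch to the builtin provider,
 * which does understand "C".
 */
char
provider_for_c_locale(char provider, const char *locale, bool no_locale)
{
	bool		wants_c;

	assert(provider != '\0');

	wants_c = no_locale || (locale != NULL && strcmp(locale, "C") == 0);

	if (provider == 'i' && wants_c)
		return 'b';				/* builtin */

	return provider;			/* anything else is left alone */
}
```

Whether a silent provider switch like this is desirable is a design question for the thread; the sketch only shows how small the rule itself would be.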

Another annoyance is that, if INITDB_LOCALE_PROVIDER=builtin, and
LC_CTYPE is not UTF-8-compatible, then we need to force LC_CTYPE=C.
That affects fewer things than it would with the libc provider, but it
still affects some things.
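
A simplified sketch of the LC_CTYPE rule follows. It judges compatibility from the codeset suffix of the locale name alone, which is an assumption made to keep the example self-contained; the patch instead asks pg_get_encoding_from_locale(). Both function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive match of codeset[0..len) against a full name. */
static bool
codeset_equals(const char *codeset, size_t len, const char *name)
{
	size_t		i;

	if (strlen(name) != len)
		return false;
	for (i = 0; i < len; i++)
	{
		if (tolower((unsigned char) codeset[i]) !=
			tolower((unsigned char) name[i]))
			return false;
	}
	return true;
}

/*
 * Simplified stand-in for the real check: "C" and "POSIX" work with any
 * encoding; otherwise require a UTF-8 codeset in the name, e.g.
 * "en_US.UTF-8" or "de_DE.utf8@euro". Names without a codeset are
 * treated as incompatible here, which is a conservative assumption.
 */
bool
name_looks_utf8_compatible(const char *localename)
{
	const char *dot;
	size_t		len;

	assert(localename != NULL);

	if (strcmp(localename, "C") == 0 || strcmp(localename, "POSIX") == 0)
		return true;

	dot = strchr(localename, '.');
	if (dot == NULL)
		return false;

	len = strcspn(dot + 1, "@");	/* stop before any "@modifier" */
	return codeset_equals(dot + 1, len, "UTF-8") ||
		codeset_equals(dot + 1, len, "utf8");
}

/* Force LC_CTYPE to "C" for builtin UTF-8 locales when incompatible. */
const char *
effective_lc_ctype(char provider, const char *datlocale, const char *lc_ctype)
{
	if (provider == 'b' && strcmp(datlocale, "C") != 0 &&
		!name_looks_utf8_compatible(lc_ctype))
		return "C";
	return lc_ctype;
}
```

For example, effective_lc_ctype('b', "C.UTF-8", "ja_JP.EUC-JP") yields "C", while the same call with the libc provider ('c') leaves LC_CTYPE untouched.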

Regards,
Jeff Davis

Attachments:

v1-0001-initdb-new-environment-variable-INITDB_LOCALE_PRO.patch (text/x-patch; charset=UTF-8)
From e5876ac466d5158d3aafa1cf92dc54ff45a6b996 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 17 Jul 2025 13:07:50 -0700
Subject: [PATCH v1] initdb: new environment variable INITDB_LOCALE_PROVIDER

Controls the locale provider used by initdb.

Also specifies defaults for both the builtin provider and ICU, so that
plain initdb (without locale arguments specified) will succeed for any
provider. For the builtin provider's UTF-8 based locales, if LC_CTYPE
is not compatible with UTF-8, forces LC_CTYPE=C to avoid such an
error.
---
 src/backend/commands/dbcommands.c             |  2 +-
 src/bin/initdb/initdb.c                       | 71 +++++++++++++++----
 src/bin/initdb/t/001_initdb.pl                | 11 +--
 src/bin/scripts/t/020_createdb.pl             | 69 ++++++++++--------
 .../modules/test_escape/t/001_test_escape.pl  |  2 +-
 5 files changed, 104 insertions(+), 51 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 502a45163c8..92a396b8406 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1052,7 +1052,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		dbctype = src_ctype;
 	if (dblocprovider == '\0')
 		dblocprovider = src_locprovider;
-	if (dblocale == NULL)
+	if (dblocale == NULL && dblocprovider == src_locprovider)
 		dblocale = src_locale;
 	if (dbicurules == NULL)
 		dbicurules = src_icurules;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 62bbd08d9f6..60e5c9d4a31 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -82,6 +82,9 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 
+#define DEFAULT_LOCALE_PROVIDER		COLLPROVIDER_LIBC
+#define DEFAULT_BUILTIN_LOCALE		"C.UTF-8"
+#define DEFAULT_ICU_LOCALE			"und"
 
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
@@ -144,7 +147,7 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
-static char locale_provider = COLLPROVIDER_LIBC;
+static char locale_provider = '\0';
 static bool builtin_locale_specified = false;
 static char *datlocale = NULL;
 static bool icu_locale_specified = false;
@@ -2468,12 +2471,11 @@ setlocales(void)
 	lc_messages = canonname;
 #endif
 
-	if (locale_provider != COLLPROVIDER_LIBC && datlocale == NULL)
-		pg_fatal("locale must be specified if provider is %s",
-				 collprovider_name(locale_provider));
-
 	if (locale_provider == COLLPROVIDER_BUILTIN)
 	{
+		if (!datlocale)
+			datlocale = DEFAULT_BUILTIN_LOCALE;
+
 		if (strcmp(datlocale, "C") == 0)
 			canonname = "C";
 		else if (strcmp(datlocale, "C.UTF-8") == 0 ||
@@ -2491,11 +2493,13 @@ setlocales(void)
 	{
 		char	   *langtag;
 
+		if (!datlocale)
+			datlocale = DEFAULT_ICU_LOCALE;
+
 		/* canonicalize to a language tag */
 		langtag = icu_language_tag(datlocale);
 		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
 			   langtag, datlocale);
-		pg_free(datlocale);
 		datlocale = langtag;
 
 		icu_validate_locale(datlocale);
@@ -2686,6 +2690,30 @@ setup_locale_encoding(void)
 {
 	setlocales();
 
+	/*
+	 * For the builtin provider (other than the "C" locale), default encoding
+	 * to UTF-8. If lc_ctype is not compatible with UTF-8, also force lc_ctype
+	 * to "C". On windows, all locales are compatible with UTF-8.
+	 */
+	if (!encoding && locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
+	{
+#ifndef WIN32
+		int			ctype_enc = pg_get_encoding_from_locale(lc_ctype, false);
+
+		if (!(ctype_enc == PG_UTF8 ||
+			  ctype_enc == PG_SQL_ASCII))
+		{
+			pg_log_warning("setting LC_CTYPE to \"C\"");
+			pg_log_warning_detail("Encoding of LC_CTYPE locale \"%s\" does not match encoding required by builtin locale \"%s\".",
+								  lc_ctype, datlocale);
+			pg_log_warning_hint("Specify a UTF-8 compatible locale with --lc-ctype, or choose a different locale provider.");
+			lc_ctype = "C";
+		}
+#endif
+		encoding = "UTF-8";
+	}
+
 	if (locale_provider == COLLPROVIDER_LIBC &&
 		strcmp(lc_ctype, lc_collate) == 0 &&
 		strcmp(lc_ctype, lc_time) == 0 &&
@@ -2721,10 +2749,11 @@ setup_locale_encoding(void)
 		ctype_enc = pg_get_encoding_from_locale(lc_ctype, true);
 
 		/*
-		 * If ctype_enc=SQL_ASCII, it's compatible with any encoding. ICU does
-		 * not support SQL_ASCII, so select UTF-8 instead.
+		 * If ctype_enc=SQL_ASCII, it's compatible with any encoding. Neither
+		 * ICU nor the builtin provider support SQL_ASCII, so select UTF-8
+		 * instead.
 		 */
-		if (locale_provider == COLLPROVIDER_ICU && ctype_enc == PG_SQL_ASCII)
+		if (locale_provider != COLLPROVIDER_LIBC && ctype_enc == PG_SQL_ASCII)
 			ctype_enc = PG_UTF8;
 
 		if (ctype_enc == -1)
@@ -2773,11 +2802,10 @@ setup_locale_encoding(void)
 		!check_locale_encoding(lc_collate, encodingid))
 		exit(1);				/* check_locale_encoding printed the error */
 
-	if (locale_provider == COLLPROVIDER_BUILTIN)
+	if (locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
 	{
-		if ((strcmp(datlocale, "C.UTF-8") == 0 ||
-			 strcmp(datlocale, "PG_UNICODE_FAST") == 0) &&
-			encodingid != PG_UTF8)
+		if (encodingid != PG_UTF8)
 			pg_fatal("builtin provider locale \"%s\" requires encoding \"%s\"",
 					 datlocale, "UTF-8");
 	}
@@ -3402,7 +3430,6 @@ main(int argc, char *argv[])
 		}
 	}
 
-
 	/*
 	 * Non-option argument specifies data directory as long as it wasn't
 	 * already specified with -D / --pgdata
@@ -3421,6 +3448,22 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
+	if (locale_provider == '\0')
+	{
+		char	   *provider_name = getenv("INITDB_LOCALE_PROVIDER");
+
+		if (!provider_name)
+			locale_provider = DEFAULT_LOCALE_PROVIDER;
+		else if (strcmp(provider_name, "builtin") == 0)
+			locale_provider = COLLPROVIDER_BUILTIN;
+		else if (strcmp(provider_name, "icu") == 0)
+			locale_provider = COLLPROVIDER_ICU;
+		else if (strcmp(provider_name, "libc") == 0)
+			locale_provider = COLLPROVIDER_LIBC;
+		else
+			pg_fatal("unrecognized locale provider: %s", provider_name);
+	}
+
 	if (builtin_locale_specified && locale_provider != COLLPROVIDER_BUILTIN)
 		pg_fatal("%s cannot be specified unless locale provider \"%s\" is chosen",
 				 "--builtin-locale", "builtin");
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index b7ef7ed8d06..ba3211a4aa6 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -113,14 +113,13 @@ SKIP:
 
 if ($ENV{with_icu} eq 'yes')
 {
-	command_fails_like(
+	command_ok(
 		[
 			'initdb', '--no-sync',
 			'--locale-provider' => 'icu',
 			"$tempdir/data2"
 		],
-		qr/initdb: error: locale must be specified if provider is icu/,
-		'locale provider ICU requires --icu-locale');
+		'locale provider ICU default locale');
 
 	command_ok(
 		[
@@ -200,13 +199,15 @@ else
 		'locale provider ICU fails since no ICU support');
 }
 
-command_fails(
+command_like(
 	[
 		'initdb', '--no-sync',
+		'--auth' => 'trust',
 		'--locale-provider' => 'builtin',
 		"$tempdir/data6"
 	],
-	'locale provider builtin fails without --locale');
+	qr/^\s+default collation:\s+C.UTF-8\n/ms,
+	'locale provider builtin defaults to C.UTF-8');
 
 command_ok(
 	[
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index a8293390ede..6003d213e89 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -16,6 +16,9 @@ my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
 $node->start;
 
+my $datlocprovider = $node->safe_psql('postgres',
+	"SELECT datlocprovider FROM pg_database WHERE datname='template1'");
+
 $node->issues_sql_like(
 	[ 'createdb', 'foobar1' ],
 	qr/statement: CREATE DATABASE foobar1/,
@@ -33,19 +36,6 @@ $node->issues_sql_like(
 
 if ($ENV{with_icu} eq 'yes')
 {
-	# This fails because template0 uses libc provider and has no ICU
-	# locale set.  It would succeed if template0 used the icu
-	# provider.  XXX Maybe split into multiple tests?
-	$node->command_fails(
-		[
-			'createdb',
-			'--template' => 'template0',
-			'--encoding' => 'UTF8',
-			'--locale-provider' => 'icu',
-			'foobar4',
-		],
-		'create database with ICU fails without ICU locale specified');
-
 	$node->issues_sql_like(
 		[
 			'createdb',
@@ -130,14 +120,18 @@ else
 		'create database with ICU fails since no ICU support');
 }
 
-$node->command_fails(
-	[
-		'createdb',
-		'--template' => 'template0',
-		'--locale-provider' => 'builtin',
-		'tbuiltin1',
-	],
-	'create database with provider "builtin" fails without --locale');
+if ($datlocprovider eq 'c')
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template0',
+			'--encoding' => 'UTF8',
+			'--locale-provider' => 'builtin',
+			'foobar4',
+		],
+		'create database with builtin provider fails without locale specified');
+}
 
 $node->command_ok(
 	[
@@ -219,15 +213,30 @@ $node->command_fails(
 	],
 	'create database with provider "builtin" and ICU_RULES=""');
 
-$node->command_fails(
-	[
-		'createdb',
-		'--template' => 'template1',
-		'--locale-provider' => 'builtin',
-		'--locale' => 'C',
-		'tbuiltin9',
-	],
-	'create database with provider "builtin" not matching template');
+if ($datlocprovider eq 'b')
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template1',
+			'--locale-provider' => 'libc',
+			'--locale' => 'C',
+			'tbuiltin9',
+		],
+		'create database with provider "libc" not matching template');
+}
+else
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template1',
+			'--locale-provider' => 'builtin',
+			'--locale' => 'C',
+			'tbuiltin9',
+		],
+		'create database with provider "builtin" not matching template');
+}
 
 $node->command_fails([ 'createdb', 'foobar1' ],
 	'fails if database already exists');
diff --git a/src/test/modules/test_escape/t/001_test_escape.pl b/src/test/modules/test_escape/t/001_test_escape.pl
index 0d5aec3ed74..b29f093db28 100644
--- a/src/test/modules/test_escape/t/001_test_escape.pl
+++ b/src/test/modules/test_escape/t/001_test_escape.pl
@@ -12,7 +12,7 @@ $node->init();
 $node->start();
 
 $node->safe_psql('postgres',
-	q(CREATE DATABASE db_sql_ascii ENCODING "sql_ascii" TEMPLATE template0;));
+	q(CREATE DATABASE db_sql_ascii LOCALE_PROVIDER "builtin" LOCALE "C" ENCODING "sql_ascii" TEMPLATE template0;));
 
 my $cmd =
   [ 'test_escape', '--conninfo', $node->connstr . " dbname=db_sql_ascii" ];
-- 
2.43.0

#2 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
1 attachment(s)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Tue, 2025-07-29 at 16:55 -0700, Jeff Davis wrote:

$SUBJECT makes it easier to test other providers, especially the
regression tests.

Rebased.

Changes:

* Use environment variable name PG_LOCALE_PROVIDER, which seems more
consistent.

* Updated doc.

* If the provider is builtin and the LC_CTYPE or LC_COLLATE environment
variables aren't compatible with UTF-8, it can override those to "C".
But if --locale, --lc-ctype, or --lc-collate are specified and
incompatible, they will throw an error instead.
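
The distinction in the last bullet can be summarized as a small decision table. The sketch below uses hypothetical names (LcSource, LcAction, decide_lc_action) and only models the outcome; in the patch the warning and the error are raised in different places.

```c
#include <assert.h>
#include <stdbool.h>

/* Where an LC_CTYPE / LC_COLLATE value came from (hypothetical enum). */
typedef enum
{
	LC_FROM_ENV,
	LC_FROM_COMMAND_LINE
} LcSource;

/* What to do with that value (hypothetical enum). */
typedef enum
{
	LC_KEEP,					/* compatible: use as given */
	LC_OVERRIDE_TO_C,			/* warn and replace with "C" */
	LC_ERROR					/* explicit and incompatible: fail */
} LcAction;

/*
 * Applies only when the provider is builtin with a locale that requires
 * UTF-8. Values inherited from the environment are overridden; values
 * the user typed on the command line are rejected.
 */
LcAction
decide_lc_action(LcSource source, bool utf8_compatible)
{
	assert(source == LC_FROM_ENV || source == LC_FROM_COMMAND_LINE);

	if (utf8_compatible)
		return LC_KEEP;

	return (source == LC_FROM_ENV) ? LC_OVERRIDE_TO_C : LC_ERROR;
}
```

The asymmetry is deliberate in this reading: an environment setting may be incidental, while a command-line option is an explicit request that should not be silently changed.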

Note: when the provider is builtin, the overriding of LC_CTYPE and
LC_COLLATE doesn't matter much. LC_CTYPE affects the translation of
messages from the OS (but not Postgres messages), as well as a few
other places that are likely to be fixed soon (e.g. [1]). LC_COLLATE
has no effect when the provider is builtin. In any case, it only
happens when those environment variables aren't compatible with UTF-8,
and the user hasn't specified any locale settings on the command line.
I see this as more of a detail about how the defaults work together
that can easily be corrected if the user specifies something different.

Also note: if PG_LOCALE_PROVIDER=libc (or is unset), there should be no
behavior change with this patch.

I am planning to commit this soon.

Regards,
Jeff Davis

[1]: /messages/by-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com

Attachments:

v2-0001-initdb-new-environment-variable-PG_LOCALE_PROVIDE.patch (text/x-patch; charset=UTF-8)
From 59fb79143bc2e254cbc892480802c32178d32e02 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 17 Jul 2025 13:07:50 -0700
Subject: [PATCH v2] initdb: new environment variable PG_LOCALE_PROVIDER

Controls the locale provider used by initdb. Useful for testing.

Also specifies defaults for both the builtin provider and ICU, so that
plain initdb (without locale arguments specified) will succeed for any
provider. For the builtin provider's UTF-8 locales, if LC_CTYPE is not
compatible with UTF-8, forces LC_CTYPE=C to avoid such an error.

Discussion: https://postgr.es/m/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
---
 doc/src/sgml/ref/initdb.sgml                  |  11 ++
 src/bin/initdb/initdb.c                       | 103 ++++++++++++++++--
 src/bin/initdb/t/001_initdb.pl                |  11 +-
 src/bin/scripts/t/020_createdb.pl             |  69 +++++++-----
 .../modules/test_escape/t/001_test_escape.pl  |   2 +-
 5 files changed, 150 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 7613174c18b..e0437357d27 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -662,6 +662,17 @@ PostgreSQL documentation
     </listitem>
    </varlistentry>
 
+   <varlistentry id="app-initdb-environment-pg-locale-provider">
+    <term><envar>PG_LOCALE_PROVIDER</envar></term>
+
+    <listitem>
+     <para>
+      Sets the locale provider; can be overridden using the
+      <option>--locale-provider</option> option.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="app-initdb-environment-pg-color">
     <term><envar>PG_COLOR</envar></term>
     <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f7..3b3ca9c377f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -82,6 +82,9 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 
+#define DEFAULT_LOCALE_PROVIDER		COLLPROVIDER_LIBC
+#define DEFAULT_BUILTIN_LOCALE		"C.UTF-8"
+#define DEFAULT_ICU_LOCALE			"und"
 
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
@@ -144,7 +147,7 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
-static char locale_provider = COLLPROVIDER_LIBC;
+static char locale_provider = '\0';
 static bool builtin_locale_specified = false;
 static char *datlocale = NULL;
 static bool icu_locale_specified = false;
@@ -2412,6 +2415,25 @@ icu_validate_locale(const char *loc_str)
 #endif
 }
 
+/*
+ * Is the given locale name UTF-8 compatible?
+ */
+static bool
+utf8_compatible(const char *localename)
+{
+#ifndef WIN32
+	int			ctype_enc;
+
+	Assert(localename != NULL);
+	ctype_enc = pg_get_encoding_from_locale(localename, false);
+
+	return (ctype_enc == PG_UTF8 || ctype_enc == PG_SQL_ASCII);
+#else
+	/* on windows, all locales are compatible with UTF-8 */
+	return true;
+#endif
+}
+
 /*
  * set up the locale variables
  *
@@ -2420,6 +2442,8 @@ icu_validate_locale(const char *loc_str)
 static void
 setlocales(void)
 {
+	bool		ctype_from_env;
+	bool		collate_from_env;
 	char	   *canonname;
 
 	/* set empty lc_* and datlocale values to locale config if set */
@@ -2442,6 +2466,9 @@ setlocales(void)
 			datlocale = locale;
 	}
 
+	ctype_from_env = (lc_ctype == NULL);
+	collate_from_env = (lc_collate == NULL);
+
 	/*
 	 * canonicalize locale names, and obtain any missing values from our
 	 * current environment
@@ -2465,12 +2492,11 @@ setlocales(void)
 	lc_messages = canonname;
 #endif
 
-	if (locale_provider != COLLPROVIDER_LIBC && datlocale == NULL)
-		pg_fatal("locale must be specified if provider is %s",
-				 collprovider_name(locale_provider));
-
 	if (locale_provider == COLLPROVIDER_BUILTIN)
 	{
+		if (!datlocale)
+			datlocale = DEFAULT_BUILTIN_LOCALE;
+
 		if (strcmp(datlocale, "C") == 0)
 			canonname = "C";
 		else if (strcmp(datlocale, "C.UTF-8") == 0 ||
@@ -2488,11 +2514,13 @@ setlocales(void)
 	{
 		char	   *langtag;
 
+		if (!datlocale)
+			datlocale = DEFAULT_ICU_LOCALE;
+
 		/* canonicalize to a language tag */
 		langtag = icu_language_tag(datlocale);
 		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
 			   langtag, datlocale);
-		pg_free(datlocale);
 		datlocale = langtag;
 
 		icu_validate_locale(datlocale);
@@ -2505,6 +2533,46 @@ setlocales(void)
 		pg_fatal("ICU is not supported in this build");
 #endif
 	}
+
+	/*
+	 * If using the builtin provider with a locale requiring UTF-8, avoid
+	 * taking incompatible settings from the environment.
+	 */
+	if (locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
+	{
+		if (!encoding)
+			encoding = "UTF-8";
+
+		/*
+		 * LC_CTYPE has little effect unless using the libc provider, but does
+		 * still affect some places, such translation of error messages from
+		 * the OS. Overriding it here may be an inconvenience, but in the
+		 * absence of specified locale options, it's the best choice.
+		 *
+		 * XXX: minimize the effects of LC_CTYPE when not using libc.
+		 */
+		if (ctype_from_env && !utf8_compatible(lc_ctype))
+		{
+			pg_log_warning("setting LC_CTYPE to \"C\"");
+			pg_log_warning_detail("Encoding of LC_CTYPE locale \"%s\" does not match encoding required by builtin locale \"%s\".",
+								  lc_ctype, datlocale);
+			pg_log_warning_hint("Specify a UTF-8 compatible locale with --lc-ctype, or choose a different locale provider.");
+			lc_ctype = "C";
+		}
+
+		/*
+		 * LC_COLLATE has no effect unless using the libc provider.
+		 */
+		if (collate_from_env && !utf8_compatible(lc_collate))
+		{
+			pg_log_warning("setting LC_COLLATE to \"C\"");
+			pg_log_warning_detail("Encoding of LC_COLLATE locale \"%s\" does not match encoding required by builtin locale \"%s\".",
+								  lc_collate, datlocale);
+			pg_log_warning_hint("Specify a UTF-8 compatible locale with --lc-collate, or choose a different locale provider.");
+			lc_collate = "C";
+		}
+	}
 }
 
 /*
@@ -2770,11 +2838,10 @@ setup_locale_encoding(void)
 		!check_locale_encoding(lc_collate, encodingid))
 		exit(1);				/* check_locale_encoding printed the error */
 
-	if (locale_provider == COLLPROVIDER_BUILTIN)
+	if (locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
 	{
-		if ((strcmp(datlocale, "C.UTF-8") == 0 ||
-			 strcmp(datlocale, "PG_UNICODE_FAST") == 0) &&
-			encodingid != PG_UTF8)
+		if (encodingid != PG_UTF8)
 			pg_fatal("builtin provider locale \"%s\" requires encoding \"%s\"",
 					 datlocale, "UTF-8");
 	}
@@ -3418,6 +3485,22 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
+	if (locale_provider == '\0')
+	{
+		char	   *provider_name = getenv("PG_LOCALE_PROVIDER");
+
+		if (!provider_name)
+			locale_provider = DEFAULT_LOCALE_PROVIDER;
+		else if (strcmp(provider_name, "builtin") == 0)
+			locale_provider = COLLPROVIDER_BUILTIN;
+		else if (strcmp(provider_name, "icu") == 0)
+			locale_provider = COLLPROVIDER_ICU;
+		else if (strcmp(provider_name, "libc") == 0)
+			locale_provider = COLLPROVIDER_LIBC;
+		else
+			pg_fatal("unrecognized locale provider: %s", provider_name);
+	}
+
 	if (builtin_locale_specified && locale_provider != COLLPROVIDER_BUILTIN)
 		pg_fatal("%s cannot be specified unless locale provider \"%s\" is chosen",
 				 "--builtin-locale", "builtin");
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index b7ef7ed8d06..ba3211a4aa6 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -113,14 +113,13 @@ SKIP:
 
 if ($ENV{with_icu} eq 'yes')
 {
-	command_fails_like(
+	command_ok(
 		[
 			'initdb', '--no-sync',
 			'--locale-provider' => 'icu',
 			"$tempdir/data2"
 		],
-		qr/initdb: error: locale must be specified if provider is icu/,
-		'locale provider ICU requires --icu-locale');
+		'locale provider ICU default locale');
 
 	command_ok(
 		[
@@ -200,13 +199,15 @@ else
 		'locale provider ICU fails since no ICU support');
 }
 
-command_fails(
+command_like(
 	[
 		'initdb', '--no-sync',
+		'--auth' => 'trust',
 		'--locale-provider' => 'builtin',
 		"$tempdir/data6"
 	],
-	'locale provider builtin fails without --locale');
+	qr/^\s+default collation:\s+C.UTF-8\n/ms,
+	'locale provider builtin defaults to C.UTF-8');
 
 command_ok(
 	[
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index a8293390ede..6003d213e89 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -16,6 +16,9 @@ my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
 $node->start;
 
+my $datlocprovider = $node->safe_psql('postgres',
+	"SELECT datlocprovider FROM pg_database WHERE datname='template1'");
+
 $node->issues_sql_like(
 	[ 'createdb', 'foobar1' ],
 	qr/statement: CREATE DATABASE foobar1/,
@@ -33,19 +36,6 @@ $node->issues_sql_like(
 
 if ($ENV{with_icu} eq 'yes')
 {
-	# This fails because template0 uses libc provider and has no ICU
-	# locale set.  It would succeed if template0 used the icu
-	# provider.  XXX Maybe split into multiple tests?
-	$node->command_fails(
-		[
-			'createdb',
-			'--template' => 'template0',
-			'--encoding' => 'UTF8',
-			'--locale-provider' => 'icu',
-			'foobar4',
-		],
-		'create database with ICU fails without ICU locale specified');
-
 	$node->issues_sql_like(
 		[
 			'createdb',
@@ -130,14 +120,18 @@ else
 		'create database with ICU fails since no ICU support');
 }
 
-$node->command_fails(
-	[
-		'createdb',
-		'--template' => 'template0',
-		'--locale-provider' => 'builtin',
-		'tbuiltin1',
-	],
-	'create database with provider "builtin" fails without --locale');
+if ($datlocprovider eq 'c')
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template0',
+			'--encoding' => 'UTF8',
+			'--locale-provider' => 'builtin',
+			'foobar4',
+		],
+		'create database with builtin provider fails without locale specified');
+}
 
 $node->command_ok(
 	[
@@ -219,15 +213,30 @@ $node->command_fails(
 	],
 	'create database with provider "builtin" and ICU_RULES=""');
 
-$node->command_fails(
-	[
-		'createdb',
-		'--template' => 'template1',
-		'--locale-provider' => 'builtin',
-		'--locale' => 'C',
-		'tbuiltin9',
-	],
-	'create database with provider "builtin" not matching template');
+if ($datlocprovider eq 'b')
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template1',
+			'--locale-provider' => 'libc',
+			'--locale' => 'C',
+			'tbuiltin9',
+		],
+		'create database with provider "libc" not matching template');
+}
+else
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template1',
+			'--locale-provider' => 'builtin',
+			'--locale' => 'C',
+			'tbuiltin9',
+		],
+		'create database with provider "builtin" not matching template');
+}
 
 $node->command_fails([ 'createdb', 'foobar1' ],
 	'fails if database already exists');
diff --git a/src/test/modules/test_escape/t/001_test_escape.pl b/src/test/modules/test_escape/t/001_test_escape.pl
index 0d5aec3ed74..b29f093db28 100644
--- a/src/test/modules/test_escape/t/001_test_escape.pl
+++ b/src/test/modules/test_escape/t/001_test_escape.pl
@@ -12,7 +12,7 @@ $node->init();
 $node->start();
 
 $node->safe_psql('postgres',
-	q(CREATE DATABASE db_sql_ascii ENCODING "sql_ascii" TEMPLATE template0;));
+	q(CREATE DATABASE db_sql_ascii LOCALE_PROVIDER "builtin" LOCALE "C" ENCODING "sql_ascii" TEMPLATE template0;));
 
 my $cmd =
   [ 'test_escape', '--conninfo', $node->connstr . " dbname=db_sql_ascii" ];
-- 
2.43.0

#3 Chao Li
li.evan.chao@gmail.com
In reply to: Jeff Davis (#2)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Oct 9, 2025, at 12:27, Jeff Davis <pgsql@j-davis.com> wrote:

* If the provider is builtin and the LC_CTYPE or LC_COLLATE environment
variables aren't compatible with UTF-8, it can override those to "C".
But if --locale, --lc-ctype, or --lc-collate are specified and
incompatible, they will throw an error instead.

Overall the code looks good to me. I do have a question about the design here.

Are we assuming that

* if the settings come from command line options, then the user is intentionally doing that, so we throw an error
* if the settings come from env, then the user might not be aware of them, so we only issue a warning?

If that’s the case, I’m not fully convinced by this design. Since initdb is a one-time operation, I think it would be better to require everything to be explicit.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#4 Peter Eisentraut
peter@eisentraut.org
In reply to: Jeff Davis (#2)
Re: new environment variable INITDB_LOCALE_PROVIDER

On 09.10.2025 at 06:27, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2025-07-29 at 16:55 -0700, Jeff Davis wrote:

$SUBJECT makes it easier to test other providers, especially the
regression tests.

Rebased.

Changes:

* Use environment variable name PG_LOCALE_PROVIDER, which seems more
consistent.

Is this not something that could already be done using PG_TEST_INITDB_EXTRA_OPTS ?

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#4)
2 attachment(s)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Fri, 2025-10-10 at 11:32 +0200, Peter Eisentraut wrote:

* Use environment variable name PG_LOCALE_PROVIDER, which seems more
consistent.

Is this not something that could already be done using
PG_TEST_INITDB_EXTRA_OPTS ?

1. PG_LOCALE_PROVIDER is a documented user-facing option, which will
make it easier for users to set their preferred provider in scripts,
etc.

2. This change also creates default locales for the builtin and ICU
providers, so that initdb without any other locale options will succeed
regardless of the provider.
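
The precedence implied by point 1 (the command-line option first, then the environment variable, then the libc default) can be sketched as follows. The function names are hypothetical, and the caller is assumed to pass in the result of getenv("PG_LOCALE_PROVIDER") rather than the sketch reading the environment itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Map a provider name to its one-character code; false if unknown. */
bool
parse_provider(const char *name, char *provider)
{
	assert(name != NULL && provider != NULL);

	if (strcmp(name, "builtin") == 0)
		*provider = 'b';
	else if (strcmp(name, "icu") == 0)
		*provider = 'i';
	else if (strcmp(name, "libc") == 0)
		*provider = 'c';
	else
		return false;			/* caller reports "unrecognized locale provider" */

	return true;
}

/*
 * Resolve the provider: --locale-provider wins, then the environment
 * variable, then libc. Pass NULL for a source that was not given.
 */
bool
resolve_provider(const char *cli_value, const char *env_value, char *provider)
{
	assert(provider != NULL);

	if (cli_value != NULL)
		return parse_provider(cli_value, provider);
	if (env_value != NULL)
		return parse_provider(env_value, provider);

	*provider = 'c';			/* unchanged default: libc */
	return true;
}
```

With both arguments NULL the result is libc, which matches the statement that nothing changes when the variable is unset.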

I broke these up into two patches as v3 to make it easier to
understand.

These patches are independently useful, but also important if we ever
want to change the initdb default to builtin or ICU.

Regards,
Jeff Davis

Attachments:

v3-0001-initdb-add-default-locales-for-builtin-and-ICU-pr.patch (text/x-patch; charset=UTF-8)
From 7ce735b1e85b9f3f9ab6d48588de5824667323d2 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 10 Oct 2025 10:49:05 -0700
Subject: [PATCH v3 1/2] initdb: add default locales for builtin and ICU
 providers.

Allows initdb to succeed with any provider even if no other locale
options are specified. Will be useful if the provider comes from
another source, like an environment variable, or if we ever change the
initdb default provider.

Discussion: https://postgr.es/m/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
---
 src/bin/initdb/initdb.c           | 84 +++++++++++++++++++++++++++----
 src/bin/initdb/t/001_initdb.pl    | 11 ++--
 src/bin/scripts/t/020_createdb.pl | 69 ++++++++++++++-----------
 3 files changed, 120 insertions(+), 44 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f7..e2960e5f17c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -82,6 +82,8 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 
+#define DEFAULT_BUILTIN_LOCALE		"C.UTF-8"
+#define DEFAULT_ICU_LOCALE			"und"
 
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
@@ -2412,6 +2414,25 @@ icu_validate_locale(const char *loc_str)
 #endif
 }
 
+/*
+ * Is the given locale name UTF-8 compatible?
+ */
+static bool
+utf8_compatible(const char *localename)
+{
+#ifndef WIN32
+	int			ctype_enc;
+
+	Assert(localename != NULL);
+	ctype_enc = pg_get_encoding_from_locale(localename, false);
+
+	return (ctype_enc == PG_UTF8 || ctype_enc == PG_SQL_ASCII);
+#else
+	/* on windows, all locales are compatible with UTF-8 */
+	return true;
+#endif
+}
+
 /*
  * set up the locale variables
  *
@@ -2420,6 +2441,8 @@ icu_validate_locale(const char *loc_str)
 static void
 setlocales(void)
 {
+	bool		ctype_from_env;
+	bool		collate_from_env;
 	char	   *canonname;
 
 	/* set empty lc_* and datlocale values to locale config if set */
@@ -2442,6 +2465,9 @@ setlocales(void)
 			datlocale = locale;
 	}
 
+	ctype_from_env = (lc_ctype == NULL);
+	collate_from_env = (lc_collate == NULL);
+
 	/*
 	 * canonicalize locale names, and obtain any missing values from our
 	 * current environment
@@ -2465,12 +2491,11 @@ setlocales(void)
 	lc_messages = canonname;
 #endif
 
-	if (locale_provider != COLLPROVIDER_LIBC && datlocale == NULL)
-		pg_fatal("locale must be specified if provider is %s",
-				 collprovider_name(locale_provider));
-
 	if (locale_provider == COLLPROVIDER_BUILTIN)
 	{
+		if (!datlocale)
+			datlocale = DEFAULT_BUILTIN_LOCALE;
+
 		if (strcmp(datlocale, "C") == 0)
 			canonname = "C";
 		else if (strcmp(datlocale, "C.UTF-8") == 0 ||
@@ -2488,11 +2513,13 @@ setlocales(void)
 	{
 		char	   *langtag;
 
+		if (!datlocale)
+			datlocale = DEFAULT_ICU_LOCALE;
+
 		/* canonicalize to a language tag */
 		langtag = icu_language_tag(datlocale);
 		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
 			   langtag, datlocale);
-		pg_free(datlocale);
 		datlocale = langtag;
 
 		icu_validate_locale(datlocale);
@@ -2505,6 +2532,46 @@ setlocales(void)
 		pg_fatal("ICU is not supported in this build");
 #endif
 	}
+
+	/*
+	 * If using the builtin provider with a locale requiring UTF-8, avoid
+	 * taking incompatible settings from the environment.
+	 */
+	if (locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
+	{
+		if (!encoding)
+			encoding = "UTF-8";
+
+		/*
+		 * LC_CTYPE has little effect unless using the libc provider, but does
+		 * still affect some places, such translation of error messages from
+		 * the OS. Overriding it here may be an inconvenience, but in the
+		 * absence of specified locale options, it's the best choice.
+		 *
+		 * XXX: minimize the effects of LC_CTYPE when not using libc.
+		 */
+		if (ctype_from_env && !utf8_compatible(lc_ctype))
+		{
+			pg_log_warning("setting LC_CTYPE to \"C\"");
+			pg_log_warning_detail("Encoding of LC_CTYPE locale \"%s\" does not match encoding required by builtin locale \"%s\".",
+								  lc_ctype, datlocale);
+			pg_log_warning_hint("Specify a UTF-8 compatible locale with --lc-ctype, or choose a different locale provider.");
+			lc_ctype = "C";
+		}
+
+		/*
+		 * LC_COLLATE has no effect unless using the libc provider.
+		 */
+		if (collate_from_env && !utf8_compatible(lc_collate))
+		{
+			pg_log_warning("setting LC_COLLATE to \"C\"");
+			pg_log_warning_detail("Encoding of LC_COLLATE locale \"%s\" does not match encoding required by builtin locale \"%s\".",
+								  lc_collate, datlocale);
+			pg_log_warning_hint("Specify a UTF-8 compatible locale with --lc-collate, or choose a different locale provider.");
+			lc_collate = "C";
+		}
+	}
 }
 
 /*
@@ -2770,11 +2837,10 @@ setup_locale_encoding(void)
 		!check_locale_encoding(lc_collate, encodingid))
 		exit(1);				/* check_locale_encoding printed the error */
 
-	if (locale_provider == COLLPROVIDER_BUILTIN)
+	if (locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
 	{
-		if ((strcmp(datlocale, "C.UTF-8") == 0 ||
-			 strcmp(datlocale, "PG_UNICODE_FAST") == 0) &&
-			encodingid != PG_UTF8)
+		if (encodingid != PG_UTF8)
 			pg_fatal("builtin provider locale \"%s\" requires encoding \"%s\"",
 					 datlocale, "UTF-8");
 	}
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index b7ef7ed8d06..ba3211a4aa6 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -113,14 +113,13 @@ SKIP:
 
 if ($ENV{with_icu} eq 'yes')
 {
-	command_fails_like(
+	command_ok(
 		[
 			'initdb', '--no-sync',
 			'--locale-provider' => 'icu',
 			"$tempdir/data2"
 		],
-		qr/initdb: error: locale must be specified if provider is icu/,
-		'locale provider ICU requires --icu-locale');
+		'locale provider ICU default locale');
 
 	command_ok(
 		[
@@ -200,13 +199,15 @@ else
 		'locale provider ICU fails since no ICU support');
 }
 
-command_fails(
+command_like(
 	[
 		'initdb', '--no-sync',
+		'--auth' => 'trust',
 		'--locale-provider' => 'builtin',
 		"$tempdir/data6"
 	],
-	'locale provider builtin fails without --locale');
+	qr/^\s+default collation:\s+C.UTF-8\n/ms,
+	'locale provider builtin defaults to C.UTF-8');
 
 command_ok(
 	[
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index a8293390ede..6003d213e89 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -16,6 +16,9 @@ my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
 $node->start;
 
+my $datlocprovider = $node->safe_psql('postgres',
+	"SELECT datlocprovider FROM pg_database WHERE datname='template1'");
+
 $node->issues_sql_like(
 	[ 'createdb', 'foobar1' ],
 	qr/statement: CREATE DATABASE foobar1/,
@@ -33,19 +36,6 @@ $node->issues_sql_like(
 
 if ($ENV{with_icu} eq 'yes')
 {
-	# This fails because template0 uses libc provider and has no ICU
-	# locale set.  It would succeed if template0 used the icu
-	# provider.  XXX Maybe split into multiple tests?
-	$node->command_fails(
-		[
-			'createdb',
-			'--template' => 'template0',
-			'--encoding' => 'UTF8',
-			'--locale-provider' => 'icu',
-			'foobar4',
-		],
-		'create database with ICU fails without ICU locale specified');
-
 	$node->issues_sql_like(
 		[
 			'createdb',
@@ -130,14 +120,18 @@ else
 		'create database with ICU fails since no ICU support');
 }
 
-$node->command_fails(
-	[
-		'createdb',
-		'--template' => 'template0',
-		'--locale-provider' => 'builtin',
-		'tbuiltin1',
-	],
-	'create database with provider "builtin" fails without --locale');
+if ($datlocprovider eq 'c')
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template0',
+			'--encoding' => 'UTF8',
+			'--locale-provider' => 'builtin',
+			'foobar4',
+		],
+		'create database with builtin provider fails without locale specified');
+}
 
 $node->command_ok(
 	[
@@ -219,15 +213,30 @@ $node->command_fails(
 	],
 	'create database with provider "builtin" and ICU_RULES=""');
 
-$node->command_fails(
-	[
-		'createdb',
-		'--template' => 'template1',
-		'--locale-provider' => 'builtin',
-		'--locale' => 'C',
-		'tbuiltin9',
-	],
-	'create database with provider "builtin" not matching template');
+if ($datlocprovider eq 'b')
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template1',
+			'--locale-provider' => 'libc',
+			'--locale' => 'C',
+			'tbuiltin9',
+		],
+		'create database with provider "libc" not matching template');
+}
+else
+{
+	$node->command_fails(
+		[
+			'createdb',
+			'--template' => 'template1',
+			'--locale-provider' => 'builtin',
+			'--locale' => 'C',
+			'tbuiltin9',
+		],
+		'create database with provider "builtin" not matching template');
+}
 
 $node->command_fails([ 'createdb', 'foobar1' ],
 	'fails if database already exists');
-- 
2.43.0

v3-0002-initdb-add-PG_LOCALE_PROVIDER-environment-variabl.patch (text/x-patch; charset=UTF-8)
From ee66fa7b75dafa87e134a820bbd55efda82a5b2d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 10 Oct 2025 11:03:59 -0700
Subject: [PATCH v3 2/2] initdb: add PG_LOCALE_PROVIDER environment variable.

Discussion: https://postgr.es/m/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
---
 doc/src/sgml/ref/initdb.sgml                  | 11 +++++++++++
 src/bin/initdb/initdb.c                       | 19 ++++++++++++++++++-
 .../modules/test_escape/t/001_test_escape.pl  |  2 +-
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 7613174c18b..e0437357d27 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -662,6 +662,17 @@ PostgreSQL documentation
     </listitem>
    </varlistentry>
 
+   <varlistentry id="app-initdb-environment-pg-locale-provider">
+    <term><envar>PG_LOCALE_PROVIDER</envar></term>
+
+    <listitem>
+     <para>
+      Sets the locale provider; can be overridden using the
+      <option>--locale-provider</option> option.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="app-initdb-environment-pg-color">
     <term><envar>PG_COLOR</envar></term>
     <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e2960e5f17c..3b3ca9c377f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -82,6 +82,7 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 
+#define DEFAULT_LOCALE_PROVIDER		COLLPROVIDER_LIBC
 #define DEFAULT_BUILTIN_LOCALE		"C.UTF-8"
 #define DEFAULT_ICU_LOCALE			"und"
 
@@ -146,7 +147,7 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
-static char locale_provider = COLLPROVIDER_LIBC;
+static char locale_provider = '\0';
 static bool builtin_locale_specified = false;
 static char *datlocale = NULL;
 static bool icu_locale_specified = false;
@@ -3484,6 +3485,22 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
+	if (locale_provider == '\0')
+	{
+		char	   *provider_name = getenv("PG_LOCALE_PROVIDER");
+
+		if (!provider_name)
+			locale_provider = DEFAULT_LOCALE_PROVIDER;
+		else if (strcmp(provider_name, "builtin") == 0)
+			locale_provider = COLLPROVIDER_BUILTIN;
+		else if (strcmp(provider_name, "icu") == 0)
+			locale_provider = COLLPROVIDER_ICU;
+		else if (strcmp(provider_name, "libc") == 0)
+			locale_provider = COLLPROVIDER_LIBC;
+		else
+			pg_fatal("unrecognized locale provider: %s", provider_name);
+	}
+
 	if (builtin_locale_specified && locale_provider != COLLPROVIDER_BUILTIN)
 		pg_fatal("%s cannot be specified unless locale provider \"%s\" is chosen",
 				 "--builtin-locale", "builtin");
diff --git a/src/test/modules/test_escape/t/001_test_escape.pl b/src/test/modules/test_escape/t/001_test_escape.pl
index 0d5aec3ed74..b29f093db28 100644
--- a/src/test/modules/test_escape/t/001_test_escape.pl
+++ b/src/test/modules/test_escape/t/001_test_escape.pl
@@ -12,7 +12,7 @@ $node->init();
 $node->start();
 
 $node->safe_psql('postgres',
-	q(CREATE DATABASE db_sql_ascii ENCODING "sql_ascii" TEMPLATE template0;));
+	q(CREATE DATABASE db_sql_ascii LOCALE_PROVIDER "builtin" LOCALE "C" ENCODING "sql_ascii" TEMPLATE template0;));
 
 my $cmd =
   [ 'test_escape', '--conninfo', $node->connstr . " dbname=db_sql_ascii" ];
-- 
2.43.0

#6Jeff Davis
pgsql@j-davis.com
In reply to: Chao Li (#3)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Fri, 2025-10-10 at 12:13 +0800, Chao Li wrote:

Are we assuming that 

* if the settings come from command line options, then the user is
intentionally doing that, so we throw an error
* if the settings come from env, then the user might not be aware of
them, so we only issue a warning?

If that’s the case, I’m not fully convinced by this design. Since
initdb is a one-time operation, I think it would be better to require
everything to be explicit.

That would have been ideal a long time ago, but plain "initdb" without
locale options has succeeded for a long time, using information from
the environment. If we make that fail and require the user to specify
the options explicitly, I fear that would be too disruptive to the many
scripts around.

So we need to do something reasonable when the provider is builtin and
LC_CTYPE/LC_COLLATE from the environment are incompatible with UTF-8.
Forcing LC_CTYPE=C and/or LC_COLLATE=C:

* Only happens if:
  - the provider is builtin;
  - LC_CTYPE/LC_COLLATE come from the environment (i.e.
    --locale/--lc-ctype/--lc-collate are unspecified); and
  - LC_CTYPE/LC_COLLATE are incompatible with UTF-8.
* Has little practical effect because those settings aren't
  used in many places when the provider is builtin or ICU.

so I think a warning is acceptable there.
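
As an illustration only (not code from the patch), the conditions above can be sketched as a single predicate. The names provider_t, looks_utf8_compatible() and should_force_c() are hypothetical, and the name-based UTF-8 check is an assumed stand-in for the patch's utf8_compatible(), whose definition is not shown in this message:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical provider tags for this sketch; initdb uses COLLPROVIDER_*. */
typedef enum
{
	PROVIDER_LIBC,
	PROVIDER_BUILTIN,
	PROVIDER_ICU
} provider_t;

/* True if a codeset name such as "UTF-8" or "utf8" denotes UTF-8. */
static bool
codeset_is_utf8(const char *codeset)
{
	const char *want = "utf8";

	/* Compare case-insensitively, skipping '-' and stopping at "@modifier". */
	for (; *codeset != '\0' && *codeset != '@'; codeset++)
	{
		if (*codeset == '-')
			continue;
		if (*want == '\0' || tolower((unsigned char) *codeset) != *want)
			return false;
		want++;
	}
	return *want == '\0';
}

/*
 * Name-based stand-in for the patch's utf8_compatible() (an assumption:
 * the real function may consult the locale itself rather than its name).
 * "C" and "POSIX" are treated as compatible with any encoding.
 */
static bool
looks_utf8_compatible(const char *locale)
{
	const char *dot;

	if (strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0)
		return true;
	dot = strchr(locale, '.');
	return dot != NULL && codeset_is_utf8(dot + 1);
}

/* Should an LC_CTYPE or LC_COLLATE value be replaced by "C" (with a warning)? */
static bool
should_force_c(provider_t provider, const char *datlocale,
			   bool from_env, const char *lc_value)
{
	return provider == PROVIDER_BUILTIN &&
		strcmp(datlocale, "C") != 0 &&	/* builtin locale requires UTF-8 */
		from_env &&						/* not given on the command line */
		!looks_utf8_compatible(lc_value);
}
```

With this sketch, a builtin "C.UTF-8" cluster whose environment supplies LC_CTYPE=en_US.ISO-8859-1 would get the warning and LC_CTYPE=C, while the same value passed explicitly via --lc-ctype would not be overridden and would instead reach the existing encoding check.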

Regards,
Jeff Davis

#7Chao Li
li.evan.chao@gmail.com
In reply to: Jeff Davis (#6)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Oct 11, 2025, at 02:28, Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2025-10-10 at 12:13 +0800, Chao Li wrote:

Are we assuming that

* if the settings come from command line options, then the user is
intentionally doing that, so we throw an error
* if the settings come from env, then the user might not be aware of
them, so we only issue a warning?

If that’s the case, I’m not fully convinced by this design. Since
initdb is a one-time operation, I think it would be better to require
everything to be explicit.

That would have been ideal a long time ago, but plain "initdb" without
locale options has succeeded for a long time, using information from
the environment. If we make that fail and require the user to specify
the options explicitly, I fear that would be too disruptive to the many
scripts around.

So we need to do something reasonable when the provider is builtin and
LC_CTYPE/LC_COLLATE from the environment are incompatible with UTF-8.
Forcing LC_CTYPE=C and/or LC_COLLATE=C:

* Only happens if:
  - the provider is builtin;
  - LC_CTYPE/LC_COLLATE come from the environment (i.e.
    --locale/--lc-ctype/--lc-collate are unspecified); and
  - LC_CTYPE/LC_COLLATE are incompatible with UTF-8.
* Has little practical effect because those settings aren't
  used in many places when the provider is builtin or ICU.

so I think a warning is acceptable there.

Thanks for the explanation, that sounds reasonable. That said, my remaining arguments are:

* If we make that fail, I don’t think it would break existing scripts. The default provider is libc, and you are introducing a new environment variable to set the locale provider, so a plain initdb will not use the builtin provider. The provider could perhaps come from PG_TEST_INITDB_EXTRA_OPTS; I'm OK with test environments only issuing warnings.

* I am thinking out loud. The builtin provider is more performant but has certain limitations. Some production users may want to try the builtin provider for better performance without being aware of those limitations. Their environment contains the actual LC_CTYPE/LC_COLLATE they want to use, and they set the new environment variable to “builtin”. In this case, failing “initdb” would make the user clearly aware of the builtin provider's limitations. Otherwise, if the user also ignores the warning messages, the database would be created with an unexpected ctype, which could lead to losses (time, data, etc.).

If those are not the cases, then I am fine with the design.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#8Jeff Davis
pgsql@j-davis.com
In reply to: Chao Li (#7)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Sat, 2025-10-11 at 08:30 +0800, Chao Li wrote:

* If we make that fail, I don’t think that would break existing
scripts. Because the default provider is libc and you are introducing
a new environment variable to set locale provider, thus a plain
initdb will not use builtin provider. Maybe provider can come from
PG_TEST_INITDB_EXTRA_OPTS, I'm ok for test environment to only
issue warnings.

I would like it to be possible to change the initdb default in the
future to "builtin". See:

/messages/by-id/e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com

in that case, initdb should be able to succeed without other options.

* I am thinking loudly. Builtin provider is more performant but with
certain limitations. Some production users may want to try builtin
provider for better performance but not being aware of the
limitation. Their environment contains the actual LC_CTYPE/LC_COLLATE
they want to use, and they set the new environment variable with
“builtin” for provider. In this case, failing “initdb” would make the
user clearly realize the limitation of builtin provider. Otherwise,
if the user also ignores the warning messages, then the database
would be created with unexpected ctype, which would lead to loss
(time, data, etc.)

What limitation and/or loss are you concerned about?

Unless I'm mistaken, LC_CTYPE has very little practical effect when the
provider is builtin and the encoding is UTF-8.

The main effect that I'm aware of is that system errors from the OS
rely on LC_CTYPE for translation. Ordinary Postgres messages don't need
LC_CTYPE, so most of NLS still works even with LC_CTYPE=C; it's just
strerror() that depends on LC_CTYPE for the encoding.
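
To make the strerror() point concrete, here is a minimal sketch (not taken from PostgreSQL) that fetches an OS error description under a chosen pair of LC_MESSAGES and LC_CTYPE settings. The helper name os_error_text() is hypothetical, and LC_MESSAGES assumes a POSIX system:

```c
#include <errno.h>
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the OS description of "err" into buf after selecting the given
 * LC_MESSAGES and LC_CTYPE locales. Returns false if either locale is
 * not available on this system.
 */
static bool
os_error_text(const char *messages_locale, const char *ctype_locale,
			  int err, char *buf, size_t buflen)
{
	/* LC_MESSAGES picks the language of the message catalog */
	if (setlocale(LC_MESSAGES, messages_locale) == NULL)
		return false;
	/* LC_CTYPE determines the character encoding of the returned text */
	if (setlocale(LC_CTYPE, ctype_locale) == NULL)
		return false;

	snprintf(buf, buflen, "%s", strerror(err));
	return true;
}
```

On a system with a translated libc catalog installed, comparing os_error_text("de_DE.UTF-8", "de_DE.UTF-8", ENOENT, ...) with os_error_text("de_DE.UTF-8", "C", ENOENT, ...) is one way to observe the difference; the exact output is platform-dependent, so none is shown here.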

LC_CTYPE also affects full text search parsing, but I'm fixing that as
part of another patch to use the database locale instead.

I think contrib/fuzzystrmatch may be affected.

Callers of pg_strcasecmp() could be affected, but it's mostly used to
compare with ascii anyway.

If you are aware of other areas, please let me know.

Regards,
Jeff Davis

#9Chao Li
li.evan.chao@gmail.com
In reply to: Jeff Davis (#8)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Oct 11, 2025, at 10:06, Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2025-10-11 at 08:30 +0800, Chao Li wrote:

* If we make that fail, I don’t think that would break existing
scripts. Because the default provider is libc and you are introducing
a new environment variable to set locale provider, thus a plain
initdb will not use builtin provider. Maybe provider can come from
PG_TEST_INITDB_EXTRA_OPTS, I'm ok for test environment to only
issue warnings.

I would like it to be possible to change the initdb default in the
future to "builtin". See:

/messages/by-id/e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com

in that case, initdb should be able to succeed without other options.

Yes, if we decide to go down that path, then what I said would no longer be valid.

* I am thinking loudly. Builtin provider is more performant but with
certain limitations. Some production users may want to try builtin
provider for better performance but not being aware of the
limitation. Their environment contains the actual LC_CTYPE/LC_COLLATE
they want to use, and they set the new environment variable with
“builtin” for provider. In this case, failing “initdb” would make the
user clearly realize the limitation of builtin provider. Otherwise,
if the user also ignores the warning messages, then the database
would be created with unexpected ctype, which would lead to loss
(time, data, etc.)

What limitation and/or loss are you concerned about?

By the limitation of the builtin provider, I just meant that it supports fewer LC_CTYPE/LC_COLLATE settings than the other two providers.

I wasn’t concerned about anything in particular; I was just imagining whether anything could be negatively affected.

Unless I'm mistaken, LC_CTYPE has very little practical effect when the
provider is builtin and the encoding is UTF-8.

The main effect that I'm aware of is that system errors from the OS
rely on LC_CTYPE for translation. Ordinary Postgres messages don't need
LC_CTYPE, so most of NLS still works even with LC_CTYPE=C; it's just
strerror() that depends on LC_CTYPE for the encoding.

LC_CTYPE also affects full text search parsing, but I'm fixing that as
part of another patch to use the database locale instead.

I think contrib/fuzzystrmatch may be affected.

Callers of pg_strcasecmp() could be affected, but it's mostly used to
compare with ascii anyway.

If you are aware of other areas, please let me know.

Thanks for the explanation. I think I am good now. The latest v3 patch looks good to me.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#10Peter Eisentraut
peter@eisentraut.org
In reply to: Jeff Davis (#5)
Re: new environment variable INITDB_LOCALE_PROVIDER

On 10.10.25 20:09, Jeff Davis wrote:

On Fri, 2025-10-10 at 11:32 +0200, Peter Eisentraut wrote:

* Use environment variable name PG_LOCALE_PROVIDER, which seems more
consistent.

Is this not something that could already be done using
PG_TEST_INITDB_EXTRA_OPTS ?

1. PG_LOCALE_PROVIDER is a documented user-facing option, which will
make it easier for users to set their preferred provider in scripts,
etc.

2. This change also creates default locales for the builtin and ICU
providers, so that initdb without any other locale options will succeed
regardless of the provider.

I broke these up into two patches as v3 to make it easier to
understand.

These patches are independently useful, but also important if we ever
want to change the initdb default to builtin or ICU.

I'm skeptical that we want user-facing environment variables to provide
initdb defaults. The use for that hasn't really been explained.

For example, I don't recall anyone asking for an environment variable to
determine the checksum default.

If we did that, then we might end up with an environment variable per
option, which would be a lot.

The locale options are already complicated enough; adding more ways to
set them, with new ways for them to interact with other options, adds
a lot more complication.

I think in practice initdb is mostly run through packager-provided
infrastructure, so this facility would probably have very little impact
in practice.

#11Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#10)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Tue, 2025-10-14 at 21:51 +0200, Peter Eisentraut wrote:

I'm skeptical that we want user-facing environment variables to provide
initdb defaults.  The use for that hasn't really been explained.

One motivation was to make it smoother to change the initdb default
provider:

/messages/by-id/9b259f4c532943e428e9665122f37c099bab250e.camel@j-davis.com
/messages/by-id/e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com

if we were to make that change, then users might have existing scripts
and want to use the environment variable to switch it back to libc
without modifying the scripts.

If you think we can change the initdb default without introducing an
environment variable, then perhaps we don't need v3-0002. What do you
think about v3-0001?

Regards,
Jeff Davis

#12Peter Eisentraut
peter@eisentraut.org
In reply to: Jeff Davis (#11)
Re: new environment variable INITDB_LOCALE_PROVIDER

On 14.10.25 22:45, Jeff Davis wrote:

If you think we can change the initdb default without introducing an
environment variable, then perhaps we don't need v3-0002. What do you
think about v3-0001?

I think in general it could make sense to do something like

+#define DEFAULT_BUILTIN_LOCALE		"C.UTF-8"
+#define DEFAULT_ICU_LOCALE		"und"

But that patch adds a nontrivial amount of new logic and contains no
documentation, and some of the additional information is spread over
other threads, so I don't have the whole picture.

#13Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#12)
Re: new environment variable INITDB_LOCALE_PROVIDER

On Fri, 2025-10-17 at 17:04 +0200, Peter Eisentraut wrote:

But that patch adds a nontrivial amount of new logic and contains no
documentation, and some of the additional information is spread over
other threads, so I don't have the whole picture.

That logic is there because there are still ways that LC_CTYPE matters
even when the provider is builtin or ICU, which means we need to make
an effort to get the LC_CTYPE from the environment and deal with the
problem cases.

I'm trying to eliminate the need for LC_CTYPE at all when
datlocprovider<>'c', and then we can just set datcollate and datctype
to NULL. I think it's possible, and I'll keep working on that. Perhaps
you'd like to see more progress in that direction before we commit this
patch?

Regards,
Jeff Davis