Change initdb default to the builtin collation provider

Started by Jeff Davis3 months ago5 messages
#1Jeff Davis
pgsql@j-davis.com

-------
Summary
-------

The libc collation provider is a bad default[1]/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com. The builtin collation
provider is a good default, so let's use that.

----------
Motivation
----------

The initdb default is what we choose for new postgres instances when
we have little information about the user's collation needs. The
default has no effect on upgrades, which always use the previous
instance's locale settings.

There are a number of factors to consider when choosing a default:

* Risk of primary key inconsistencies due to libc or ICU updates
* Performance
* Quality of query semantics for a variety of scripts, languages and
regions (excluding the final result order)
* Final result ordering / display

In the absence of specific user requirements, these factors weigh
heavily in favor of the builtin collation provider, and heavily
against libc.

With the builtin provider, there's no risk of primary key or plain
index inconsistencies, the performance is great (ordering with
memcmp()), and the query semantics are based on Unicode.

-------------------
Why does it matter?
-------------------

Arguably, we could just not worry and let various service providers,
tools, scripts, packages, and wrappers make the choice independently.

But that just moves the problem -- someone still needs to make that
choice. Collectively, we have built up some knowledge about collation
here on -hackers, and it would be good to offer it as guidance. And an
initdb default is a good way to offer that guidance.

Unifying around one default also creates a more consistent, tested,
and documented Postgres experience that benefits hackers and users
alike.

--------------------------------
What's the catch? Display order.
--------------------------------

The builtin provider uses code point order, i.e. memcmp(), so the
final result display order is less human-friendly. For instance, 'Z'
comes before 'a'.

That problem is annoying, but *much* easier to fix than the other
factors. The user might add a COLLATE clause to the final ORDER BY, or
perform the sort in the application layer or presentation layer.

Other providers offer a better final display order, but it comes at a
heavy price: index inconsistencies and poor performance. Those
problems are hard to address in an existing system. Some users may be
willing to pay that price, but it should be opt-in.

Furthermore, in the default case, we don't even really know which
language and region to use. We infer it from the environment variable
LC_COLLATE at initdb time, but that's a weak signal: there's little
reason to think that the OS admin, DBA, and end user are all in the
same locale.

In general, there's little reason to think that a single locale for
display order is enough for a whole database. Often, databases are
used (directly or indirectly) by people from dozens of locales. When
per-locale display order becomes an issue, it will be necessary to add
COLLATE clauses or application logic to tailor to the end user
regardless, so the database default locale won't be useful.

For all of these reasons, display order is the wrong thing to optimize
for when the user doesn't specify anything. We should prioritize the
other factors, and for those other factors, the builtin provider is
the best.

------------
Why not ICU?
------------

ICU is better than libc in a lot of ways:

* Better performance
* Platform-independent
* Easier to manage it as a separate library

But fundamentally, I don't think it's a great default, because it
favors final result display order at the risk of primary key
inconsistencies.

------------------
Other Alternatives
------------------

In a previous thread[1]/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com, I laid out some alternatives. If someone
disagrees with this proposal, please choose one of those or suggest a
new one.

The most interesting alternative, in my opinion, is #4, but that was
soundly rejected.

---------------------
Which builtin locale?
---------------------

All builtin locales use the exact same ordering: they sort by code
point. Code point order is stable, so primary keys and plain indexes
remain consistent across upgrades forever.

The difference is in case conversion and character classification
semantics:

1. C: Only basic ASCII semantics which never change.

2. PG_C_UTF8: Provides "simple" Unicode semantics. In spirit, this
is similar to the libc "C.UTF-8" locale available on Linux. It's
also similar to the default semantics of at least one big
commercial database, making migrations easier.

3. PG_UNICODE_FAST: Provides "full" Unicode semantics. It's more
aligned with the SQL standard, which specifies in an example the
uppercase of 'ß' is 'SS'.

For the latter two locales, expression and partial indexes depending
on these semantics may be subject to inconsistencies after a Unicode
update.

I propose changing the default to PG_C_UTF8 because it seems simple
and practical. However, I'm also fine with PG_UNICODE_FAST if those
affected by the "full" case mapping find it helpful. "C" is also a
possibility, but the query semantics suffer. All are better than libc.

-------
Details
-------

The mechanics of the default itself are being worked out here[2]/messages/by-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com. The
concrete proposal here is to commit those patches, and then
change DEFAULT_LOCALE_PROVIDER to be COLLPROVIDER_BUILTIN and
DEFAULT_BUILTIN_LOCALE to whatever we choose here.

Note: the builtin provider requires UTF-8, which can potentially
conflict with the LC_CTYPE. Fortunately, when the builtin provider is
being used, LC_CTPE has little effect. To further reduce the
consequences of LC_CTYPE when using the builtin provider, another
patch[3]/messages/by-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com fixes tsearch to parse based on the database default locale
rather than depending on LC_CTYPE.

Comments welcome.

Regards,
Jeff Davis

[1]: /messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
[2]: /messages/by-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
/messages/by-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
[3]: /messages/by-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
/messages/by-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com

#2Peter Eisentraut
peter@eisentraut.org
In reply to: Jeff Davis (#1)
Re: Change initdb default to the builtin collation provider

On 11.10.25 02:48, Jeff Davis wrote:

The builtin provider uses code point order, i.e. memcmp(), so the
final result display order is less human-friendly. For instance, 'Z'
comes before 'a'.

That problem is annoying, but*much* easier to fix than the other
factors. The user might add a COLLATE clause to the final ORDER BY, or
perform the sort in the application layer or presentation layer.

I remain violently opposed to this idea. I don't understand how it
could be acceptable to just not provide a good display order by default
and have everyone rewrite their queries.

ICU is better than libc in a lot of ways:

* Better performance
* Platform-independent
* Easier to manage it as a separate library

But fundamentally, I don't think it's a great default, because it
favors final result display order at the risk of primary key
inconsistencies.

I don't understand. We have a versioning system for ICU collations?
Does it not work?

#3Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#2)
Re: Change initdb default to the builtin collation provider

On Fri, 2025-10-17 at 17:23 +0200, Peter Eisentraut wrote:

I remain violently opposed to this idea.  I don't understand how it
could be acceptable to just not provide a good display order by
default
and have everyone rewrite their queries.

I assume that you favor alternative 3 listed here[1]/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com, which is to use
ICU "und" as the default. Is that correct? Or do you prefer to get the
locale from the environment at initdb time?

One thing you may not have considered is that if the provider is
builtin, a lot more users are likely to learn about and use ICU,
because they will see an unfriendly display order and try to figure out
why. Then they'll be more prepared for upgrades and more likely to see
and respond to a version mismatch.

I don't understand.  We have a versioning system for ICU collations?
Does it not work?

I have 27 versions of ICU installed by compiling them from source, and
I compile Postgres in my sleep, so it's fine for me.

But for the default user, who's never really considered collation until
after they are already in trouble, having inconsistent primary keys all
over the place is not a great experience. ICU is certainly better than
libc, but I still think people should approach it with non-zero
knowledge.

Regards,
Jeff Davis

[1]: /messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com

#4Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#3)
Re: Change initdb default to the builtin collation provider

On Fri, 2025-10-17 at 15:02 -0700, Jeff Davis wrote:

On Fri, 2025-10-17 at 17:23 +0200, Peter Eisentraut wrote:

I remain violently opposed to this idea.  I don't understand how it
could be acceptable to just not provide a good display order by
default
and have everyone rewrite their queries.

I assume that you favor alternative 3 listed here[1], which is to use
ICU "und" as the default. Is that correct? Or do you prefer to get
the
locale from the environment at initdb time?

Right now we're still stuck with the worst possible default: libc. Can
you make a more concrete counter-proposal here that sorts through some
of the details?

* Should we base the ICU locale on the environment, or just default
everyone to the "und" locale?

* If ICU support is disabled, how does that affect the defaults?

* If using the environment, what happens if the locale is not supported
by ICU (in particular "C" or "C.UTF-8")?

* What would be the default encoding, or should that come from the
environment?

* The ICU provider has some weaknesses around non-UTF8 encodings
because of casts from wchar_t and the use of tolower() in
downcase_identifier(). Are those potential blockers, and if so, are
they fixable?

* Can we try harder to find an acceptable way to use memcmp() for the
indexes by default, at least primary keys, even if the database
collation is ICU? I know that I've argued for this in the past and it's
been soundly rejected[1]/messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com, but some variation on this idea could be
worthy of consideration.

Regards,
Jeff Davis

[1]: /messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com
/messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com

#5Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
2 attachment(s)
Re: Change initdb default to the builtin collation provider

On Fri, 2025-10-10 at 17:48 -0700, Jeff Davis wrote:

-------
Summary
-------

The libc collation provider is a bad default[1]. The builtin
collation
provider is a good default, so let's use that.

The attached patches implement a more modest proposal which does not
conflict with Peter's objection about the display order:

0001: If the encoding is unspecified, and cannot be determined from the
locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII.

0002: If the provider is unspecified, and the locale is C or C.UTF-8,
then use the builtin provider.

Motivation:

* UTF-8 seems safer than SQL_ASCII when the locale is compatible with
either.

* Whether the "C" locale uses the builtin provider or the libc provider
is mostly about the catalog representation, because the implementation
is the same. I don't have a strong motivation for this change, it just
clarifies that libc is not actually being used when the locale is "C".

* I think most users of the "C.UTF-8" locale would be better off with
the builtin provider, which benefits from important optimizations.

Note:

This would mean that "initdb --no-locale" would select UTF-8 and the
builtin provider with locale "C", whereas previously it would have
selected SQL_ASCII and the libc provider (though it didn't ever really
use libc internally). I'm not sure if others want this behavior or if
it would be surprising.

Regards,
Jeff Davis

Attachments:

v1-0001-initdb-prefer-UTF-8-encoding-over-SQL_ASCII.patchtext/x-patch; charset=UTF-8; name=v1-0001-initdb-prefer-UTF-8-encoding-over-SQL_ASCII.patchDownload
From 9c8cf58c541462a6aef43fed0ddea1e9f1633960 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 31 Oct 2025 13:36:46 -0700
Subject: [PATCH v1 1/2] initdb: prefer UTF-8 encoding over SQL_ASCII.

This was already true for the ICU locale provider, make it true for
the others.
---
 src/bin/initdb/initdb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f7..aa7fc5a6636 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2718,10 +2718,10 @@ setup_locale_encoding(void)
 		ctype_enc = pg_get_encoding_from_locale(lc_ctype, true);
 
 		/*
-		 * If ctype_enc=SQL_ASCII, it's compatible with any encoding. ICU does
-		 * not support SQL_ASCII, so select UTF-8 instead.
+		 * If ctype_enc=SQL_ASCII, it's compatible with any encoding. Prefer
+		 * UTF-8.
 		 */
-		if (locale_provider == COLLPROVIDER_ICU && ctype_enc == PG_SQL_ASCII)
+		if (ctype_enc == PG_SQL_ASCII)
 			ctype_enc = PG_UTF8;
 
 		if (ctype_enc == -1)
-- 
2.43.0

v1-0002-initdb-if-locale-is-C-or-C.UTF-8-use-builtin-prov.patchtext/x-patch; charset=UTF-8; name=v1-0002-initdb-if-locale-is-C-or-C.UTF-8-use-builtin-prov.patchDownload
From 8b1659fab50396eaeacab042aeaef8df241af467 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 31 Oct 2025 14:05:10 -0700
Subject: [PATCH v1 2/2] initdb: if locale is C or C.UTF-8, use builtin
 provider.

If the provider is unspecified, use the builtin provider C or
C.UTF-8. If the provider is specified, then do not override it.

The C locale has always been, effectively, the builtin provider, in
the sense that it uses built-in logic rather than strcoll(), etc. The
change here is mostly about the catalog representation.

The C.UTF-8 locale has used libc, but by doing so, collation doesn't
benefit from important performance optimizations. Now that we have a
builtin "C.UTF-8" collation which does benefit from those
optimizations, use that.
---
 src/bin/initdb/initdb.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index aa7fc5a6636..84931f145f4 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -145,6 +145,7 @@ static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
 static char locale_provider = COLLPROVIDER_LIBC;
+static bool locale_provider_specified = false;
 static bool builtin_locale_specified = false;
 static char *datlocale = NULL;
 static bool icu_locale_specified = false;
@@ -2465,6 +2466,28 @@ setlocales(void)
 	lc_messages = canonname;
 #endif
 
+	/*
+	 * If the locale is C or C.UTF-8, and no provider was specified, use the
+	 * builtin provider rather than libc.
+	 */
+	if (!locale_provider_specified && locale_provider == COLLPROVIDER_LIBC)
+	{
+		if (strcmp(lc_ctype, lc_collate) == 0)
+		{
+			if (strcmp(lc_ctype, "C") == 0)
+			{
+				locale_provider = COLLPROVIDER_BUILTIN;
+				datlocale = "C";
+			}
+			else if (strcmp(lc_ctype, "C.UTF-8") == 0 ||
+					 strcmp(lc_ctype, "C.UTF8") == 0)
+			{
+				locale_provider = COLLPROVIDER_BUILTIN;
+				datlocale = "C.UTF-8";
+			}
+		}
+	}
+
 	if (locale_provider != COLLPROVIDER_LIBC && datlocale == NULL)
 		pg_fatal("locale must be specified if provider is %s",
 				 collprovider_name(locale_provider));
@@ -3362,6 +3385,8 @@ main(int argc, char *argv[])
 										 "-c debug_discard_caches=1");
 				break;
 			case 15:
+				locale_provider_specified = true;
+
 				if (strcmp(optarg, "builtin") == 0)
 					locale_provider = COLLPROVIDER_BUILTIN;
 				else if (strcmp(optarg, "icu") == 0)
-- 
2.43.0