Collation and primary keys

Started by Jeff Davis6 months ago8 messages

pgsql@j-davis.com

6 months ago

(The problem and possible solutions are not specific to primary keys,
but I'm focusing on PKs for the purposes of this email.)

Currently, users who don't make any explicit choice about collation end
up with primary key indexes that use a libc natural language collation.
This default is exactly wrong: many users run into PK index
inconsistencies, and few understand the risks. A recent conversation I
had reinforced this point: they didn't think about it much at the time
they created the databases, and then ended up with inconsistent PK
indexes.

I have made some attempt to find users who benefit from the default.
I've found a couple possible examples:

* The libc C.UTF-8 locale was a reasonable default (though not a
natural language collation). But now that we have C.UTF-8 available
from the builtin provider, then we should encourage that instead of
relying on the slower, platform-specific libc implementation.

* Our ICU implementation has some gaps around non-UTF8 encodings,
which might push users of those encodings to libc. I'm still not
convinced that such users would want libc collators for all of their
primary keys, though.

Aside from those examples, most users who understand their collation
needs well enough to make an informed choice don't want libc, and
probably want to be selective about which indexes use natural language
vs memcmp.

The problems with the status quo have been discussed at length:

* inconsistent PK indexes
* behavior depends on the specific version of the specific platform
(and can change with OS updates)
* index builds are much slower

Meanwhile, the benefits of the status quo are unclear:

* why would we expect that the user values text ordering enough
to accept the problems listed above?
* why would we expect to want natural-language sort for a PK?
* in a multi-lingual world, why would we expect any one locale
to be so special as to control the default comparator for all
text PKs?
* why would we expect the initdb-time environment to represent
the right locale?

Perhaps the initdb default specifically doesn't matter so much -- that
might be treated as a low-level tool called by packaged scripts. But
that's just moves the problem: someone needs to make that choice on
behalf of users who don't explicitly choose for themselves.

So I think we need to do something. That could be better guidance,
initdb default changes, technical changes, or some combination thereof.

Possibilities include:

1. Status quo.

2. Guide users towards the builtin collation provider, which uses
memcmp() for all text PKs, avoiding all of the problems. The downside
is that results coming from the index will not be sorted in a natural
language order, so 'Z' will sort before 'a'. If that's a problem, the
user can specify a COLLATE clause to their ORDER BY, or sort in their
application. (Possibly change the initdb default.)

3. Guide users towards the ICU "und" locale, which mitigates some of
the problems and provides a compromise natural language sort order that
may be better than memcmp() in a lot of locales. (Possibly change the
initdb default.)

4. Interrupt the chain of defaults somewhere such that PKs end up
defaulting to memcmp() even if the database collation is something
else. This requires some technical changes so that equality searches
still work with the indexes even if the collations don't match (so long
as they are both deterministic). It could cause performance regressions
for range scans on PK indexes. I brought something like this up in the
past[1]/messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com and there was a strong consensus that this was a bad idea, but
I'm listing it here anyway.

5. Transform PK values with strxfrm() or u_getSortKey() before indexing
them. While the behavior of the transformation function may change with
a new release of the provider, it seems less likely to cause a problem
for equality searches, and therefore carries a lower risk for PKs. The
downside is that the keys will be larger and there are still some
risks, including bugs in the implementation (which is not just a
theoretical concern).

Other ideas? Thoughts?

Regards,
Jeff Davis

[1]: /messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com
/messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com

Laurenz Albe

laurenz.albe@cybertec.at

6 months ago

In reply to: Jeff Davis (#1)

Re: Collation and primary keys

On Tue, 2025-07-15 at 17:34 -0700, Jeff Davis wrote:

Currently, users who don't make any explicit choice about collation end
up with primary key indexes that use a libc natural language collation.
This default is exactly wrong: [...]

So I think we need to do something. That could be better guidance,
initdb default changes, technical changes, or some combination thereof.

Possibilities include:

1. Status quo.

2. Guide users towards the builtin collation provider, which uses
memcmp() for all text PKs, avoiding all of the problems. The downside
is that results coming from the index will not be sorted in a natural
language order, so 'Z' will sort before 'a'. If that's a problem, the
user can specify a COLLATE clause to their ORDER BY, or sort in their
application. (Possibly change the initdb default.)

3. Guide users towards the ICU "und" locale, which mitigates some of
the problems and provides a compromise natural language sort order that
may be better than memcmp() in a lot of locales. (Possibly change the
initdb default.)

4. Interrupt the chain of defaults somewhere such that PKs end up
defaulting to memcmp() even if the database collation is something
else. This requires some technical changes so that equality searches
still work with the indexes even if the collations don't match (so long
as they are both deterministic). It could cause performance regressions
for range scans on PK indexes. I brought something like this up in the
past[1] and there was a strong consensus that this was a bad idea, but
I'm listing it here anyway.

5. Transform PK values with strxfrm() or u_getSortKey() before indexing
them. While the behavior of the transformation function may change with
a new release of the provider, it seems less likely to cause a problem
for equality searches, and therefore carries a lower risk for PKs. The
downside is that the keys will be larger and there are still some
risks, including bugs in the implementation (which is not just a
theoretical concern).

Other ideas? Thoughts?

#4 and #5 are appealing in that they attempt to magically do the right
thing without making any hard choices, but I have my doubts.
If somebody uses ALTER TABLE to add a primary key later (perhaps after
having made do with a unique index before) and the query semantics
change, that would not be great.

I have a radical proposal: Rather than having "initdb" default to
whatever locale is in the environment, make it default the the builtin
provider and the C collation. Wherever people need a natural language
collation, they can say so explicitly.

That would do nothing for existing installations, but it would get rid
of the problem for new clusters.

Not that I want to present Oracle as an example to follow in general,
but that's how they are doing it, and while I do hear complaints from
Oracle users, I have yet to hear a complaint about the default binary
collation.

Yours,
Laurenz Albe

Jeff Davis

pgsql@j-davis.com

6 months ago

In reply to: Laurenz Albe (#2)

Re: Collation and primary keys

On Wed, 2025-07-16 at 08:29 +0200, Laurenz Albe wrote:

I have a radical proposal: Rather than having "initdb" default to
whatever locale is in the environment, make it default the the
builtin
provider and the C collation. Wherever people need a natural
language
collation, they can say so explicitly.

You bring up a good sub-point, which is that there are actually three
builtin locales[1]https://www.postgresql.org/docs/devel/locale.html#LOCALE-PROVIDERS: C, C.UTF-8, and PG_UNICODE_FAST. All three have
exactly the same sorting and equality semantics (memcmp()), and
therefore any of them would solve the problems raised in this thread.

Not that I want to present Oracle as an example to follow in general,
but that's how they are doing it, and while I do hear complaints from
Oracle users, I have yet to hear a complaint about the default binary
collation.

My understanding was that, while it does binary sort order, it still
does Unicode-aware case mapping.

If so, that would be closer to the C.UTF-8 locale (Unicode Simple Case
Mapping) or the PG_UNICODE_FAST locale (Unicode Full Case Mapping,
which includes multi-character mappings like 'ß' to 'SS').

Note that the SQL standard seems to require Unicode Full Case Mapping.

Regards,
Jeff Davis

[1]: https://www.postgresql.org/docs/devel/locale.html#LOCALE-PROVIDERS

Laurenz Albe

laurenz.albe@cybertec.at

6 months ago

In reply to: Jeff Davis (#3)

Re: Collation and primary keys

On Wed, 2025-07-16 at 09:46 -0700, Jeff Davis wrote:

On Wed, 2025-07-16 at 08:29 +0200, Laurenz Albe wrote:

I have a radical proposal: Rather than having "initdb" default to
whatever locale is in the environment, make it default the the
builtin provider and the C collation. Wherever people need a natural
language collation, they can say so explicitly.

You bring up a good sub-point, which is that there are actually three
builtin locales[1]: C, C.UTF-8, and PG_UNICODE_FAST. All three have
exactly the same sorting and equality semantics (memcmp()), and
therefore any of them would solve the problems raised in this thread.

Not that I want to present Oracle as an example to follow in general,
but that's how they are doing it, and while I do hear complaints from
Oracle users, I have yet to hear a complaint about the default binary
collation.

My understanding was that, while it does binary sort order, it still
does Unicode-aware case mapping.

If so, that would be closer to the C.UTF-8 locale (Unicode Simple Case
Mapping) or the PG_UNICODE_FAST locale (Unicode Full Case Mapping,
which includes multi-character mappings like 'ß' to 'SS').

Note that the SQL standard seems to require Unicode Full Case Mapping.

I wasn't aware how Oracle handles case mapping, but it seems you
are right:

SQL> SELECT upper('ı'), upper('ä') FROM dual;

U UP
- --
I Ä

Perhaps then using one of the collations you mentioned would be the
best solution.

I'm still a little bit worried that changes in the case mapping might
break some indexes. We have a track record with going against the
standard when it comes to case conversion, so perhaps we wouldn't
spill too much milk if we only convert ASCII correctly.

But perhaps I am just being paranoid.

Yours,
Laurenz Albe

Jeff Davis

pgsql@j-davis.com

6 months ago

In reply to: Laurenz Albe (#4)

Re: Collation and primary keys

On Thu, 2025-07-17 at 08:30 +0200, Laurenz Albe wrote:

I wasn't aware how Oracle handles case mapping, but it seems you
are right:

How does it handle UPPER('ß')? If the result is 'ß', that means it's
similar to the builtin C.UTF-8. If the result is 'SS', that means it's
similar to PG_UNICODE_FAST.

I'm still a little bit worried that changes in the case mapping might
break some indexes. We have a track record with going against the
standard when it comes to case conversion, so perhaps we wouldn't
spill too much milk if we only convert ASCII correctly.

But perhaps I am just being paranoid.

That's a reasonable concern, and I don't mean to dismiss it. But I
believe that problem is two orders of magnitude smaller than the
problems we have with the status quo.

Regards,
Jeff Davis

Laurenz Albe

laurenz.albe@cybertec.at

6 months ago

In reply to: Jeff Davis (#5)

Re: Collation and primary keys

On Thu, 2025-07-17 at 11:59 -0700, Jeff Davis wrote:

On Thu, 2025-07-17 at 08:30 +0200, Laurenz Albe wrote:

I wasn't aware how Oracle handles case mapping, but it seems you
are right:

How does it handle UPPER('ß')? If the result is 'ß', that means it's
similar to the builtin C.UTF-8. If the result is 'SS', that means it's
similar to PG_UNICODE_FAST.

It returns 'ß'.

I'm still a little bit worried that changes in the case mapping might
break some indexes. We have a track record with going against the
standard when it comes to case conversion, so perhaps we wouldn't
spill too much milk if we only convert ASCII correctly.

But perhaps I am just being paranoid.

That's a reasonable concern, and I don't mean to dismiss it. But I
believe that problem is two orders of magnitude smaller than the
problems we have with the status quo.

I agree with that.

Yours,
Laurenz Albe

Daniel Verite

daniel@manitou-mail.org

6 months ago

In reply to: Jeff Davis (#1)

Re: Collation and primary keys

Jeff Davis wrote:

* The libc C.UTF-8 locale was a reasonable default (though not a
natural language collation). But now that we have C.UTF-8 available
from the builtin provider, then we should encourage that instead of
relying on the slower, platform-specific libc implementation.

Yes. In particular, we should encourage the ecosystem to support
the new collation features so that they're widely available to
end users.

Anecdotically I was looking this week at upgrading instances to
v17 with the builtin C.UTF-8 locale.

They happen to be hosted by RDS, and RDS appears not to offer
the builtin provider at instance creation (nor ICU, even though
initdb with ICU is possible since v15).
The templates being restrained to libc, the database creations
should then use template0, locale_provider=builtin,
builtin_locale=...
But in my case, database creations have to be done by TerraForm.
And as it turns out, TF's Postgres resource [1]https://registry.terraform.io/providers/cyrilgdn/postgresql/latest/docs/resources/postgresql_database also hasn't been
updated to allow for non-libc databases (only lc_collate and
lc_ctype can be set).

Conclusion for that upgrade: that will be v17 with libc collations :(

[1]: https://registry.terraform.io/providers/cyrilgdn/postgresql/latest/docs/resources/postgresql_database
https://registry.terraform.io/providers/cyrilgdn/postgresql/latest/docs/resources/postgresql_database

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

Jeff Davis

pgsql@j-davis.com

6 months ago

In reply to: Daniel Verite (#7)

1 attachment(s)

Re: Collation and primary keys

On Wed, 2025-07-23 at 13:53 +0200, Daniel Verite wrote:

* The libc C.UTF-8 locale was a reasonable default (though not a
natural language collation). But now that we have C.UTF-8 available
from the builtin provider, then we should encourage that instead of
relying on the slower, platform-specific libc implementation.

Yes. In particular, we should encourage the ecosystem to support
the new collation features so that they're widely available to
end users.

Then I propose that we change the initdb default to builtin C.UTF-8.
Patch attached.

To get the old initdb behavior use --locale-provider=libc, and all the
other defaults will work as before.

The change would not disrupt upgrades (see commit 9637badd9f).

One annoyance: if your environment has an LC_CTYPE with a non-UTF-8
locale, then initdb forces LC_CTYPE=C and emits a warning.

I had previously tried, and failed, to change the default to ICU for
v16, so it's worth mentioning why I don't believe this proposal will
run into the same problems:

* ICU, while better than libc, didn't completely solve any of the
problems. This proposal completely solves the inconsistent primary key
problem, and is much faster than libc or ICU.

* In the version 16 change, we were still attempting to map environment
variables to ICU locales, which was never going to work very well. In
particular, as you pointed out, ICU has nothing to approximate the
C.UTF-8 locale. The current proposal doesn't attempt that kind of
cleverness.

Comments?

Regards,
Jeff Davis

Attachments:

0001-initdb-default-to-builtin-C.UTF-8.patchtext/x-patch; charset=UTF-8; name=0001-initdb-default-to-builtin-C.UTF-8.patchDownload

From 8ba8f74d28a64bfb006a76fbec64638f55f3660c Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 17 Jul 2025 13:07:50 -0700
Subject: [PATCH] initdb: default to builtin C.UTF-8

Discussion: https://postgr.es/m/918773d0-886b-4ce0-8b74-ae23496ad3ef@manitou-mail.org
---
 src/backend/commands/dbcommands.c             |  2 +-
 src/bin/initdb/initdb.c                       | 50 ++++++++++++++-----
 src/bin/initdb/t/001_initdb.pl                |  6 ++-
 src/bin/scripts/t/020_createdb.pl             | 23 +++++----
 .../modules/test_escape/t/001_test_escape.pl  |  2 +-
 5 files changed, 56 insertions(+), 27 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 502a45163c8..92a396b8406 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1052,7 +1052,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
 		dbctype = src_ctype;
 	if (dblocprovider == '\0')
 		dblocprovider = src_locprovider;
-	if (dblocale == NULL)
+	if (dblocale == NULL && dblocprovider == src_locprovider)
 		dblocale = src_locale;
 	if (dbicurules == NULL)
 		dbicurules = src_icurules;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 62bbd08d9f6..ea2a3299737 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -82,6 +82,7 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 
+#define DEFAULT_BUILTIN_LOCALE		"C.UTF-8"
 
 /* Ideally this would be in a .h file, but it hardly seems worth the trouble */
 extern const char *select_default_timezone(const char *share_path);
@@ -144,7 +145,7 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
-static char locale_provider = COLLPROVIDER_LIBC;
+static char locale_provider = COLLPROVIDER_BUILTIN;
 static bool builtin_locale_specified = false;
 static char *datlocale = NULL;
 static bool icu_locale_specified = false;
@@ -2468,12 +2469,11 @@ setlocales(void)
 	lc_messages = canonname;
 #endif
 
-	if (locale_provider != COLLPROVIDER_LIBC && datlocale == NULL)
-		pg_fatal("locale must be specified if provider is %s",
-				 collprovider_name(locale_provider));
-
 	if (locale_provider == COLLPROVIDER_BUILTIN)
 	{
+		if (!datlocale)
+			datlocale = DEFAULT_BUILTIN_LOCALE;
+
 		if (strcmp(datlocale, "C") == 0)
 			canonname = "C";
 		else if (strcmp(datlocale, "C.UTF-8") == 0 ||
@@ -2491,6 +2491,9 @@ setlocales(void)
 	{
 		char	   *langtag;
 
+		if (!datlocale)
+			pg_fatal("locale must be specified if provider is icu");
+
 		/* canonicalize to a language tag */
 		langtag = icu_language_tag(datlocale);
 		printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
@@ -2686,6 +2689,29 @@ setup_locale_encoding(void)
 {
 	setlocales();
 
+	/*
+	 * For the builtin provider (other than the "C" locale), default encoding
+	 * to UTF-8. If lc_ctype is not compatible with UTF-8, also force lc_ctype
+	 * to "C". On windows, all locales are compatible with UTF-8.
+	 */
+	if (!encoding && locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
+	{
+#ifndef WIN32
+		int ctype_enc = pg_get_encoding_from_locale(lc_ctype, false);
+		if (!(ctype_enc == PG_UTF8 ||
+			  ctype_enc == PG_SQL_ASCII))
+		{
+			pg_log_warning("setting LC_CTYPE to \"C\"");
+			pg_log_warning_detail("Encoding of LC_CTYPE locale \"%s\" does not match encoding required by builtin locale \"%s\".",
+								  lc_ctype, datlocale);
+			pg_log_warning_hint("Specify a UTF-8 compatible locale with --lc-ctype, or choose a different locale provider.");
+			lc_ctype = "C";
+		}
+#endif
+		encoding = "UTF-8";
+	}
+
 	if (locale_provider == COLLPROVIDER_LIBC &&
 		strcmp(lc_ctype, lc_collate) == 0 &&
 		strcmp(lc_ctype, lc_time) == 0 &&
@@ -2721,10 +2747,11 @@ setup_locale_encoding(void)
 		ctype_enc = pg_get_encoding_from_locale(lc_ctype, true);
 
 		/*
-		 * If ctype_enc=SQL_ASCII, it's compatible with any encoding. ICU does
-		 * not support SQL_ASCII, so select UTF-8 instead.
+		 * If ctype_enc=SQL_ASCII, it's compatible with any encoding. Neither
+		 * ICU nor the builtin provider support SQL_ASCII, so select UTF-8
+		 * instead.
 		 */
-		if (locale_provider == COLLPROVIDER_ICU && ctype_enc == PG_SQL_ASCII)
+		if (locale_provider != COLLPROVIDER_LIBC && ctype_enc == PG_SQL_ASCII)
 			ctype_enc = PG_UTF8;
 
 		if (ctype_enc == -1)
@@ -2773,11 +2800,10 @@ setup_locale_encoding(void)
 		!check_locale_encoding(lc_collate, encodingid))
 		exit(1);				/* check_locale_encoding printed the error */
 
-	if (locale_provider == COLLPROVIDER_BUILTIN)
+	if (locale_provider == COLLPROVIDER_BUILTIN &&
+		strcmp(datlocale, "C") != 0)
 	{
-		if ((strcmp(datlocale, "C.UTF-8") == 0 ||
-			 strcmp(datlocale, "PG_UNICODE_FAST") == 0) &&
-			encodingid != PG_UTF8)
+		if(encodingid != PG_UTF8)
 			pg_fatal("builtin provider locale \"%s\" requires encoding \"%s\"",
 					 datlocale, "UTF-8");
 	}
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index b7ef7ed8d06..b73e5054108 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -200,13 +200,15 @@ else
 		'locale provider ICU fails since no ICU support');
 }
 
-command_fails(
+command_like(
 	[
 		'initdb', '--no-sync',
+		'--auth' => 'trust',
 		'--locale-provider' => 'builtin',
 		"$tempdir/data6"
 	],
-	'locale provider builtin fails without --locale');
+	qr/^\s+default collation:\s+C.UTF-8\n/ms,
+	'locale provider builtin defaults to C.UTF-8');
 
 command_ok(
 	[
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index a8293390ede..4112acedb17 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -130,15 +130,6 @@ else
 		'create database with ICU fails since no ICU support');
 }
 
-$node->command_fails(
-	[
-		'createdb',
-		'--template' => 'template0',
-		'--locale-provider' => 'builtin',
-		'tbuiltin1',
-	],
-	'create database with provider "builtin" fails without --locale');
-
 $node->command_ok(
 	[
 		'createdb',
@@ -223,11 +214,21 @@ $node->command_fails(
 	[
 		'createdb',
 		'--template' => 'template1',
-		'--locale-provider' => 'builtin',
+		'--locale-provider' => 'icu',
+		'--locale' => 'C',
+		'tbuiltin9',
+	],
+	'create database with provider "icu" not matching template');
+
+$node->command_fails(
+	[
+		'createdb',
+		'--template' => 'template1',
+		'--locale-provider' => 'libc',
 		'--locale' => 'C',
 		'tbuiltin9',
 	],
-	'create database with provider "builtin" not matching template');
+	'create database with provider "libc" not matching template');
 
 $node->command_fails([ 'createdb', 'foobar1' ],
 	'fails if database already exists');
diff --git a/src/test/modules/test_escape/t/001_test_escape.pl b/src/test/modules/test_escape/t/001_test_escape.pl
index 0d5aec3ed74..1de3224a04c 100644
--- a/src/test/modules/test_escape/t/001_test_escape.pl
+++ b/src/test/modules/test_escape/t/001_test_escape.pl
@@ -12,7 +12,7 @@ $node->init();
 $node->start();
 
 $node->safe_psql('postgres',
-	q(CREATE DATABASE db_sql_ascii ENCODING "sql_ascii" TEMPLATE template0;));
+	q(CREATE DATABASE db_sql_ascii LOCALE "C" ENCODING "sql_ascii" TEMPLATE template0;));
 
 my $cmd =
   [ 'test_escape', '--conninfo', $node->connstr . " dbname=db_sql_ascii" ];
-- 
2.43.0