Change initdb default to the builtin collation provider

Started by Jeff Davis6 months ago24 messageshackers
Jump to latest
#1Jeff Davis
pgsql@j-davis.com

-------
Summary
-------

The libc collation provider is a bad default[1]/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com. The builtin collation
provider is a good default, so let's use that.

----------
Motivation
----------

The initdb default is what we choose for new postgres instances when
we have little information about the user's collation needs. The
default has no effect on upgrades, which always use the previous
instance's locale settings.

There are a number of factors to consider when choosing a default:

* Risk of primary key inconsistencies due to libc or ICU updates
* Performance
* Quality of query semantics for a variety of scripts, languages and
regions (excluding the final result order)
* Final result ordering / display

In the absence of specific user requirements, these factors weigh
heavily in favor of the builtin collation provider, and heavily
against libc.

With the builtin provider, there's no risk of primary key or plain
index inconsistencies, the performance is great (ordering with
memcmp()), and the query semantics are based on Unicode.

-------------------
Why does it matter?
-------------------

Arguably, we could just not worry and let various service providers,
tools, scripts, packages, and wrappers make the choice independently.

But that just moves the problem -- someone still needs to make that
choice. Collectively, we have built up some knowledge about collation
here on -hackers, and it would be good to offer it as guidance. And an
initdb default is a good way to offer that guidance.

Unifying around one default also creates a more consistent, tested,
and documented Postgres experience that benefits hackers and users
alike.

--------------------------------
What's the catch? Display order.
--------------------------------

The builtin provider uses code point order, i.e. memcmp(), so the
final result display order is less human-friendly. For instance, 'Z'
comes before 'a'.

That problem is annoying, but *much* easier to fix than the other
factors. The user might add a COLLATE clause to the final ORDER BY, or
perform the sort in the application layer or presentation layer.

Other providers offer a better final display order, but it comes at a
heavy price: index inconsistencies and poor performance. Those
problems are hard to address in an existing system. Some users may be
willing to pay that price, but it should be opt-in.

Furthermore, in the default case, we don't even really know which
language and region to use. We infer it from the environment variable
LC_COLLATE at initdb time, but that's a weak signal: there's little
reason to think that the OS admin, DBA, and end user are all in the
same locale.

In general, there's little reason to think that a single locale for
display order is enough for a whole database. Often, databases are
used (directly or indirectly) by people from dozens of locales. When
per-locale display order becomes an issue, it will be necessary to add
COLLATE clauses or application logic to tailor to the end user
regardless, so the database default locale won't be useful.

For all of these reasons, display order is the wrong thing to optimize
for when the user doesn't specify anything. We should prioritize the
other factors, and for those other factors, the builtin provider is
the best.

------------
Why not ICU?
------------

ICU is better than libc in a lot of ways:

* Better performance
* Platform-independent
* Easier to manage it as a separate library

But fundamentally, I don't think it's a great default, because it
favors final result display order at the risk of primary key
inconsistencies.

------------------
Other Alternatives
------------------

In a previous thread[1]/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com, I laid out some alternatives. If someone
disagrees with this proposal, please choose one of those or suggest a
new one.

The most interesting alternative, in my opinion, is #4, but that was
soundly rejected.

---------------------
Which builtin locale?
---------------------

All builtin locales use the exact same ordering: they sort by code
point. Code point order is stable, so primary keys and plain indexes
remain consistent across upgrades forever.

The difference is in case conversion and character classification
semantics:

1. C: Only basic ASCII semantics which never change.

2. PG_C_UTF8: Provides "simple" Unicode semantics. In spirit, this
is similar to the libc "C.UTF-8" locale available on Linux. It's
also similar to the default semantics of at least one big
commercial database, making migrations easier.

3. PG_UNICODE_FAST: Provides "full" Unicode semantics. It's more
aligned with the SQL standard, which specifies in an example the
uppercase of 'ß' is 'SS'.

For the latter two locales, expression and partial indexes depending
on these semantics may be subject to inconsistencies after a Unicode
update.

I propose changing the default to PG_C_UTF8 because it seems simple
and practical. However, I'm also fine with PG_UNICODE_FAST if those
affected by the "full" case mapping find it helpful. "C" is also a
possibility, but the query semantics suffer. All are better than libc.

-------
Details
-------

The mechanics of the default itself are being worked out here[2]/messages/by-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com. The
concrete proposal here is to commit those patches, and then
change DEFAULT_LOCALE_PROVIDER to be COLLPROVIDER_BUILTIN and
DEFAULT_BUILTIN_LOCALE to whatever we choose here.

Note: the builtin provider requires UTF-8, which can potentially
conflict with the LC_CTYPE. Fortunately, when the builtin provider is
being used, LC_CTPE has little effect. To further reduce the
consequences of LC_CTYPE when using the builtin provider, another
patch[3]/messages/by-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com fixes tsearch to parse based on the database default locale
rather than depending on LC_CTYPE.

Comments welcome.

Regards,
Jeff Davis

[1]: /messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
[2]: /messages/by-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
/messages/by-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
[3]: /messages/by-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
/messages/by-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com

#2Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Davis (#1)
Re: Change initdb default to the builtin collation provider

On 11.10.25 02:48, Jeff Davis wrote:

The builtin provider uses code point order, i.e. memcmp(), so the
final result display order is less human-friendly. For instance, 'Z'
comes before 'a'.

That problem is annoying, but*much* easier to fix than the other
factors. The user might add a COLLATE clause to the final ORDER BY, or
perform the sort in the application layer or presentation layer.

I remain violently opposed to this idea. I don't understand how it
could be acceptable to just not provide a good display order by default
and have everyone rewrite their queries.

ICU is better than libc in a lot of ways:

* Better performance
* Platform-independent
* Easier to manage it as a separate library

But fundamentally, I don't think it's a great default, because it
favors final result display order at the risk of primary key
inconsistencies.

I don't understand. We have a versioning system for ICU collations?
Does it not work?

#3Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#2)
Re: Change initdb default to the builtin collation provider

On Fri, 2025-10-17 at 17:23 +0200, Peter Eisentraut wrote:

I remain violently opposed to this idea.  I don't understand how it
could be acceptable to just not provide a good display order by
default
and have everyone rewrite their queries.

I assume that you favor alternative 3 listed here[1]/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com, which is to use
ICU "und" as the default. Is that correct? Or do you prefer to get the
locale from the environment at initdb time?

One thing you may not have considered is that if the provider is
builtin, a lot more users are likely to learn about and use ICU,
because they will see an unfriendly display order and try to figure out
why. Then they'll be more prepared for upgrades and more likely to see
and respond to a version mismatch.

I don't understand.  We have a versioning system for ICU collations?
Does it not work?

I have 27 versions of ICU installed by compiling them from source, and
I compile Postgres in my sleep, so it's fine for me.

But for the default user, who's never really considered collation until
after they are already in trouble, having inconsistent primary keys all
over the place is not a great experience. ICU is certainly better than
libc, but I still think people should approach it with non-zero
knowledge.

Regards,
Jeff Davis

[1]: /messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
/messages/by-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com

#4Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#3)
Re: Change initdb default to the builtin collation provider

On Fri, 2025-10-17 at 15:02 -0700, Jeff Davis wrote:

On Fri, 2025-10-17 at 17:23 +0200, Peter Eisentraut wrote:

I remain violently opposed to this idea.  I don't understand how it
could be acceptable to just not provide a good display order by
default
and have everyone rewrite their queries.

I assume that you favor alternative 3 listed here[1], which is to use
ICU "und" as the default. Is that correct? Or do you prefer to get
the
locale from the environment at initdb time?

Right now we're still stuck with the worst possible default: libc. Can
you make a more concrete counter-proposal here that sorts through some
of the details?

* Should we base the ICU locale on the environment, or just default
everyone to the "und" locale?

* If ICU support is disabled, how does that affect the defaults?

* If using the environment, what happens if the locale is not supported
by ICU (in particular "C" or "C.UTF-8")?

* What would be the default encoding, or should that come from the
environment?

* The ICU provider has some weaknesses around non-UTF8 encodings
because of casts from wchar_t and the use of tolower() in
downcase_identifier(). Are those potential blockers, and if so, are
they fixable?

* Can we try harder to find an acceptable way to use memcmp() for the
indexes by default, at least primary keys, even if the database
collation is ICU? I know that I've argued for this in the past and it's
been soundly rejected[1]/messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com, but some variation on this idea could be
worthy of consideration.

Regards,
Jeff Davis

[1]: /messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com
/messages/by-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com

#5Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#1)
Re: Change initdb default to the builtin collation provider

On Fri, 2025-10-10 at 17:48 -0700, Jeff Davis wrote:

-------
Summary
-------

The libc collation provider is a bad default[1]. The builtin
collation
provider is a good default, so let's use that.

The attached patches implement a more modest proposal which does not
conflict with Peter's objection about the display order:

0001: If the encoding is unspecified, and cannot be determined from the
locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII.

0002: If the provider is unspecified, and the locale is C or C.UTF-8,
then use the builtin provider.

Motivation:

* UTF-8 seems safer than SQL_ASCII when the locale is compatible with
either.

* Whether the "C" locale uses the builtin provider or the libc provider
is mostly about the catalog representation, because the implementation
is the same. I don't have a strong motivation for this change, it just
clarifies that libc is not actually being used when the locale is "C".

* I think most users of the "C.UTF-8" locale would be better off with
the builtin provider, which benefits from important optimizations.

Note:

This would mean that "initdb --no-locale" would select UTF-8 and the
builtin provider with locale "C", whereas previously it would have
selected SQL_ASCII and the libc provider (though it didn't ever really
use libc internally). I'm not sure if others want this behavior or if
it would be surprising.

Regards,
Jeff Davis

Attachments:

v1-0001-initdb-prefer-UTF-8-encoding-over-SQL_ASCII.patchtext/x-patch; charset=UTF-8; name=v1-0001-initdb-prefer-UTF-8-encoding-over-SQL_ASCII.patchDownload+3-4
v1-0002-initdb-if-locale-is-C-or-C.UTF-8-use-builtin-prov.patchtext/x-patch; charset=UTF-8; name=v1-0002-initdb-if-locale-is-C-or-C.UTF-8-use-builtin-prov.patchDownload+25-1
#6Anders Åstrand
anders@449.se
In reply to: Jeff Davis (#1)
Re: Change initdb default to the builtin collation provider

The libc collation provider is a bad default[1]. The builtin collation
provider is a good default, so let's use that.

Agreed! I've been in so many situations where a libc collation being the
default has caused problems down the line, but never in a situation where it
being default has been helpful.

In the absence of specific user requirements, these factors weigh
heavily in favor of the builtin collation provider, and heavily
against libc.

Even worse is that the current default uses whatever was set in the
environment of the session that invokes initdb. This is very unlikely to be
the default anyone wants, especially since these environment variables
follows through ssh on debian bases systems by default.

Me having sv_SE set on my local computer doesn't make it likely to be a
reasonable default locale if I ssh to a server to run initdb.

The builtin provider uses code point order, i.e. memcmp(), so the
final result display order is less human-friendly. For instance, 'Z'
comes before 'a'.

That problem is annoying, but *much* easier to fix than the other
factors. The user might add a COLLATE clause to the final ORDER BY, or
perform the sort in the application layer or presentation layer.

I'd say that this would be a _good_ feature of choosing a generic unicode
collation by default. It's immediately obvious that you need to do something
if you want ordering according to some specific language's rules.

Furthermore, in the default case, we don't even really know which
language and region to use. We infer it from the environment variable
LC_COLLATE at initdb time, but that's a weak signal: there's little
reason to think that the OS admin, DBA, and end user are all in the
same locale.

If I'm a Turkish person working for a German company and my environment
variables happens to specify tr_TR when I run initdb I have not made a
conscious choice and it may take years before someone reports an issue with
Ö being sorted after O instead of at the end of the alphabet, at which point
recifying the situation can be unnecessarily tricky.

I propose changing the default to PG_C_UTF8 because it seems simple
and practical. However, I'm also fine with PG_UNICODE_FAST if those
affected by the "full" case mapping find it helpful. "C" is also a
possibility, but the query semantics suffer. All are better than libc.

These are great options for a default for initdb, since we don't have any
knowledge of which language specific collation might be appropriate. Maybe we
should also document that it's recommended to set locale when running
CREATE DATABASE unless the builtin semantics are fine?

--
Anders Åstrand
Percona

#7Jeff Davis
pgsql@j-davis.com
In reply to: Anders Åstrand (#6)
Re: Change initdb default to the builtin collation provider

On Tue, 2026-03-10 at 09:37 +0100, Anders Åstrand wrote:

I'd say that this would be a _good_ feature of choosing a generic
unicode
collation by default. It's immediately obvious that you need to do
something
if you want ordering according to some specific language's rules.

That's a good point: if the default is not subtle, then the user is
likely to notice quickly, and consider it more detail what they
actually want. A default that is subtly wrong might go unnoticed until
it's too late to change.

Regards,
Jeff Davis

#8Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#5)
Re: Change initdb default to the builtin collation provider

On Fri, Oct 31, 2025 at 5:30 PM Jeff Davis <pgsql@j-davis.com> wrote:

The attached patches implement a more modest proposal which does not
conflict with Peter's objection about the display order:

0001: If the encoding is unspecified, and cannot be determined from the
locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII.

I don't know if this is exactly the right proposal, but I think it's
probably appropriate to start gently pushing people towards UTF-8
rather than anything else. Unicode has largely won, AFAICT, and the
use cases for anything else are increasingly narrow. I don't think we
should try to be coercive, but there's a reasonable presumption that
people who haven't said what they want probably want UTF8.

0002: If the provider is unspecified, and the locale is C or C.UTF-8,
then use the builtin provider.

I'm much less convinced about this idea. I think the number of people
who will be unhappy about the less-user-friendly sort order changes is
probably quite high. It's reasonable to want something more stable and
better version-controlled than libc, but switching to a simple
code-point sort seems like a high price to pay for that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Daniel Gustafsson
daniel@yesql.se
In reply to: Robert Haas (#8)
Re: Change initdb default to the builtin collation provider

On 10 Mar 2026, at 16:12, Robert Haas <robertmhaas@gmail.com> wrote:

..there's a reasonable presumption that people who haven't said what they want
probably want UTF8.

+1

--
Daniel Gustafsson

#10Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Jeff Davis (#5)
Re: Change initdb default to the builtin collation provider

On Fri, 2025-10-31 at 14:30 -0700, Jeff Davis wrote:

On Fri, 2025-10-10 at 17:48 -0700, Jeff Davis wrote:

-------
Summary
-------

The libc collation provider is a bad default[1]. The builtin
collation
provider is a good default, so let's use that.

The attached patches implement a more modest proposal which does not
conflict with Peter's objection about the display order:

0001: If the encoding is unspecified, and cannot be determined from the
locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII.

0002: If the provider is unspecified, and the locale is C or C.UTF-8,
then use the builtin provider.

I think that would be an improvement, but I am still much more in
favor of your original proposal to use the C collation by default.

Peter objected:

I don't understand how it could be acceptable to just not provide
a good display order by default and have everyone rewrite their queries.

I consider it acceptable. Oracle does it like that by default.
Yes, Oracle's behavior is not necessarily what we want to emulate, but I
don't remember hearing Oracle users complain about that (and I have
heard them complain about other things).

He also said:

I don't understand. We have a versioning system for ICU collations?
Does it not work?

Well, it works in that it alerts you that you may have index corruption.
Good - but a default behavior that excludes the possibility of index
corruption after an OS upgrade would work much better for most users.

Yours,
Laurenz Albe

#11Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#8)
Re: Change initdb default to the builtin collation provider

On Tue, 2026-03-10 at 11:12 -0400, Robert Haas wrote:

I don't know if this is exactly the right proposal, but I think it's
probably appropriate to start gently pushing people towards UTF-8
rather than anything else. Unicode has largely won, AFAICT, and the
use cases for anything else are increasingly narrow. I don't think we
should try to be coercive, but there's a reasonable presumption that
people who haven't said what they want probably want UTF8.

If their environment's LC_CTYPE is UTF8-based, they already get UTF-8.
If it isn't, we can either:

(a) Fall back to LC_CTYPE=C, which is the only UTF8-compatible locale
available everywhere. C is actually not a terrible fallback: it doesn't
actually affect many things, because I have moved almost everything to
use the database default locale.

(b) Warn or error unless they explicitly specify the encoding with -E.
But the former is likely to be ignored and the latter is not what I'd
call "gentle".

Which of these do you think is the right approach?

There's narrower question about what we do with LC_CTYPE=C. Currently
we use SQL_ASCII encoding, which doesn't seem like a great default, and
we could change that to default to UTF8. And another question about
whether we change the meaning of --no-locale.

I'm much less convinced about this idea. I think the number of people
who will be unhappy about the less-user-friendly sort order changes
is
probably quite high. It's reasonable to want something more stable
and
better version-controlled than libc, but switching to a simple
code-point sort seems like a high price to pay for that.

Surely inconsistent indexes and poor performance are also a high price,
so how do you weigh the prices against each other?

We sweat over single-digit performance regressions in fairly specific
cases all the time, but here we're 3X slower for index builds:

https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/

and 2-5X slower for Sort:

/messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com

and others don't seem very concerned, so I feel like I'm missing
something.

Regards,
Jeff Davis

#12Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#11)
Re: Change initdb default to the builtin collation provider

On Tue, Mar 10, 2026 at 3:04 PM Jeff Davis <pgsql@j-davis.com> wrote:

If their environment's LC_CTYPE is UTF8-based, they already get UTF-8.
If it isn't, we can either:

(a) Fall back to LC_CTYPE=C, which is the only UTF8-compatible locale
available everywhere. C is actually not a terrible fallback: it doesn't
actually affect many things, because I have moved almost everything to
use the database default locale.

(b) Warn or error unless they explicitly specify the encoding with -E.
But the former is likely to be ignored and the latter is not what I'd
call "gentle".

Which of these do you think is the right approach?

I'm a little confused as to how this relates to what you were asking
before. I thought you were proposing to pick UTF-8 rather than
SQL_ASCII when LC_CTYPE=C, but that's not on this list of options. To
be honest, I'd probably be ready to support making the default
encoding UTF8 regardless of the environment, and you have to use -E if
you want anything else. I think there are still people using other
encodings, but I believe it to be a small minority at this point.

There's narrower question about what we do with LC_CTYPE=C. Currently
we use SQL_ASCII encoding, which doesn't seem like a great default, and
we could change that to default to UTF8. And another question about
whether we change the meaning of --no-locale.

I think SQL_ASCII is a terrible default. Nobody actually wants that
unless they're trying to get out of a sticky situation. Making it
opt-in must be right. I do not know what the question about
--no-locale is.

We sweat over single-digit performance regressions in fairly specific
cases all the time, but here we're 3X slower for index builds:

https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/

and 2-5X slower for Sort:

/messages/by-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com

and others don't seem very concerned, so I feel like I'm missing
something.

<insert shrug emoji here>

At the end of the day, we're all just guessing. My experience working
for EDB is that we have a number of customers who care about sort
order quite a lot, and we've had to sweat blood to make them happy.
And, on a personal level, I have a hard time understanding why anyone
would be OK with a sort order that puts Álvaro after Zebra instead of
between Alvaro and Beatriz, because that seems extremely frustrating.
However, these are just personal biases. I'm much more likely to hear
from the customers who care a lot about the details of how something
works than I am to hear from the customers who are perfectly happy to
take the defaults, because people who are happy don't contact support
at all and people who are unhappy about relatively normal things get
handled by support; I get the weird cases. And everybody is going to
have different experiences. Presumably, your experience is that the
indexing and sorting performance is a big concern for the users you
support, and that's why you favor prioritizing that part of the
experience. That's perfectly legitimate, but it's different from my
experience. My experience is that when I tell people they can use
collate "C" to speed up sorting, they tell me that's a stupid
workaround that doesn't give them the answers that they want, which
obviously colors my viewpoint on this question in the same way that
your experiences color yours.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Daniel Verite
daniel@manitou-mail.org
In reply to: Robert Haas (#12)
Re: Change initdb default to the builtin collation provider

Robert Haas wrote:

To be honest, I'd probably be ready to support making the default
encoding UTF8 regardless of the environment, and you have to use -E
if you want anything else. I think there are still people using
other encodings, but I believe it to be a small minority at this
point.

It would be interesting to have the point of view of Asian users about
this. Recently, the suggestion to retire GB18030 in favor of UTF-8 was
met with the objection that GB18030 was likely preferred by users from
China [1]/messages/by-id/45b4b689-0e78-4d30-a5f9-1a39d01ab2b7@ww-it.cn. Another example against UTF-8 that I found notable, is
Tatsuo Ishii mentioning that Japanese users tend use --no-locale
rather than UTF-8 locales [2]/messages/by-id/20230608.104535.2171011311090815110.t-ishii@sranhm.sra.co.jp.

Also, it's not obvious how initdb could choose an UTF-8 locale
regardless of the environment.
For instance, let's say it finds LC_ALL="fr_FR.iso885915@euro", what
would it do? Maybe look at the UTF-8 locales on the system. Here's a
subset of what it would find on my system:

C.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
...
tr_TR.utf8

From that kind of list, which locale should it pick and why?

Personally I think that ignoring the environment's LC_* for the
collations would be fine if we went for builtin/C.UTF-8 by default, as
$subject suggests. But the level of enthusiasm for that from the
community seems much lower than it would need to be for that kind of
change to be acceptable.

[1]: /messages/by-id/45b4b689-0e78-4d30-a5f9-1a39d01ab2b7@ww-it.cn
/messages/by-id/45b4b689-0e78-4d30-a5f9-1a39d01ab2b7@ww-it.cn
[2]: /messages/by-id/20230608.104535.2171011311090815110.t-ishii@sranhm.sra.co.jp
/messages/by-id/20230608.104535.2171011311090815110.t-ishii@sranhm.sra.co.jp

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

#14Jeff Davis
pgsql@j-davis.com
In reply to: Laurenz Albe (#10)
Re: Change initdb default to the builtin collation provider

On Tue, 2026-03-10 at 18:42 +0100, Laurenz Albe wrote:

I consider it acceptable.  Oracle does it like that by default.
Yes, Oracle's behavior is not necessarily what we want to emulate,
but I
don't remember hearing Oracle users complain about that (and I have
heard them complain about other things).

Thank you. That's useful indirect evidence of what a significant number
of users want, or at least what they find acceptable.

Regards,
Jeff Davis

#15Jeff Davis
pgsql@j-davis.com
In reply to: Daniel Verite (#13)
Re: Change initdb default to the builtin collation provider

On Wed, 2026-03-11 at 17:28 +0100, Daniel Verite wrote:

Also, it's not obvious how initdb could choose an UTF-8 locale
regardless of the environment.

Right.

Personally I think that ignoring the environment's LC_* for the
collations would be fine if we went for builtin/C.UTF-8 by default,
as
$subject suggests.  But the level of enthusiasm for that from the
community seems much lower than it would need to be for that kind of
change to be acceptable.

At this point in the release cycle, it seems too late for $subject to
happen in 19. But I do find this discussion useful, so I hope we can
continue while it has some attention.

Regards,
Jeff Davis

#16Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#12)
Re: Change initdb default to the builtin collation provider

On Wed, 2026-03-11 at 08:47 -0400, Robert Haas wrote:

At the end of the day, we're all just guessing.

Part of the reason for that is that changing collation is so difficult
that we have very few examples of users moving real workloads from one
collation to another.

My experience working
for EDB is that we have a number of customers who care about sort
order quite a lot, and we've had to sweat blood to make them happy.

Thank you. I have one burning question: for these users who care deeply
about sort order, which scenario best describes their needs?

(a) they mostly work in a single locale (if so, does it match their
UNIX environment?); or

(b) one locale (which one?) is good enough for a variety of locales
because even if it's not perfect, it's still better than ASCII; or

(c) they somehow partition their data by locale and use multiple
locales; or

(d) they have a variety of indexes on the same column using different
collations to satisfy queries from users in different locales

I have found it very difficult to get an answer to that question. When
I press users for details (in the sample of users I've been able to
reach), usually they back off on the need for sort order, and instead
focus on case insensitivity (in which case I suggest the builtin C.UTF-
8).

And, on a personal level, I have a hard time understanding why anyone
would be OK with a sort order that puts Álvaro after Zebra instead of
between Alvaro and Beatriz, because that seems extremely frustrating.

I tend to agree, and I wish we had a way to handle this at a
"presentation" layer rather than pushing the whole thing down into
indexes (storage layer).

In theory, pushing collation down to indexes could offer performance
advantages, but in practice humans don't read a lot of data, so a post-
processing step would be efficient in most cases.

That's perfectly legitimate, but it's different from my
experience. My experience is that when I tell people they can use
collate "C" to speed up sorting, they tell me that's a stupid
workaround that doesn't give them the answers that they want, which
obviously colors my viewpoint on this question in the same way that
your experiences color yours.

"C" is especially unappealing because it doesn't even get basic case
transformations right outside of ASCII.

Regards,
Jeff Davis

#17Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#12)
Re: Change initdb default to the builtin collation provider

On Wed, 2026-03-11 at 08:47 -0400, Robert Haas wrote:

My experience is that when I tell people they can use
collate "C" to speed up sorting, they tell me that's a stupid
workaround that doesn't give them the answers that they want, which
obviously colors my viewpoint on this question in the same way that
your experiences color yours.

That makes sense - I would be surprised if everybody were happy with
the C collation's sort order. On the other hand, I have had lots of
reports about corrupted indexes that need rebuilding (only today one
person in my course mentioned it), and I find that people don't exactly
appreciate the prospect of having to rebuild dozens of indexes after
an upgrade, when they want to keep the down time short.

My vision of a better future is like this: PostgreSQL defaults to the
C collation. People will express unhappiness about the way names
get sorted. "Easy", we tell them, "change that column's collation to
a natural language collation". They do it and are happy.

The big advantage: if you have only two or three indexes in your
database that are sorted in a collation other than C, the likelihood
for index corruption will be way lower. For example, the unique
constraint on your part number column that contains values like
'XY-1-13*' or '*P1-12_A' (which are pretty likely to be affected by
the subtle changes in libc collations) will be sorted in the C
collation, which is just fine for everybody.

This approach to collations seems to work well for Oracle users,
so why not for us?

Yours,
Laurenz Albe

#18Jeff Davis
pgsql@j-davis.com
In reply to: Laurenz Albe (#17)
Re: Change initdb default to the builtin collation provider

On Wed, 2026-03-11 at 21:05 +0100, Laurenz Albe wrote:

The big advantage: if you have only two or three indexes in your
database that are sorted in a collation other than C, the likelihood
for index corruption will be way lower.  For example, the unique
constraint on your part number column that contains values like
'XY-1-13*' or '*P1-12_A' (which are pretty likely to be affected by
the subtle changes in libc collations) will be sorted in the C
collation, which is just fine for everybody.

Agreed. The collation problems are not because it's used in the handful
of indexes where it's useful; the problems happen when it's used
everywhere.

If a collation version change is detected, and a few indexes need to be
REINDEXed, I think users understand that. But it's pretty difficult to
explain to a user that all text indexes (including primary keys) need
to be reindexed, and that there's no way to keep track of what still
needs to be done.

Regards,
Jeff Davis

#19Anders Åstrand
anders@449.se
In reply to: Robert Haas (#12)
Re: Change initdb default to the builtin collation provider

And, on a personal level, I have a hard time understanding why anyone
would be OK with a sort order that puts Álvaro after Zebra instead of
between Alvaro and Beatriz, because that seems extremely frustrating.

And I would find it extremly frustrating if Åstrand was sorted between
Alvaro and Beatriz instead of after Zebra where it belongs.

This is kind of the point. There is no generally correct way to sort
alphabetically, which is why we should force anyone wanting to do so to
make an active choice about which rules should be used. I think a
default that's obviously wrong is better than a default that's subtly
wrong in this case.

--
Anders Åstrand
Percona

#20Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#16)
Re: Change initdb default to the builtin collation provider

On Wed, Mar 11, 2026 at 2:20 PM Jeff Davis <pgsql@j-davis.com> wrote:

Part of the reason for that is that changing collation is so difficult
that we have very few examples of users moving real workloads from one
collation to another.

Yes. I think actually one of the big challenges right now is making
sure that when you initdb to do a pg_upgrade, you get the right
settings to make the upgrade work. The extent to which any given
proposal makes that better or worse is definitely a point to consider.

Thank you. I have one burning question: for these users who care deeply
about sort order, which scenario best describes their needs?

(a) they mostly work in a single locale (if so, does it match their
UNIX environment?); or

(b) one locale (which one?) is good enough for a variety of locales
because even if it's not perfect, it's still better than ASCII; or

(c) they somehow partition their data by locale and use multiple
locales; or

(d) they have a variety of indexes on the same column using different
collations to satisfy queries from users in different locales

I don't have total information, but I think they mostly use a single
locale. If they have extremely specific needs, they are likely to end
up with ICU, else they pick a glibc locale. I have no idea how likely
that glibc locale is to match their environment. I wouldn't bet on it
being the norm, but I wouldn't bet against whatever they have in the
environment being more usable than "C".

And, on a personal level, I have a hard time understanding why anyone
would be OK with a sort order that puts Álvaro after Zebra instead of
between Alvaro and Beatriz, because that seems extremely frustrating.

I tend to agree, and I wish we had a way to handle this at a
"presentation" layer rather than pushing the whole thing down into
indexes (storage layer).

In theory, pushing collation down to indexes could offer performance
advantages, but in practice humans don't read a lot of data, so a post-
processing step would be efficient in most cases.

It's tough if people have range scans. Not everybody does, but they
also don't know whether or not they will want them when they're making
setup choices. Picking a locale that matches their desired sort order
*in case* they end up using range scans in some queries feels like the
"safe" coice.

That's perfectly legitimate, but it's different from my
experience. My experience is that when I tell people they can use
collate "C" to speed up sorting, they tell me that's a stupid
workaround that doesn't give them the answers that they want, which
obviously colors my viewpoint on this question in the same way that
your experiences color yours.

"C" is especially unappealing because it doesn't even get basic case
transformations right outside of ASCII.

I completely agree. I dislike it when providers change collation
behavior because I don't really believe the narrative that people from
a particular political unit have a unified view of how sorting should
be done. For example, my native language is English, and I have a view
that Á should go between A and B even though Á is not an English
character. Google says that's not the normal English sort order and
that accented characters should be pushed to the end, but as a native
Englsh speaker I find that idea ridiculous and I can't really imagine
anyone wanting it. Perhaps my imagination is too limited. Where it
gets really subjective is with strings like alvaro, Álvaro, and
.-Alvaro. There's no rule that anyone in the United States learns in
elementary school that answers that question. We can ask users what
they want, but there's not a single right answer that everyone knows,
the way everyone knows a < b. So I feel like it would be pretty
defensible to have the default be something that is either
case-sensitive or not, that either skips nonalphabetic characters or
sorts them in code-point order or handles them in literally any other
basically sane and understandable way that somebody wants to
implement, but something that treats an accented a in the same way it
treats a smiley face emoji -- i.e. like a character that nobody knows
how to sort -- sits really poorly with me. Am I injecting too much of
my personal view into what PostgreSQL's behavior ought to be?
Possibly!

So maybe the people who are saying that defaulting to C is fine and
that making people make an explicit choice if they want something else
are right. but my personal guess is that we will make a bunch of
people unhappy. I'm not actually super-worried about the people who
have very particular ideas about what they want, because there is a
decent chance that they're already asking for exactly that thing. We
might mess them up a little bit, but they'll figure it out. What I'm
most worried about is the population of users -- which I guess to be
large -- who do not have a strong preference but won't be happy with
something as dumb as "C". If even a small fraction of users create a
database using "C" unintentionally and load a terabyte of data into it
before realizing that all their text indexes are sorting "wrong", I
suspect that's not going to be much fun. Said differently, I don't
have an enormous amount of confidence in the environment being a good
source of information about what people want, but my experience thus
far suggests to that "C" probably isn't what they want, which makes me
skeptical of making it the default.

Obviously, I could be wildly incorrect. Maybe people will just be
super-happy about faster sorting and life will be great.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#21)
#23Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Robert Haas (#20)
#24Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#22)