Choosing default collation/ctype

Started by Igor Korot21 days ago14 messagesgeneral
Jump to latest
#1Igor Korot
ikorot01@gmail.com

Hi, ALL,
In the CREATE DATABASE statement I can use encoding/collation/ctype.

I can retrieve the encoding list with:

[code]
SELECT pg_encoding_to_char( conforencoding ) AS name FROM pg_conversion
[/code]

And then I can get a list of collations/ctypes with:

[code]
SELECT collname, collencoding, collprovider collctype FROM pg_collation
[/code]

And then add a logic in my UI to switch collations/ctypes based on encoding.

However, what I wonder is:

Is there a way to select a default collation/ctype for a specific encoding?

Or maybe I'm overthinking it and I should let the user choose and if
nothing - just keep those 2 as "Default" and let the server pick it
up. However it will be weird, especially from me as a user POV.

Please advise.

Thank you.

#2Ron
ronljohnsonjr@gmail.com
In reply to: Igor Korot (#1)
Re: Choosing default collation/ctype

On Sun, May 3, 2026 at 3:52 PM Igor Korot <ikorot01@gmail.com> wrote:

Hi, ALL,
In the CREATE DATABASE statement I can use encoding/collation/ctype.

I can retrieve the encoding list with:

[code]
SELECT pg_encoding_to_char( conforencoding ) AS name FROM pg_conversion
[/code]

And then I can get a list of collations/ctypes with:

[code]
SELECT collname, collencoding, collprovider collctype FROM pg_collation
[/code]

And then add a logic in my UI to switch collations/ctypes based on
encoding.

However, what I wonder is:

Is there a way to select a default collation/ctype for a specific encoding?

Or maybe I'm overthinking it and I should let the user choose and if
nothing - just keep those 2 as "Default" and let the server pick it
up. However it will be weird, especially from me as a user POV.

You know your data, not us. The first question I'd try to is "how much
client text data is not compatible with bog-standard UTF8?"

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#3Igor Korot
ikorot01@gmail.com
In reply to: Ron (#2)
Re: Choosing default collation/ctype

Hi,

On Sun, May 3, 2026 at 3:09 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

On Sun, May 3, 2026 at 3:52 PM Igor Korot <ikorot01@gmail.com> wrote:

Hi, ALL,
In the CREATE DATABASE statement I can use encoding/collation/ctype.

I can retrieve the encoding list with:

[code]
SELECT pg_encoding_to_char( conforencoding ) AS name FROM pg_conversion
[/code]

And then I can get a list of collations/ctypes with:

[code]
SELECT collname, collencoding, collprovider collctype FROM pg_collation
[/code]

And then add a logic in my UI to switch collations/ctypes based on encoding.

However, what I wonder is:

Is there a way to select a default collation/ctype for a specific encoding?

Or maybe I'm overthinking it and I should let the user choose and if
nothing - just keep those 2 as "Default" and let the server pick it
up. However it will be weird, especially from me as a user POV.

You know your data, not us. The first question I'd try to is "how much client text data is not compatible with bog-standard UTF8?"

I don't.
Just trying to create a generic tool to use for people everywhere...

Thank you.

Show quoted text

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#4Ron
ronljohnsonjr@gmail.com
In reply to: Igor Korot (#3)
Re: Choosing default collation/ctype

On Sun, May 3, 2026 at 4:52 PM Igor Korot <ikorot01@gmail.com> wrote:

Hi,

On Sun, May 3, 2026 at 3:09 PM Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

On Sun, May 3, 2026 at 3:52 PM Igor Korot <ikorot01@gmail.com> wrote:

Hi, ALL,
In the CREATE DATABASE statement I can use encoding/collation/ctype.

I can retrieve the encoding list with:

[code]
SELECT pg_encoding_to_char( conforencoding ) AS name FROM pg_conversion
[/code]

And then I can get a list of collations/ctypes with:

[code]
SELECT collname, collencoding, collprovider collctype FROM pg_collation
[/code]

And then add a logic in my UI to switch collations/ctypes based on

encoding.

However, what I wonder is:

Is there a way to select a default collation/ctype for a specific

encoding?

Or maybe I'm overthinking it and I should let the user choose and if
nothing - just keep those 2 as "Default" and let the server pick it
up. However it will be weird, especially from me as a user POV.

You know your data, not us. The first question I'd try to is "how much

client text data is not compatible with bog-standard UTF8?"

I don't.
Just trying to create a generic tool to use for people everywhere...

Then choose UTF8.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Ron (#4)
Re: Choosing default collation/ctype

On Sun, 2026-05-03 at 17:05 -0400, Ron Johnson wrote:

Just trying to create a generic tool to use for people everywhere...

Then choose UTF8.

Right! And I recommend "C" for the collation.
(The user can override the default in column definitions where necessary.)

Yours,
Laurenz Albe

#6Igor Korot
ikorot01@gmail.com
In reply to: Ron (#4)
Re: Choosing default collation/ctype

Hi, Ron.

On Sun, May 3, 2026 at 4:05 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

On Sun, May 3, 2026 at 4:52 PM Igor Korot <ikorot01@gmail.com> wrote:

Hi,

On Sun, May 3, 2026 at 3:09 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

On Sun, May 3, 2026 at 3:52 PM Igor Korot <ikorot01@gmail.com> wrote:

Hi, ALL,
In the CREATE DATABASE statement I can use encoding/collation/ctype.

I can retrieve the encoding list with:

[code]
SELECT pg_encoding_to_char( conforencoding ) AS name FROM pg_conversion
[/code]

And then I can get a list of collations/ctypes with:

[code]
SELECT collname, collencoding, collprovider collctype FROM pg_collation
[/code]

And then add a logic in my UI to switch collations/ctypes based on encoding.

However, what I wonder is:

Is there a way to select a default collation/ctype for a specific encoding?

Or maybe I'm overthinking it and I should let the user choose and if
nothing - just keep those 2 as "Default" and let the server pick it
up. However it will be weird, especially from me as a user POV.

You know your data, not us. The first question I'd try to is "how much client text data is not compatible with bog-standard UTF8?"

I don't.
Just trying to create a generic tool to use for people everywhere...

Then choose UTF8.

Let me give you a quick run of what I'm trying to do:

In my code I have 3 combo boxes: encoding, collation and ctype.

Initially they all have a value of "Default".

Lets say a user selected "KOI8-R" as an emcoding.

What I will do is populate collation and ctype combo boxes with
values available for such encodings.
But I want to go a little further and change the values in those
to be the default collation/ctype for the "KOI8-R" encoding.

Now are you saying I should choose the one that have "UTF8"
in it?

Thank you.

Show quoted text

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#7Rob Sargent
robjsargent@gmail.com
In reply to: Igor Korot (#6)
Re: Choosing default collation/ctype

On May 3, 2026, at 3:51 PM, Igor Korot <ikorot01@gmail.com> wrote:



Then choose UTF8.

Let me give you a quick run of what I'm trying to do:

In my code I have 3 combo boxes: encoding, collation and ctype.

Initially they all have a value of "Default".

Lets say a user selected "KOI8-R" as an emcoding.

What I will do is populate collation and ctype combo boxes with
values available for such encodings.
But I want to go a little further and change the values in those
to be the default collation/ctype for the "KOI8-R" encoding.

Now are you saying I should choose the one that have "UTF8"
in it?

Thank you.

What is your user trying to do? Create a database? And you are asking for server side settings to use when you invoke a create database command on the user’s behalf? And they will understand all the possibilities and their consequences?

Or are you asking the user which settings to use for this clientside user? If I were this user I would be pissed off you didn’t make some reasonable guess for a starting point from the current environment.

#8Daniel Verite
daniel@manitou-mail.org
In reply to: Laurenz Albe (#5)
Re: Choosing default collation/ctype

Laurenz Albe wrote:

Then choose UTF8.

Right! And I recommend "C" for the collation.

Yet the "C" collation is unsuitable for handling character types
beyond ASCII.
For instance, it considers that accented letters are not letters,
so upper('été') is 'éTé' instead of 'ÉTÉ', and 'é' ~ '\w' is false.

C.UTF-8 solves that, and since Postgres 17, it's available for all operating
systems with the builtin provider.
So if you target Postgres 17+, C.UTF-8 from the builtin provider is
a better choice for UTF-8 databases than "C" .

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

#9Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Daniel Verite (#8)
Re: Choosing default collation/ctype

On Mon, 2026-05-04 at 21:34 +0200, Daniel Verite wrote:

Laurenz Albe wrote:

Then choose UTF8.

Right!  And I recommend "C" for the collation.

Yet the "C" collation is unsuitable for handling character types
beyond ASCII.
For instance, it considers that accented letters are not letters,
so upper('été') is 'éTé' instead of 'ÉTÉ', and 'é' ~ '\w' is false.

C.UTF-8 solves that, and since Postgres 17, it's available for all operating
systems with the builtin provider.
So if you target Postgres 17+, C.UTF-8 from the builtin provider is
a better choice for UTF-8 databases than "C" .

Yes, "builtin" and the "C" collation is the best default value.

Yours,
Laurenz Albe

#10Igor Korot
ikorot01@gmail.com
In reply to: Rob Sargent (#7)
Re: Choosing default collation/ctype

Hi, Rob,

On Sun, May 3, 2026 at 9:16 PM Rob Sargent <robjsargent@gmail.com> wrote:

On May 3, 2026, at 3:51 PM, Igor Korot <ikorot01@gmail.com> wrote:



Then choose UTF8.

Let me give you a quick run of what I'm trying to do:

In my code I have 3 combo boxes: encoding, collation and ctype.

Initially they all have a value of "Default".

Lets say a user selected "KOI8-R" as an emcoding.

What I will do is populate collation and ctype combo boxes with
values available for such encodings.
But I want to go a little further and change the values in those
to be the default collation/ctype for the "KOI8-R" encoding.

Now are you saying I should choose the one that have "UTF8"
in it?

Thank you.

What is your user trying to do? Create a database?

Yes - create a new DB.

And you are asking for server side settings to use when you invoke a create database command on the user’s behalf?

"CREATE DATABASE" command have an options to set encoding/collation/ctype.
I thought that if they are used - or at least encoding - it defines a
server side encoding and then
the client will perform the conversion based on conversion function or
the type of the client commecting
ODBC/libpq/psql/etc. Or maybe both.

Am I wrong here?

Also it is possible to supply only the encoding, in which case server
should choose some default values
for collation/ctype for the newly created DB.

And if I provide those in my UI instead of saying "Default" and let
the user choose if they are not satisfied
it would be great. Leaving it as "Default" is an option but it is weird.

And they will understand all the possibilities and their consequences?

"CREATE DATABASE" will be done by DB Admin. From the PG POV no other
user will be able
to do that.
And DB Admin will certainly know what they are doing. If not - they
can consult the help/manual.

Or are you asking the user which settings to use for this clientside user?

No.

If I were this user I would be pissed off you didn’t make some reasonable guess for a starting point from the current environment.

Thank you.

Show quoted text
#11Daniel Verite
daniel@manitou-mail.org
In reply to: Laurenz Albe (#9)
Re: Choosing default collation/ctype

Laurenz Albe wrote:

So if you target Postgres 17+, C.UTF-8 from the builtin provider is
a better choice for UTF-8 databases than "C" .

Yes, "builtin" and the "C" collation is the best default value.

But my point was that, no, it's not.
Let's show a concrete example with Postgres 18:

postgres=# create database dbc
template='template0'
locale_provider='builtin'
builtin_locale='C' ;
CREATE DATABASE

postgres=# \c dbc
You are now connected to database "dbc" as user "postgres".

dbc=# select upper('été');
upper
-------
éTé
(1 row)

It is not the correct uppercasing. On the other hand the "C.UTF-8"
locale, as opposed to "C", produces the correct result.

postgres=# create database dbcutf8
template='template0'
locale_provider='builtin'
builtin_locale='C.UTF-8' ;
CREATE DATABASE

postgres=# \c dbcutf8
You are now connected to database "dbcutf8" as user "postgres".

dbcutf8=# select upper('été');
upper
-------
ÉTÉ
(1 row)

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

#12Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Daniel Verite (#11)
Re: Choosing default collation/ctype

On Tue, 2026-05-05 at 13:16 +0200, Daniel Verite wrote:

Laurenz Albe wrote:

So if you target Postgres 17+, C.UTF-8 from the builtin provider is
a better choice for UTF-8 databases than "C" .

Yes, "builtin" and the "C" collation is the best default value.

But my point was that, no, it's not.
Let's show a concrete example with Postgres 18:

[...]

It is not the correct uppercasing.

That is true.
But if you are using "C.UTF-8", the semantics of upper() can change
between versions, if Unicode is upgraded. That bears a residual risk
of OS upgrades breaking indexes on upper(col).

I'd say that the small benefit of better case conversion isn't worth
the risk. I'd chose "C", and use a natural language collation explicitly
on columns where these things matter.

Yours,
Laurenz Albe

#13Daniel Verite
daniel@manitou-mail.org
In reply to: Laurenz Albe (#12)
Re: Choosing default collation/ctype

Laurenz Albe wrote:

But if you are using "C.UTF-8", the semantics of upper() can change
between versions, if Unicode is upgraded.

Oh I see. Sure, "C" is not affected by that.

That bears a residual risk
of OS upgrades breaking indexes on upper(col).

OS upgrades don't count in the case of the builtin provider, but
major Postgres upgrades, yes.

I'd say that the small benefit of better case conversion isn't worth
the risk. I'd chose "C", and use a natural language collation explicitly
on columns where these things matter.

While I understanding the reasoning, I'm of the opposite opinion.
To me the lack of Unicode support in "C" is too annoying to make
it a blanket recommendation as the default locale.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

#14Peter Eisentraut
peter_e@gmx.net
In reply to: Igor Korot (#1)
Re: Choosing default collation/ctype

On 03.05.26 21:52, Igor Korot wrote:

Is there a way to select a default collation/ctype for a specific encoding?

Or maybe I'm overthinking it and I should let the user choose and if
nothing - just keep those 2 as "Default" and let the server pick it
up. However it will be weird, especially from me as a user POV.

There is no default collation/ctype (locale) for an encoding, only the
other way around. So if a user picks as locale, let's say, de_DE@euro,
then the encoding is automatically LATIN9 (obsolete), if the locale is
de_DE.utf8, then the encoding is UTF8, and so on.