utf8 vs UTF-8

Started by Troels Arvin almost 2 years ago | 7 messages | general
#1 Troels Arvin
troels@arvin.dk

Hello,

In a Postgres installation, I have databases where the locale is
slightly different. Which one is correct? Excerpt from "psql --list":

 test1       | loc_test | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8
 test3       | troels   | UTF8     | libc            | en_US.utf8  | en_US.utf8

OS is Ubuntu.

--
Kind regards,
Troels Arvin

#2 Hans Schou
hans.schou@gmail.com
In reply to: Troels Arvin (#1)
Re: utf8 vs UTF-8

> test3 | troels | UTF8 | libc | en_US.utf8 | en_US.utf8

It is wrong but I guess it's working?

How did you create test3?

On Fri, May 17, 2024 at 2:44 PM Troels Arvin <troels@arvin.dk> wrote:

Hello,

In a Postgres installation, I have databases where the locale is
slightly different. Which one is correct? Excerpt from "psql --list":

test1 | loc_test | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8
test3 | troels | UTF8 | libc | en_US.utf8 | en_US.utf8

OS is Ubuntu.

--
Kind regards,
Troels Arvin

--
𝕳𝖆𝖓𝖘 𝕾𝖈𝖍𝖔𝖚
☏ ➁➁ ➅➃ ➇⓪ ➁⓪

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Troels Arvin (#1)
Re: utf8 vs UTF-8

Troels Arvin <troels@arvin.dk> writes:

In a Postgres installation, I have databases where the locale is
slightly different. Which one is correct? Excerpt from "psql --list":

 test1       | loc_test | UTF8     | libc            | en_US.UTF-8 | en_US.UTF-8
 test3       | troels   | UTF8     | libc            | en_US.utf8  | en_US.utf8

On most if not all platforms, both those spellings of the locale names
will be taken as valid. You might try running "locale -a" to get an
idea of which one is preferred according to your current libc
installation ... but TBH, I doubt it's worth worrying about.
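
[Editor's sketch] Besides reading the "locale -a" output, one can ask the C library directly whether it accepts a given spelling. The Python snippet below is only an illustration of that check: the helper name locale_accepted is made up for this example, and the result depends entirely on which locales are installed on the machine where it runs.

```python
import locale


def locale_accepted(name: str) -> bool:
    """Return True if the C library accepts `name` for LC_COLLATE.

    Hypothetical helper for illustration; it is not part of Postgres.
    """
    # Calling setlocale() without a second argument only queries the setting.
    saved = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Restore whatever was active before the probe.
        locale.setlocale(locale.LC_COLLATE, saved)


# Probe both spellings from the thread; output varies by system.
for spelling in ("en_US.UTF-8", "en_US.utf8"):
    print(spelling, locale_accepted(spelling))
```

If both lines report True, the installed libc accepts either spelling, which matches the expectation stated above.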

regards, tom lane

#4 Troels Arvin
troels@arvin.dk
In reply to: Hans Schou (#2)
Re: utf8 vs UTF-8

Hello,

Hans Schou wrote:

  > test3       | troels   | UTF8     | libc | en_US.utf8 |  en_US.utf8

[...]

  how did you create test3?

For this example, I specified it at creation time:

CREATE DATABASE test3 TEMPLATE template0 LOCALE 'en_US.utf8';

In the real-world example I'm working with I'm unsure how the database
was originally created.

--
Troels

#5 Troels Arvin
troels@arvin.dk
In reply to: Tom Lane (#3)
Re: utf8 vs UTF-8

Hello,

Tom Lane wrote:

  test1  | loc_test | UTF8   | libc     | en_US.UTF-8 | en_US.UTF-8
  test3  | troels   | UTF8   | libc     | en_US.utf8  | en_US.utf8

On most if not all platforms, both those spellings of the locale names
will be taken as valid.  You might try running "locale -a" to get an
idea of which one is preferred according to your current libc
installation

"locale -a" on the Ubuntu system outputs this:

  C
  C.utf8
  en_US.utf8
  POSIX

On a CentOS7 system, it's sort-of the same:

  locale -a | grep -i en_us
  en_US
  en_US.iso88591
  en_US.iso885915
  en_US.utf8

So at first, I thought en_US.utf8 would be the most correct locale
identifier. However, when I look at Postgres' own databases, they have
the slightly different locale string:

  psql --list | grep -E 'postgres|template'
  postgres  | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
  template0 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
  template1 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...

Also, when I try to create a database with "en_US.utf8" as locale
without specifying a template:

troels=# create database test4 locale 'en_US.utf8';
ERROR:  new collation (en_US.utf8) is incompatible with the collation of
the template database (en_US.UTF-8)
HINT:  Use the same collation as in the template database, or use
template0 as template.

Given the locale of Postgres' own databases and Postgres' error message,
I'm leaning to en_US.UTF-8 being the most correct locale to use. Because
why would Postgres care about it, if utf8/UTF-8 doesn't matter?

but TBH, I doubt it's worth worrying about.

But couldn't there be an issue, if for example the client's locale and
the server's locale aren't exactly the same? I'm thinking maybe the
client library has to perform unneeded translation of the stream of data
to/from the database?

--
Kind regards,
Troels

#6 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Troels Arvin (#5)
Re: utf8 vs UTF-8

On 5/18/24 07:48, Troels Arvin wrote:

Hello,

Tom Lane wrote:

  test1  | loc_test | UTF8   | libc     | en_US.UTF-8 | en_US.UTF-8
  test3  | troels   | UTF8   | libc     | en_US.utf8  | en_US.utf8

On most if not all platforms, both those spellings of the locale names
will be taken as valid.  You might try running "locale -a" to get an
idea of which one is preferred according to your current libc
installation

"locale -a" on the Ubuntu system outputs this:

  C
  C.utf8
  en_US.utf8
  POSIX

If you expand that to locale -v -a you get:

locale: en_US.utf8 archive: /usr/lib/locale/locale-archive
-------------------------------------------------------------------------------
title | English locale for the USA
source | Free Software Foundation, Inc.
address | https://www.gnu.org/software/libc/
email | bug-glibc-locales@gnu.org
language | American English
territory | United States
revision | 1.0
date | 2000-06-24
codeset | UTF-8

So at first, I thought en_US.utf8 would be the most correct locale
identifier. However, when I look at Postgres' own databases, they have
the slightly different locale string:

  psql --list | grep -E 'postgres|template'
  postgres  | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
  template0 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...
  template1 | postgres | UTF8 | libc | en_US.UTF-8 | en_US.UTF-8 | ...

Also, when I try to create a database with "en_US.utf8" as locale
without specifying a template:

troels=# create database test4 locale 'en_US.utf8';
ERROR:  new collation (en_US.utf8) is incompatible with the collation of
the template database (en_US.UTF-8)
HINT:  Use the same collation as in the template database, or use
template0 as template.

I'm going to say that is Postgres being exact to a fault.

Given the locale of Postgres' own databases and Postgres' error message,
I'm leaning to en_US.UTF-8 being the most correct locale to use. Because
why would Postgres care about it, if utf8/UTF-8 doesn't matter?

but TBH, I doubt it's worth worrying about.

But couldn't there be an issue, if for example the client's locale and
the server's locale aren't exactly the same? I'm thinking maybe the
client library has to perform unneeded translation of the stream of data
to/from the database?

--
Adrian Klaver
adrian.klaver@aklaver.com

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Adrian Klaver (#6)
Re: utf8 vs UTF-8

Adrian Klaver <adrian.klaver@aklaver.com> writes:

On 5/18/24 07:48, Troels Arvin wrote:

Also, when I try to create a database with "en_US.utf8" as locale
without specifying a template:

troels=# create database test4 locale 'en_US.utf8';
ERROR:  new collation (en_US.utf8) is incompatible with the collation of
the template database (en_US.UTF-8)
HINT:  Use the same collation as in the template database, or use
template0 as template.

I'm going to say that is Postgres being exact to a fault.

Yeah. glibc will treat those two locale names as equivalent,
and I think most if not all other libc implementations do too.
But Postgres doesn't know that so it demands exact textual
equality before assuming two locale names are equivalent.
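
[Editor's sketch] The difference between the two comparisons can be modelled in a few lines of Python. This is a simplified model and an assumption on the editor's part, not code from glibc or Postgres: it assumes the libc normalizes the codeset part of a locale name by lowercasing it and dropping punctuation (prefixing "iso" when only digits remain), while Postgres compares the full strings byte for byte. All function names here are hypothetical.

```python
def normalize_codeset(codeset: str) -> str:
    """Rough model of libc codeset normalization (an assumption, see above)."""
    # Keep only ASCII letters and digits, lowercased: "UTF-8" -> "utf8".
    name = "".join(ch.lower() for ch in codeset if ch.isascii() and ch.isalnum())
    # A purely numeric codeset such as "8859-1" is treated as "iso88591".
    if name.isdigit():
        name = "iso" + name
    return name


def split_locale(name: str) -> tuple[str, str, str]:
    """Split 'lang_TERRITORY.codeset@modifier' into its three parts."""
    rest, _, modifier = name.partition("@")
    base, _, codeset = rest.partition(".")
    return base, codeset, modifier


def same_locale_loosely(a: str, b: str) -> bool:
    """Model of the libc view: only the codeset spelling may differ."""
    base_a, cs_a, mod_a = split_locale(a)
    base_b, cs_b, mod_b = split_locale(b)
    return (base_a == base_b
            and normalize_codeset(cs_a) == normalize_codeset(cs_b)
            and mod_a == mod_b)


def same_locale_exactly(a: str, b: str) -> bool:
    """Model of the Postgres check described above: exact textual equality."""
    return a == b


print(same_locale_loosely("en_US.UTF-8", "en_US.utf8"))
print(same_locale_exactly("en_US.UTF-8", "en_US.utf8"))
```

Under this model the two spellings from the thread are the same locale for the C library but different strings for the template check, which is consistent with the error seen when creating test4 from template1.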

If this is getting in your way you could probably get away with
just UPDATE-ing pg_database to use whichever spelling you think is
preferable; the strings appearing in datcollate and datctype aren't
stored anywhere else. (But experiment in a scratch installation to
verify that ... and don't try changing them to something that you
don't know to be semantically equivalent.)

regards, tom lane