ICU Collations and Collation Updates
Good morning,
long time reader, first time writer.
Where I currently work my colleagues used libc collations before I
arrived. While using libc collations, they stumbled upon the collation
update problem after SLES updates (15.4 to 15.5) (collation version
difference for database and operating system) (paraphrased, don't have
the english message at the hand).
For an easy solution I suggested to switch to ICU collations. While
documenting the problem for older systems I realized that I did not
know enough about the problem to document why ICU collations would
solve this problem.
After reading https://www.postgresql.org/docs/17/collation.html this is
how I understand it:
When initdb creates a cluster the OS available collations are copied to
the database as database objects, listable using
select * from pg_collation;
Now, an OS collation update as part of the OS update will change the
collations available on the OS level, but not the collations that the
database uses.
Is my understanding correct then in that this way the database
collations never change, unless a manual intervention reinitialises the
collations and reindexes the database (or appropriate indexes)? How
does that process compare to other RDBMS?
Are regular collation updates deemed unnecessary for long running
database installations? Or do you people have maintenance workflows
that incorporate regular collation updates to the databases?
Thanks,
Thomas
On Mon, 2025-04-14 at 08:28 +0000, Thomas Michael Engelke wrote:
Where I currently work my colleagues used libc collations before I
arrived. While using libc collations, they stumbled upon the collation
update problem after SLES updates (15.4 to 15.5) (collation version
difference for database and operating system) (paraphrased, don't have
the english message at the hand).For an easy solution I suggested to switch to ICU collations. While
documenting the problem for older systems I realized that I did not
know enough about the problem to document why ICU collations would
solve this problem.After reading https://www.postgresql.org/docs/17/collation.html this is
how I understand it:When initdb creates a cluster the OS available collations are copied to
the database as database objects, listable usingselect * from pg_collation;
Now, an OS collation update as part of the OS update will change the
collations available on the OS level, but not the collations that the
database uses.Is my understanding correct then in that this way the database
collations never change, unless a manual intervention reinitialises the
collations and reindexes the database (or appropriate indexes)? How
does that process compare to other RDBMS?Are regular collation updates deemed unnecessary for long running
database installations? Or do you people have maintenance workflows
that incorporate regular collation updates to the databases?
PostgreSQL just copies the names and versions of the collations to the
catalog. The actual collating is done by the C or ICU library.
When you update the C library or ICU library and the version changes,
you get warned by PostgreSQL and have to rebuild indexes.
So the collations can change whenever you update the respective libraries.
You would have to build PostgreSQL yourself with a fixed version of ICU
that you never upgrade if you want to avoid the problem.
Or you start using the POSIX collation.
Yours,
Laurenz Albe
Laurenz Albe <laurenz.albe@cybertec.at> writes:
On Mon, 2025-04-14 at 08:28 +0000, Thomas Michael Engelke wrote:
Is my understanding correct then in that this way the database
collations never change, unless a manual intervention reinitialises the
collations and reindexes the database (or appropriate indexes)? How
does that process compare to other RDBMS?
When you update the C library or ICU library and the version changes,
you get warned by PostgreSQL and have to rebuild indexes.
So the collations can change whenever you update the respective libraries.
You would have to build PostgreSQL yourself with a fixed version of ICU
that you never upgrade if you want to avoid the problem.
Yeah. AIUI there are two things that ICU does better than libc here:
1. ICU has a fairly well-defined scheme for identifying collation
versions, glibc not so much. So the collation-changed warnings that
Laurenz mentions are a lot more trustworthy for ICU collations.
2. It's at least *possible* to use your own fixed-version ICU
library if you're desperate enough. I don't think that would work
too well for libc; you're stuck with what the platform provides.
regards, tom lane
Le Mon, 14 Apr 2025 10:36:40 -0400,
Tom Lane <tgl@sss.pgh.pa.us> a écrit :
[…]
2. It's at least *possible* to use your own fixed-version ICU
library if you're desperate enough. I don't think that would work
too well for libc; you're stuck with what the platform provides.
You're not so stuck with what the plateform provides as you can define your own
collation with glibc. Here a quick exemple with an ebcdic collation based on
glibc:
/messages/by-id/20230824162653.34eb3bad@karst
It's probably another level of maintenance to provide a stable UTF8 collation
using such a procedure though. But it seems possible at least.
Hi Tom, hi Laurenz
On 14 Apr 2025, at 16:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
You would have to build PostgreSQL yourself with a fixed version of ICU
that you never upgrade if you want to avoid the problem.
[...]
2. It's at least *possible* to use your own fixed-version ICU
library if you're desperate enough. I don't think that would work
too well for libc; you're stuck with what the platform provides.
That topic is interesting because I have a huge problem finding a downtime window for our applications to rebuild after the SLES upgrades. I am in the process of slowly changing everything to ICU. But limiting downtime is essential for me.
We always build the PostgreSQL software from source, so if there's a way to bake the libicu directly into the software to never change it again (beside from recompiling of course), even when building new PostgreSQL versions, I'd very much appreciate if if you could let me know how I would do that.
The necessity for reindex is a huge problem for us.
Cheers,
Paul
On Mon, 2025-04-14 at 19:24 +0200, Paul Foerster wrote:
Hi Tom, hi Laurenz
On 14 Apr 2025, at 16:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
You would have to build PostgreSQL yourself with a fixed version of ICU
that you never upgrade if you want to avoid the problem.[...]
2. It's at least *possible* to use your own fixed-version ICU
library if you're desperate enough. I don't think that would work
too well for libc; you're stuck with what the platform provides.That topic is interesting because I have a huge problem finding a downtime
window for our applications to rebuild after the SLES upgrades. I am in the
process of slowly changing everything to ICU. But limiting downtime is
essential for me.We always build the PostgreSQL software from source, so if there's a way
to bake the libicu directly into the software to never change it again
(beside from recompiling of course), even when building new PostgreSQL
versions, I'd very much appreciate if if you could let me know how I would
do that.The necessity for reindex is a huge problem for us.
You cannot "bake in into" PostgreSQL, but you can grab the ICU source,
install it in /usr/local or similar and build PostgreSQL against that.
You will have to fiddle with CFLAGS and LDFLAGS so that the build process
uses the C headers and libraries from your build of the ICU library.
Yours,
Laurenz Albe
Hi Laurenz,
On 14 Apr 2025, at 19:36, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
You cannot "bake in into" PostgreSQL, but you can grab the ICU source,
install it in /usr/local or similar and build PostgreSQL against that.
You will have to fiddle with CFLAGS and LDFLAGS so that the build process
uses the C headers and libraries from your build of the ICU library.
I think I'm going to look into this. I'm not a developer, so I'll probably fail. Building PostgreSQL from source is easy and well documented. I have scripted this. But other than that, I'll probably fail. 🤣 Still worth a try, though.
Thanks,
Paul