Custom Glibc collation version strings under LOCPATH
Hi,
One way to move to a newer glibc-based Linux distribution but keep the
locales working the same* without keeping the associated zombie C code
alive is to find the source system's collation definition source
files, compile them with the localedef on the target system and point
to the top-level directory with the environment variable LOCPATH.
That runs directly into the naivety of commit d5ac14f9's
gnu_get_libc_version() kludge. So here's a patch that allows a brave
user of that recompilation technique to drop a custom version string
into a file called one of:
* $LOCPATH/<collcollate>/LC_COLLATE.version
* $LOCPATH/<collcollate>/version
* $LOCPATH/LC_COLLATE.version
* $LOCPATH/version
This way you can make your custom locales' reported version agree with
wherever they came from to skip those mismatch warnings, at whichever
granularity suits you. Or you can design some other scheme for
labeling versions. The attached POC shows this working, though it
lacks documentation for now as I wanted to float the general idea
first.
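To make the lookup order concrete, here is a minimal standalone sketch of how the four candidate paths could be built. It is not the patch itself: the names normalize_codeset(), version_file_candidates() and VERSION_PATH_MAX are hypothetical, and the codeset normalization (".UTF-8" becomes ".utf8") follows what the attached patch does to match glibc's compiled locale directory names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define VERSION_PATH_MAX 1024	/* hypothetical limit for this sketch */

/*
 * Normalize the codeset part of a locale name in place: after the first
 * '.', keep only alphanumeric characters and lower-case them, so that
 * "en_US.UTF-8" becomes "en_US.utf8".
 */
void
normalize_codeset(char *name)
{
	char	   *src = strchr(name, '.');
	char	   *dst;

	if (src == NULL)
		return;
	dst = ++src;
	for (; *src; src++)
	{
		if (isalnum((unsigned char) *src))
			*dst++ = (char) tolower((unsigned char) *src);
	}
	*dst = '\0';
}

/*
 * Fill paths[0..3] with the candidate version files, most specific first.
 * The first file that exists would supply the version string.
 */
void
version_file_candidates(const char *locpath, const char *collcollate,
						char paths[4][VERSION_PATH_MAX])
{
	char		dir[256];

	assert(locpath != NULL && collcollate != NULL);

	snprintf(dir, sizeof(dir), "%s", collcollate);
	normalize_codeset(dir);

	snprintf(paths[0], sizeof(paths[0]), "%s/%s/LC_COLLATE.version", locpath, dir);
	snprintf(paths[1], sizeof(paths[1]), "%s/%s/version", locpath, dir);
	snprintf(paths[2], sizeof(paths[2]), "%s/LC_COLLATE.version", locpath);
	snprintf(paths[3], sizeof(paths[3]), "%s/version", locpath);
}
```

With LOCPATH=/opt/locales and a collation named en_US.UTF-8, the first candidate would be /opt/locales/en_US.utf8/LC_COLLATE.version and the last /opt/locales/version.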
My preference would be for a tool-supported way for locale components
to report their own version with a new API[1], and I hope that someone
might eventually consider writing and proposing a patch to glibc for
that. But in the meantime, I figured that users willing to compile
their own locale definitions for PostgreSQL's benefit might want to
drop their own version string into a text file. The patch has no
effect otherwise, except for a few rare and harmless open() -> ENOENT
system calls if you have defined LOCPATH without supplying a custom
version file.
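As a rough illustration of why a missing file is harmless, here is a standalone sketch of reading one candidate file. It uses plain stdio rather than PostgreSQL's OpenTransientFile() and ereport(), and the name read_version_file() and its return convention are assumptions for illustration only. A missing file is reported as "not found" so the caller can move on to the next candidate, and trailing whitespace (such as the newline an editor adds) is trimmed.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Read a version string from 'path' into 'buf', trimming trailing
 * whitespace.  Returns 1 on success, 0 if the file does not exist (the
 * caller should try the next candidate), and -1 on any other error, with
 * errno set.
 */
int
read_version_file(const char *path, char *buf, size_t bufsize)
{
	FILE	   *f;
	size_t		n;

	assert(buf != NULL && bufsize > 0);

	f = fopen(path, "r");
	if (f == NULL)
		return (errno == ENOENT) ? 0 : -1;

	n = fread(buf, 1, bufsize - 1, f);
	if (ferror(f))
	{
		int			save_errno = errno;

		fclose(f);
		errno = save_errno;
		return -1;
	}
	fclose(f);

	/* Strip trailing newline or other whitespace. */
	while (n > 0 && isspace((unsigned char) buf[n - 1]))
		n--;
	buf[n] = '\0';

	return 1;
}
```

A caller would walk the candidate list in order and fall back to gnu_get_libc_version() only if every call returned 0.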
Returning gnu_get_libc_version() when you set LOCPATH is arguably a
bug and should at the very least be suppressed, I think.
*Of course you have to make sure you know what you're doing. For
example we learned on this list of some tricky edge cases, mainly
around the treatment of Unicode-order sequences for eg C.UTF-8 which
began as buggy local patches in some distros' glibc C code, but at
least that case has been removed from our problem space by the new
built-in provider. I'm interested in hearing about other concrete
examples of the locale-recompilation technique failing to be perfect,
and getting to the bottom of them; I have yet to hear of a real world
system that fails amcheck when using locale definitions ported in this
way.
[1]: https://www.mail-archive.com/austin-group-l@opengroup.org/msg12849.html
Attachments:
0001-Load-optional-collation-version-from-glibc-LOCPATH.patch (text/x-patch; charset=US-ASCII)
From ab504665cc51814bbe0d8757d35e331fd9b6a41a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 4 Jun 2025 12:19:53 +1200
Subject: [PATCH] Load optional collation version from glibc LOCPATH.
One technique for dealing with glibc locale definition changes across
Linux distribution upgrades or migrations is to compile the locale
definitions from the source system with the target system's localedef
tool, and then point to the newly compiled locales with the LOCPATH
environment variable, with certain caveats.
Unfortunately this breaks the rather simple-minded approach in commit
d5ac14f9, which reports the new system's glibc library version for lack
of anything better. Spurious warnings about mismatched collation are
reported, defeating the goal of the LOCPATH-based upgrade/migration
technique.
Since neither POSIX nor glibc defines a way for locales to report their
version (cf FreeBSD querylocale(), ICU ucol_getVersion(), Windows
GetNLSVersionEx()), invent a way for a user of the LOCPATH technique to
supply user-defined version information. This can be used to store the
version string of the source system, or invent a new convention for
labeling collation versions. The version is read from the first text
file found in this list:
* $LOCPATH/<collcollate>/LC_COLLATE.version
* $LOCPATH/<collcollate>/version
* $LOCPATH/LC_COLLATE.version
* $LOCPATH/version
Only if none of these files are found will the glibc library's reported
version be used, as before. There is no change in behaviour for most
users, since LOCPATH is not normally defined. Non-glibc builds are not
affected. glibc itself has no knowledge of these files, which are a new
PostgreSQL invention that relies only on knowledge of how glibc's
compiled locales are laid out, including "normalization" of the codeset
part of the <collcollate> part.
Back-patch to 13, where d5ac14f9 shipped.
---
src/backend/main/main.c | 2 +
src/backend/utils/adt/pg_locale_libc.c | 126 ++++++++++++++++++++++++-
src/include/utils/pg_locale.h | 1 +
3 files changed, 127 insertions(+), 2 deletions(-)
diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index 7d63cf94a6b..7dd7060c93a 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -113,6 +113,8 @@ main(int argc, char *argv[])
MyProcPid = getpid();
MemoryContextInit();
+ pg_locale_init_libc();
+
/*
* Set reference point for stack-depth checking. (There's no point in
* enabling this before error reporting works.)
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 199857e22db..c6403dda2c8 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -19,6 +19,7 @@
#include "catalog/pg_collation.h"
#include "mb/pg_wchar.h"
#include "miscadmin.h"
+#include "storage/fd.h"
#include "utils/builtins.h"
#include "utils/formatting.h"
#include "utils/memutils.h"
@@ -33,6 +34,10 @@
#include <shlwapi.h>
#endif
+#ifdef __GLIBC__
+static char *glibc_locpath = NULL;
+#endif
+
/*
* Size of stack buffer to use for string transformations, used to avoid heap
* allocations in typical cases. This should be large enough that most strings
@@ -119,6 +124,23 @@ static const struct collate_methods collate_methods_libc_win32_utf8 = {
};
#endif
+void
+pg_locale_init_libc(void)
+{
+#ifdef __GLIBC__
+ /* https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html */
+ const char *locpath = getenv("LOCPATH");
+
+ /*
+ * Capture a copy of LOCPATH so that get_collation_actual_version_libc()
+ * can find optional user-supplied custom version strings when using a
+ * non-default tree of locale files.
+ */
+ if (locpath)
+ glibc_locpath = MemoryContextStrdup(TopMemoryContext, locpath);
+#endif
+}
+
size_t
strlower_libc(char *dst, size_t dstsize, const char *src,
ssize_t srclen, pg_locale_t locale)
@@ -655,6 +677,43 @@ strnxfrm_libc(char *dest, size_t destsize, const char *src, ssize_t srclen,
return result;
}
+#if defined(__GLIBC__)
+/*
+ * Read version string from user-supplied file into a palloc'd string with any
+ * trailing whitespace removed. Return NULL if the file doesn't exist.
+ */
+static char *
+read_collversion_from_file(const char *path)
+{
+ char buffer[TEXTBUFLEN];
+ ssize_t size;
+ int fd;
+
+ fd = OpenTransientFile(path, O_RDONLY);
+ if (fd < 0)
+ {
+ if (errno == ENOENT)
+ return NULL;
+ ereport(ERROR, errmsg("could not open file \"%s\": %m", path));
+ }
+ size = read(fd, buffer, sizeof(buffer) - 1);
+ if (size < 0)
+ {
+ int save_errno = errno;
+
+ CloseTransientFile(fd);
+ errno = save_errno;
+ ereport(ERROR, errmsg("could not read from file \"%s\": %m", path));
+ }
+ while (size > 0 && isspace((unsigned char) buffer[size - 1]))
+ size--;
+ buffer[size] = 0;
+ CloseTransientFile(fd);
+
+ return pstrdup(buffer);
+}
+#endif
+
char *
get_collation_actual_version_libc(const char *collcollate)
{
@@ -665,8 +724,71 @@ get_collation_actual_version_libc(const char *collcollate)
pg_strcasecmp("POSIX", collcollate) != 0)
{
#if defined(__GLIBC__)
- /* Use the glibc version because we don't have anything better. */
- collversion = pstrdup(gnu_get_libc_version());
+
+ /*
+ * If the user defined the environment variable LOCPATH (a glibc
+ * extension) to override the search location for locale definitions,
+ * perhaps pointing to definitions compiled from another distribution
+ * or version of glibc as part of an upgrade strategy, provide a way
+ * for the reported version string to be loaded from
+ * $LOCPATH/{collcollate}/LC_COLLATE.version, ../version, or the same
+ * names at top level in $LOCPATH.
+ *
+ * This convention is a PostgreSQL invention not known to glibc.
+ * Neither glibc nor POSIX provides a way to store or query a version
+ * string inside locale components themselves.
+ */
+ if (glibc_locpath)
+ {
+ char collcollate_dir[LOCALE_NAME_BUFLEN];
+ char pathname[MAXPGPATH];
+ char *p;
+
+ /* lower-case and digits only in codeset part, .UTF-8 -> .utf8 */
+ snprintf(collcollate_dir, sizeof(collcollate_dir), "%s",
+ collcollate);
+ p = strchr(collcollate_dir, '.');
+ if (p)
+ {
+ ++p;
+ while (*p)
+ {
+ if (!isalnum((unsigned char) *p))
+ {
+ memmove(p, p + 1, strlen(p)); /* counts terminator */
+ continue;
+ }
+ *p = tolower((unsigned char) *p);
+ ++p;
+ }
+ }
+
+ snprintf(pathname, sizeof(pathname), "%s/%s/LC_COLLATE.version",
+ glibc_locpath, collcollate_dir);
+ collversion = read_collversion_from_file(pathname);
+ if (collversion == NULL)
+ {
+ snprintf(pathname, sizeof(pathname), "%s/%s/version",
+ glibc_locpath, collcollate_dir);
+ collversion = read_collversion_from_file(pathname);
+ }
+ if (collversion == NULL)
+ {
+ snprintf(pathname, sizeof(pathname), "%s/LC_COLLATE.version",
+ glibc_locpath);
+ collversion = read_collversion_from_file(pathname);
+ }
+ if (collversion == NULL)
+ {
+ snprintf(pathname, sizeof(pathname), "%s/version",
+ glibc_locpath);
+ collversion = read_collversion_from_file(pathname);
+ }
+ }
+
+ /* Use the glibc version if we don't have anything better. */
+ if (collversion == NULL)
+ collversion = pstrdup(gnu_get_libc_version());
#elif defined(LC_VERSION_MASK)
locale_t loc;
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 7b8cbf58d2c..820937ef062 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -121,6 +121,7 @@ struct pg_locale_struct
} info;
};
+extern void pg_locale_init_libc(void);
extern void init_database_collation(void);
extern pg_locale_t pg_newlocale_from_collation(Oid collid);
--
2.39.5
On 04.06.25 06:03, Thomas Munro wrote:
One way to move to a newer glibc-based Linux distribution but keep the
locales working the same* without keeping the associated zombie C code
alive is to find the source system's collation definition source
files, compile them with the localedef on the target system and point
to the top-level directory with the environment variable LOCPATH.
That runs directly into the naivety of commit d5ac14f9's
gnu_get_libc_version() kludge. So here's a patch that allows a brave
user of that recompilation technique to drop a custom version string
into a file called one of:
* $LOCPATH/<collcollate>/LC_COLLATE.version
* $LOCPATH/<collcollate>/version
* $LOCPATH/LC_COLLATE.version
* $LOCPATH/version
Nice idea.
The patch looks mostly straightforward.
I wonder why you want to capture LOCPATH early in main.c. It seems
sufficient to look it up when needed?
On Wed, Jun 4, 2025 at 9:17 PM Peter Eisentraut <peter@eisentraut.org> wrote:
I wonder why you want to capture LOCPATH early in main.c. It seems
sufficient to look it up when needed?
Right, it is setenv() that we're trying to avoid. Updated.
Attachments:
v2-0001-Load-optional-collation-version-from-glibc-LOCPAT.patch (text/x-patch; charset=US-ASCII)
From 5482ccf5a61061411f9a996da84f14471b791d83 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 4 Jun 2025 12:19:53 +1200
Subject: [PATCH v2] Load optional collation version from glibc LOCPATH.
One technique for dealing with glibc locale definition changes across
Linux distribution upgrades or migrations is to compile the locale
definitions from the source system with the target system's localedef
tool, and then point to the newly compiled locales with the LOCPATH
environment variable, with certain caveats.
Unfortunately this breaks the rather simple-minded approach in commit
d5ac14f9, which reports the new system's glibc library version for lack
of anything better. Spurious warnings about mismatched collation are
reported, defeating the goal of the LOCPATH-based upgrade/migration
technique.
Since neither POSIX nor glibc defines a way for locales to report their
version (cf FreeBSD querylocale(), ICU ucol_getVersion(), Windows
GetNLSVersionEx()), invent a way for a user of the LOCPATH technique to
supply the version. This can be used to store the version string of the
source system, or to invent a new convention for labeling collation
versions. The version is read from the first text file found in this
list:
* $LOCPATH/<collcollate>/LC_COLLATE.version
* $LOCPATH/<collcollate>/version
* $LOCPATH/LC_COLLATE.version
* $LOCPATH/version
Only if none of these files are found will the glibc library version be
used, as before. There is no change in behavior for most users, since
LOCPATH is not normally defined. Non-glibc builds are not affected.
glibc itself has no knowledge of these files, which are a PostgreSQL
invention that relies only on knowledge of how glibc's compiled locales
are laid out, including "normalization" of the codeset part of the
<collcollate> part.
Back-patch to 13, where d5ac14f9 shipped.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/CA%2BhUKG%2BUngA4H%3DYtsz6iiz_xAzqG3JX9eC9CBSzpubfRz9gYeQ%40mail.gmail.com
---
src/backend/utils/adt/pg_locale_libc.c | 106 ++++++++++++++++++++++++-
1 file changed, 104 insertions(+), 2 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 199857e22db..a09492c05e7 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -19,6 +19,7 @@
#include "catalog/pg_collation.h"
#include "mb/pg_wchar.h"
#include "miscadmin.h"
+#include "storage/fd.h"
#include "utils/builtins.h"
#include "utils/formatting.h"
#include "utils/memutils.h"
@@ -655,6 +656,43 @@ strnxfrm_libc(char *dest, size_t destsize, const char *src, ssize_t srclen,
return result;
}
+#if defined(__GLIBC__)
+/*
+ * Read version string from user-supplied file into a palloc'd string with any
+ * trailing whitespace removed. Return NULL if the file doesn't exist.
+ */
+static char *
+read_collversion_from_file(const char *path)
+{
+ char buffer[TEXTBUFLEN];
+ ssize_t size;
+ int fd;
+
+ fd = OpenTransientFile(path, O_RDONLY);
+ if (fd < 0)
+ {
+ if (errno == ENOENT)
+ return NULL;
+ ereport(ERROR, errmsg("could not open file \"%s\": %m", path));
+ }
+ size = read(fd, buffer, sizeof(buffer) - 1);
+ if (size < 0)
+ {
+ int save_errno = errno;
+
+ CloseTransientFile(fd);
+ errno = save_errno;
+ ereport(ERROR, errmsg("could not read from file \"%s\": %m", path));
+ }
+ while (size > 0 && isspace((unsigned char) buffer[size - 1]))
+ size--;
+ buffer[size] = 0;
+ CloseTransientFile(fd);
+
+ return pstrdup(buffer);
+}
+#endif
+
char *
get_collation_actual_version_libc(const char *collcollate)
{
@@ -665,8 +703,72 @@ get_collation_actual_version_libc(const char *collcollate)
pg_strcasecmp("POSIX", collcollate) != 0)
{
#if defined(__GLIBC__)
- /* Use the glibc version because we don't have anything better. */
- collversion = pstrdup(gnu_get_libc_version());
+ char *locpath;
+
+ /*
+ * If the user defined the environment variable LOCPATH (a glibc
+ * extension) to override the search location for locale definitions,
+ * perhaps pointing to definitions compiled from another distribution
+ * or version of glibc as part of an upgrade strategy, provide a way
+ * for the reported version string to be loaded from
+ * $LOCPATH/{collcollate}/LC_COLLATE.version, ../version, or the same
+ * names at top level in $LOCPATH.
+ *
+ * This convention is a PostgreSQL invention not known to glibc.
+ * Neither glibc nor POSIX provides a way to store or query a version
+ * string inside locale components themselves.
+ */
+ if ((locpath = getenv("LOCPATH")))
+ {
+ char collcollate_dir[LOCALE_NAME_BUFLEN];
+ char pathname[MAXPGPATH];
+ char *p;
+
+ /* lower-case and digits only in codeset part, .UTF-8 -> .utf8 */
+ snprintf(collcollate_dir, sizeof(collcollate_dir), "%s",
+ collcollate);
+ p = strchr(collcollate_dir, '.');
+ if (p)
+ {
+ ++p;
+ while (*p)
+ {
+ if (!isalnum((unsigned char) *p))
+ {
+ memmove(p, p + 1, strlen(p)); /* counts terminator */
+ continue;
+ }
+ *p = tolower((unsigned char) *p);
+ ++p;
+ }
+ }
+
+ snprintf(pathname, sizeof(pathname), "%s/%s/LC_COLLATE.version",
+ locpath, collcollate_dir);
+ collversion = read_collversion_from_file(pathname);
+ if (collversion == NULL)
+ {
+ snprintf(pathname, sizeof(pathname), "%s/%s/version",
+ locpath, collcollate_dir);
+ collversion = read_collversion_from_file(pathname);
+ }
+ if (collversion == NULL)
+ {
+ snprintf(pathname, sizeof(pathname), "%s/LC_COLLATE.version",
+ locpath);
+ collversion = read_collversion_from_file(pathname);
+ }
+ if (collversion == NULL)
+ {
+ snprintf(pathname, sizeof(pathname), "%s/version",
+ locpath);
+ collversion = read_collversion_from_file(pathname);
+ }
+ }
+
+ /* Use the glibc version if we don't have anything better. */
+ if (collversion == NULL)
+ collversion = pstrdup(gnu_get_libc_version());
#elif defined(LC_VERSION_MASK)
locale_t loc;
--
2.39.5
On 6/4/25 00:03, Thomas Munro wrote:
One way to move to a newer glibc-based Linux distribution but keep the
locales working the same* without keeping the associated zombie C code
alive is to find the source system's collation definition source
files, compile them with the localedef on the target system and point
to the top-level directory with the environment variable LOCPATH.
I don't think this works in all cases because I have seen where sorting
was affected by C code rather than data changes.
--
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com
On 6/4/25 09:52, Joe Conway wrote:
On 6/4/25 00:03, Thomas Munro wrote:
One way to move to a newer glibc-based Linux distribution but keep the
locales working the same* without keeping the associated zombie C code
alive is to find the source system's collation definition source
files, compile them with the localedef on the target system and point
to the top-level directory with the environment variable LOCPATH.
I don't think this works in all cases because I have seen where sorting
was affected by C code rather than data changes.
Sorry I missed this part:
I'm interested in hearing about other concrete
examples of the locale-recompilation technique failing to be perfect,
and getting to the bottom of them; I have yet to hear of a real world
system that fails amcheck when using locale definitions ported in this
way.
If you go from anything pre-glibc-2.21 to post-glibc-2.21 I think you
will find that even with the same data files you get a different sort.
The same patch that caused the performance regression [1] (still present
in up-to-date glibc) also causes changes in sort order via C code alone.
[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=18441
--
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com
On Thu, Jun 5, 2025 at 3:44 AM Joe Conway <mail@joeconway.com> wrote:
On 6/4/25 09:52, Joe Conway wrote:
On 6/4/25 00:03, Thomas Munro wrote:
I'm interested in hearing about other concrete
examples of the locale-recompilation technique failing to be perfect,
and getting to the bottom of them; I have yet to hear of a real world
system that fails amcheck when using locale definitions ported in this
way.
If you go from anything pre-glibc-2.21 to post-glibc-2.21 I think you
will find that even with the same data files you get a different sort.
The same patch that caused the performance regression [1] (still present
in up-to-date glibc) also causes changes in sort order via C code alone.
Will try. And BTW I fully understand that your work on running parts
of pinned old glibc libraries is a bug-perfect solution to this. But
I also want to explore other trade-off positions, for users who don't
want to run unmaintained C code. In exchange for that paranoia you
have C code changes, intentional or unintentional, and I'd really like
to understand them better... One thing that is definitely out of the
question is moving the compiled LC_COLLATE files between glibc
versions (the binary format clearly changes; sometimes it apparently
works, sometimes it doesn't at all). That leads to the idea of
recompiling with localedef. The source formats are standardised by
POSIX and *should* have the same meaning to any system, so now maybe
we're only talking about bugs (in theory, you should even be able to
move the source between unrelated Unixen, but I only care about glibc
here, and I have no doubt that there are extensions and quirks so
reality may fail to live up to the theory completely). I've
personally analysed only one such case and chased it all the way down,
which is the support for strict codepoint ordering and the non-strict
local fudges that Debian et al shipped in some version range, so we
can't even really blame it on glibc, and yet it is/was in the wild so
we can't ignore it (thanks to Jeff for making that one irrelevant).
Finding more cases probably involves running something a little like
Jeremy's torture tests across a huge gallery of versions and
combinations of cross-version recompiled definitions. Or something
like that...
On 6/4/25 19:35, Thomas Munro wrote:
On Thu, Jun 5, 2025 at 3:44 AM Joe Conway <mail@joeconway.com> wrote:
If you go from anything pre-glibc-2.21 to post-glibc-2.21 I think you
will find that even with the same data files you get a different sort.
The same patch that caused the performance regression [1] (still present
in up-to-date glibc) also causes changes in sort order via C code alone.
Finding more cases probably involves running something a little like
Jeremy's torture tests across a huge gallery of versions and
combinations of cross-version recompiled definitions. Or something
like that...
Sounds like great fun!
;-)
--
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com